[PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
@ 2023-11-17 13:07 Yi Liu
  2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.

Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>

This series adds the cache invalidation path for the userspace to invalidate
cache after modifying the stage-1 page table. This is based on the first part
of nesting [1]

Complete code can be found in [2], QEMU could can be found in [3].

At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.

[1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Change log:

v6:
 - No much change, just rebase on top of 6.7-rc1 as part 1/2 is merged

v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
 - Split the iommufd nesting series into two parts of alloc_user and
   invalidation (Jason)
 - Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
   do the same with the structures/alloc()/abort()/destroy(). Reworked the
   selftest accordingly too. (Jason)
 - Move hwpt/data_type into struct iommu_user_data from standalone op
   arguments. (Jason)
 - Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
   _TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
 - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
 - Add macro to the iommu_copy_struct_from_user() to calculate min_size
   (Jason)
 - Fix two bugs spotted by ZhaoYan

v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
 - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
   and kernel-managed HWPTs
 - Rework invalidate uAPI to be a multi-request array-based design
 - Add a struct iommu_user_data_array and a helper for driver to sanitize
   and copy the entry data from user space invalidation array
 - Add a patch fixing TEST_LENGTH() in selftest program
 - Drop IOMMU_RESV_IOVA_RANGES patches
 - Update kdoc and inline comments
 - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
   this does not change the rule that resv regions should only be added to the
   kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
   as it is needed only by SMMU so far.

v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
 - Add new uAPI things in alphabetical order
 - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
   sanity, replacing the previous op->domain_alloc_user_data_len solution
 - Return ERR_PTR from domain_alloc_user instead of NULL
 - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
 - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
   userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
   page table). (Kevin)
 - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
 - Minor changes per Kevin's inputs

v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
 - Add union iommu_domain_user_data to include all user data structures to avoid
   passing void * in kernel APIs.
 - Add iommu op to return user data length for user domain allocation
 - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
 - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
 - Convert cache_invalidate_user op to be int instead of void
 - Remove @data_type in struct iommu_hwpt_invalidate
 - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1

v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/

Thanks,
	Yi Liu

Lu Baolu (1):
  iommu: Add cache_invalidate_user op

Nicolin Chen (4):
  iommu: Add iommu_copy_struct_from_user_array helper
  iommufd/selftest: Add mock_domain_cache_invalidate_user support
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

Yi Liu (1):
  iommufd: Add IOMMU_HWPT_INVALIDATE

 drivers/iommu/iommufd/hw_pagetable.c          | 35 ++++++++
 drivers/iommu/iommufd/iommufd_private.h       |  9 ++
 drivers/iommu/iommufd/iommufd_test.h          | 22 +++++
 drivers/iommu/iommufd/main.c                  |  3 +
 drivers/iommu/iommufd/selftest.c              | 69 +++++++++++++++
 include/linux/iommu.h                         | 84 +++++++++++++++++++
 include/uapi/linux/iommufd.h                  | 35 ++++++++
 tools/testing/selftests/iommu/iommufd.c       | 75 +++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
 9 files changed, 395 insertions(+)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
@ 2023-11-17 13:07 ` Yi Liu
  2023-11-20  7:53   ` Tian, Kevin
                     ` (2 more replies)
  2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

From: Lu Baolu <baolu.lu@linux.intel.com>

The updates of the PTEs in the nested page table will be propagated to the
hardware caches on both IOMMU (IOTLB) and devices (DevTLB/ATC).

Add a new domain op cache_invalidate_user for the userspace to flush the
hardware caches for a nested domain through iommufd. No wrapper for it,
as it's only supposed to be used by iommufd. Then, pass in invalidation
requests in form of a user data array conatining a number of invalidation
data entries.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/iommu.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ec289c1016f5..0c1ff7fe4fa1 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -284,6 +284,24 @@ struct iommu_user_data {
 	size_t len;
 };
 
+/**
+ * struct iommu_user_data_array - iommu driver specific user space data array
+ * @type: The data type of all the entries in the user buffer array
+ * @uptr: Pointer to the user buffer array for copy_from_user()
+ * @entry_len: The fixed-width length of a entry in the array, in bytes
+ * @entry_num: The number of total entries in the array
+ *
+ * A array having a @entry_num number of @entry_len sized entries, each entry is
+ * user space data, an uAPI defined in include/uapi/linux/iommufd.h where @type
+ * is also defined as enum iommu_xyz_data_type.
+ */
+struct iommu_user_data_array {
+	unsigned int type;
+	void __user *uptr;
+	size_t entry_len;
+	int entry_num;
+};
+
 /**
  * __iommu_copy_struct_from_user - Copy iommu driver specific user space data
  * @dst_data: Pointer to an iommu driver specific user data that is defined in
@@ -440,6 +458,15 @@ struct iommu_ops {
  * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
  * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *            queue
+ * @cache_invalidate_user: Flush hardware cache for user space IO page table.
+ *                         The @domain must be IOMMU_DOMAIN_NESTED. The @array
+ *                         passes in the cache invalidation requests, in form
+ *                         of a driver data structure. The driver must update
+ *                         array->entry_num to report the number of handled
+ *                         invalidation requests. The 32-bit @error_code can
+ *                         forward a driver specific error code to user space.
+ *                         Both the driver data structure and the error code
+ *                         must be defined in include/uapi/linux/iommufd.h
  * @iova_to_phys: translate iova to physical address
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
  *                           including no-snoop TLPs on PCIe or other platform
@@ -465,6 +492,9 @@ struct iommu_domain_ops {
 			      size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain,
 			   struct iommu_iotlb_gather *iotlb_gather);
+	int (*cache_invalidate_user)(struct iommu_domain *domain,
+				     struct iommu_user_data_array *array,
+				     u32 *error_code);
 
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
 				    dma_addr_t iova);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
@ 2023-11-20  7:53   ` Tian, Kevin
  2023-12-06 18:32   ` Jason Gunthorpe
  2024-01-08  7:32   ` Binbin Wu
  2 siblings, 0 replies; 93+ messages in thread
From: Tian, Kevin @ 2023-11-20  7:53 UTC (permalink / raw)
  To: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, Liu, Yi L, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Yi Liu <yi.l.liu@intel.com>
> Sent: Friday, November 17, 2023 9:07 PM
> 
> +/**
> + * struct iommu_user_data_array - iommu driver specific user space data
> array
> + * @type: The data type of all the entries in the user buffer array
> + * @uptr: Pointer to the user buffer array for copy_from_user()

remove "for copy_from_user()"

> + * @entry_len: The fixed-width length of a entry in the array, in bytes

s/a/an/

> + * @entry_num: The number of total entries in the array
> + *
> + * A array having a @entry_num number of @entry_len sized entries, each
> entry is
> + * user space data, an uAPI defined in include/uapi/linux/iommufd.h where
> @type
> + * is also defined as enum iommu_xyz_data_type.

* The user buffer array has @entry_num entries, each with a fixed size
 * @entry_len. The entry format and related type are defined in
 * include/uapi/linux/iommufd.h

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
  2023-11-20  7:53   ` Tian, Kevin
@ 2023-12-06 18:32   ` Jason Gunthorpe
  2023-12-06 18:43     ` Nicolin Chen
  2024-01-08  7:32   ` Binbin Wu
  2 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-06 18:32 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On Fri, Nov 17, 2023 at 05:07:12AM -0800, Yi Liu wrote:
> +/**
> + * struct iommu_user_data_array - iommu driver specific user space data array
> + * @type: The data type of all the entries in the user buffer array
> + * @uptr: Pointer to the user buffer array for copy_from_user()
> + * @entry_len: The fixed-width length of a entry in the array, in bytes
> + * @entry_num: The number of total entries in the array
> + *
> + * A array having a @entry_num number of @entry_len sized entries, each entry is
> + * user space data, an uAPI defined in include/uapi/linux/iommufd.h where @type
> + * is also defined as enum iommu_xyz_data_type.
> + */
> +struct iommu_user_data_array {
> +	unsigned int type;
> +	void __user *uptr;
> +	size_t entry_len;
> +	int entry_num;

These are u32 in the uapi, they should probably be u32 here
too. Otherwise we have to worry about truncation.

> @@ -465,6 +492,9 @@ struct iommu_domain_ops {
>  			      size_t size);
>  	void (*iotlb_sync)(struct iommu_domain *domain,
>  			   struct iommu_iotlb_gather *iotlb_gather);
> +	int (*cache_invalidate_user)(struct iommu_domain *domain,
> +				     struct iommu_user_data_array *array,
> +				     u32 *error_code);

Regarding the other conversation I worry a u32 error_code is too small.

Unfortunately there is no obvious place to put something better so if
we reach it we will have to add more error_code space via normal
extension.

Maybe expand this to u64? That is 64 bits of error register data and
the consumer index. It should do for SMMUv3 at least?

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-12-06 18:32   ` Jason Gunthorpe
@ 2023-12-06 18:43     ` Nicolin Chen
  2023-12-06 18:50       ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-06 18:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, zhenzhong.duan,
	joao.m.martins, xin.zeng, yan.y.zhao

On Wed, Dec 06, 2023 at 02:32:09PM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:12AM -0800, Yi Liu wrote:
 
> > @@ -465,6 +492,9 @@ struct iommu_domain_ops {
> >  			      size_t size);
> >  	void (*iotlb_sync)(struct iommu_domain *domain,
> >  			   struct iommu_iotlb_gather *iotlb_gather);
> > +	int (*cache_invalidate_user)(struct iommu_domain *domain,
> > +				     struct iommu_user_data_array *array,
> > +				     u32 *error_code);
> 
> Regarding the other conversation I worry a u32 error_code is too small.
> 
> Unfortunately there is no obvious place to put something better so if
> we reach it we will have to add more error_code space via normal
> extension.
> 
> Maybe expand this to u64? That is 64 bits of error register data and
> the consumer index. It should do for SMMUv3 at least?

I think Yi is moving the error_code to the entry data structure,
where we can even define a list of error_codes as a driver data
needs. So, I assume this u32 pointer would be gone too.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-12-06 18:43     ` Nicolin Chen
@ 2023-12-06 18:50       ` Jason Gunthorpe
  2023-12-07  6:53         ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-06 18:50 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, zhenzhong.duan,
	joao.m.martins, xin.zeng, yan.y.zhao

On Wed, Dec 06, 2023 at 10:43:34AM -0800, Nicolin Chen wrote:
> On Wed, Dec 06, 2023 at 02:32:09PM -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 17, 2023 at 05:07:12AM -0800, Yi Liu wrote:
>  
> > > @@ -465,6 +492,9 @@ struct iommu_domain_ops {
> > >  			      size_t size);
> > >  	void (*iotlb_sync)(struct iommu_domain *domain,
> > >  			   struct iommu_iotlb_gather *iotlb_gather);
> > > +	int (*cache_invalidate_user)(struct iommu_domain *domain,
> > > +				     struct iommu_user_data_array *array,
> > > +				     u32 *error_code);
> > 
> > Regarding the other conversation I worry a u32 error_code is too small.
> > 
> > Unfortunately there is no obvious place to put something better so if
> > we reach it we will have to add more error_code space via normal
> > extension.
> > 
> > Maybe expand this to u64? That is 64 bits of error register data and
> > the consumer index. It should do for SMMUv3 at least?
> 
> I think Yi is moving the error_code to the entry data structure,
> where we can even define a list of error_codes as a driver data
> needs. So, I assume this u32 pointer would be gone too.

Oh, lets see that then..

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-12-06 18:50       ` Jason Gunthorpe
@ 2023-12-07  6:53         ` Yi Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Yi Liu @ 2023-12-07  6:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, zhenzhong.duan,
	joao.m.martins, xin.zeng, yan.y.zhao

On 2023/12/7 02:50, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 10:43:34AM -0800, Nicolin Chen wrote:
>> On Wed, Dec 06, 2023 at 02:32:09PM -0400, Jason Gunthorpe wrote:
>>> On Fri, Nov 17, 2023 at 05:07:12AM -0800, Yi Liu wrote:
>>   
>>>> @@ -465,6 +492,9 @@ struct iommu_domain_ops {
>>>>   			      size_t size);
>>>>   	void (*iotlb_sync)(struct iommu_domain *domain,
>>>>   			   struct iommu_iotlb_gather *iotlb_gather);
>>>> +	int (*cache_invalidate_user)(struct iommu_domain *domain,
>>>> +				     struct iommu_user_data_array *array,
>>>> +				     u32 *error_code);
>>>
>>> Regarding the other conversation I worry a u32 error_code is too small.
>>>
>>> Unfortunately there is no obvious place to put something better so if
>>> we reach it we will have to add more error_code space via normal
>>> extension.
>>>
>>> Maybe expand this to u64? That is 64 bits of error register data and
>>> the consumer index. It should do for SMMUv3 at least?
>>
>> I think Yi is moving the error_code to the entry data structure,
>> where we can even define a list of error_codes as a driver data
>> needs. So, I assume this u32 pointer would be gone too.
> 
> Oh, lets see that then..

yes, I'm going to move it.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 1/6] iommu: Add cache_invalidate_user op
  2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
  2023-11-20  7:53   ` Tian, Kevin
  2023-12-06 18:32   ` Jason Gunthorpe
@ 2024-01-08  7:32   ` Binbin Wu
  2 siblings, 0 replies; 93+ messages in thread
From: Binbin Wu @ 2024-01-08  7:32 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao



On 11/17/2023 9:07 PM, Yi Liu wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
>
> The updates of the PTEs in the nested page table will be propagated to the
> hardware caches on both IOMMU (IOTLB) and devices (DevTLB/ATC).
>
> Add a new domain op cache_invalidate_user for the userspace to flush the
> hardware caches for a nested domain through iommufd. No wrapper for it,
> as it's only supposed to be used by iommufd. Then, pass in invalidation
> requests in form of a user data array conatining a number of invalidation

s/conatining/containing/
> data entries.
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
  2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
@ 2023-11-17 13:07 ` Yi Liu
  2023-11-20  8:09   ` Tian, Kevin
                     ` (3 more replies)
  2023-11-17 13:07 ` [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper Yi Liu
                   ` (5 subsequent siblings)
  7 siblings, 4 replies; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

In nested translation, the stage-1 page table is user-managed but cached
by the IOMMU hardware, so an update on present page table entries in the
stage-1 page table should be followed with a cache invalidation.

Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
support multiple invalidation requests in one ioctl.

Check cache_invalidate_user op in the iommufd_hw_pagetable_alloc_nested,
since all nested domains need that.

Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 35 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  9 +++++++
 drivers/iommu/iommufd/main.c            |  3 +++
 include/uapi/linux/iommufd.h            | 35 +++++++++++++++++++++++++
 4 files changed, 82 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 2abbeafdbd22..367459d92f69 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -238,6 +238,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
 		rc = -EINVAL;
 		goto out_abort;
 	}
+	/* Driver is buggy by missing cache_invalidate_user in domain_ops */
+	if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user)) {
+		rc = -EINVAL;
+		goto out_abort;
+	}
 	return hwpt_nested;
 
 out_abort:
@@ -370,4 +375,34 @@ int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd)
 
 	iommufd_put_object(&hwpt_paging->common.obj);
 	return rc;
+};
+
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
+	struct iommu_user_data_array data_array = {
+		.type = cmd->req_type,
+		.uptr = u64_to_user_ptr(cmd->reqs_uptr),
+		.entry_len = cmd->req_len,
+		.entry_num = cmd->req_num,
+	};
+	struct iommufd_hw_pagetable *hwpt;
+	int rc = 0;
+
+	if (cmd->req_type == IOMMU_HWPT_DATA_NONE)
+		return -EINVAL;
+	if (!cmd->reqs_uptr || !cmd->req_len || !cmd->req_num)
+		return -EINVAL;
+
+	hwpt = iommufd_hw_pagetable_get_nested(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain, &data_array,
+						      &cmd->out_driver_error_code);
+	cmd->req_num = data_array.entry_num;
+	if (iommufd_ucmd_respond(ucmd, sizeof(*cmd)))
+		return -EFAULT;
+	iommufd_put_object(&hwpt->obj);
+	return rc;
 }
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index a74cfefffbc6..160521800d9b 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -301,6 +301,7 @@ void iommufd_hwpt_paging_abort(struct iommufd_object *obj);
 void iommufd_hwpt_nested_destroy(struct iommufd_object *obj);
 void iommufd_hwpt_nested_abort(struct iommufd_object *obj);
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
 
 static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 					    struct iommufd_hw_pagetable *hwpt)
@@ -318,6 +319,14 @@ static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 	refcount_dec(&hwpt->obj.users);
 }
 
+static inline struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_get_nested(struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_HWPT_NESTED),
+			    struct iommufd_hw_pagetable, obj);
+}
+
 struct iommufd_group {
 	struct kref ref;
 	struct mutex lock;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 45b9d40773b1..6edef860f91c 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -309,6 +309,7 @@ union ucmd_buffer {
 	struct iommu_hwpt_alloc hwpt;
 	struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
 	struct iommu_hwpt_set_dirty_tracking set_dirty_tracking;
+	struct iommu_hwpt_invalidate cache;
 	struct iommu_ioas_alloc alloc;
 	struct iommu_ioas_allow_iovas allow_iovas;
 	struct iommu_ioas_copy ioas_copy;
@@ -348,6 +349,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 struct iommu_hwpt_get_dirty_bitmap, data),
 	IOCTL_OP(IOMMU_HWPT_SET_DIRTY_TRACKING, iommufd_hwpt_set_dirty_tracking,
 		 struct iommu_hwpt_set_dirty_tracking, __reserved),
+	IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate,
+		 struct iommu_hwpt_invalidate, out_driver_error_code),
 	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
 		 struct iommu_ioas_alloc, out_ioas_id),
 	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0b2bc6252e2c..7f92cecc87d7 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -49,6 +49,7 @@ enum {
 	IOMMUFD_CMD_GET_HW_INFO,
 	IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING,
 	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
+	IOMMUFD_CMD_HWPT_INVALIDATE,
 };
 
 /**
@@ -613,4 +614,38 @@ struct iommu_hwpt_get_dirty_bitmap {
 #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
 					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
 
+/**
+ * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
+ * @size: sizeof(struct iommu_hwpt_invalidate)
+ * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
+ * @reqs_uptr: User pointer to an array having @req_num of cache invalidation
+ *             requests. The request entries in the array are of fixed width
+ *             @req_len, and contain a user data structure for invalidation
+ *             request specific to the given hardware page table.
+ * @req_type: One of enum iommu_hwpt_data_type, defining the data type of all
+ *            the entries in the invalidation request array. It should suit
+ *            with the data_type passed per the allocation of the hwpt pointed
+ *            by @hwpt_id.
+ * @req_len: Length (in bytes) of a request entry in the request array
+ * @req_num: Input the number of cache invalidation requests in the array.
+ *           Output the number of requests successfully handled by kernel.
+ * @out_driver_error_code: Report a driver speicifc error code upon failure.
+ *                         It's optional, driver has a choice to fill it or
+ *                         not.
+ *
+ * Invalidate the iommu cache for user-managed page table. Modifications on a
+ * user-managed page table should be followed by this operation to sync cache.
+ * Each ioctl can support one or more cache invalidation requests in the array
+ * that has a total size of @req_len * @req_num.
+ */
+struct iommu_hwpt_invalidate {
+	__u32 size;
+	__u32 hwpt_id;
+	__aligned_u64 reqs_uptr;
+	__u32 req_type;
+	__u32 req_len;
+	__u32 req_num;
+	__u32 out_driver_error_code;
+};
+#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
@ 2023-11-20  8:09   ` Tian, Kevin
  2023-11-20  8:29     ` Yi Liu
  2023-11-21  5:02   ` Baolu Lu
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-11-20  8:09 UTC (permalink / raw)
  To: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, November 17, 2023 9:07 PM
> +
> +	hwpt = iommufd_hw_pagetable_get_nested(ucmd, cmd->hwpt_id);
> +	if (IS_ERR(hwpt))
> +		return PTR_ERR(hwpt);
> +
> +	rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain,
> &data_array,
> +						      &cmd-
> >out_driver_error_code);
> +	cmd->req_num = data_array.entry_num;
> +	if (iommufd_ucmd_respond(ucmd, sizeof(*cmd)))
> +		return -EFAULT;
> +	iommufd_put_object(&hwpt->obj);

the object is not put when ucmd response fails. It can be put right
after calling @cache_invalidate_user()

> @@ -309,6 +309,7 @@ union ucmd_buffer {
>  	struct iommu_hwpt_alloc hwpt;
>  	struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
>  	struct iommu_hwpt_set_dirty_tracking set_dirty_tracking;
> +	struct iommu_hwpt_invalidate cache;

In alphabetic order this should be put before "hwpt_set_dirty"

>  	struct iommu_ioas_alloc alloc;
>  	struct iommu_ioas_allow_iovas allow_iovas;
>  	struct iommu_ioas_copy ioas_copy;
> @@ -348,6 +349,8 @@ static const struct iommufd_ioctl_op
> iommufd_ioctl_ops[] = {
>  		 struct iommu_hwpt_get_dirty_bitmap, data),
>  	IOCTL_OP(IOMMU_HWPT_SET_DIRTY_TRACKING,
> iommufd_hwpt_set_dirty_tracking,
>  		 struct iommu_hwpt_set_dirty_tracking, __reserved),
> +	IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate,
> +		 struct iommu_hwpt_invalidate, out_driver_error_code),

ditto

> +/**
> + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
> + * @size: sizeof(struct iommu_hwpt_invalidate)
> + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation

remove the first 'HWPT'

> + * @reqs_uptr: User pointer to an array having @req_num of cache
> invalidation
> + *             requests. The request entries in the array are of fixed width
> + *             @req_len, and contain a user data structure for invalidation
> + *             request specific to the given hardware page table.
> + * @req_type: One of enum iommu_hwpt_data_type, defining the data
> type of all
> + *            the entries in the invalidation request array. It should suit
> + *            with the data_type passed per the allocation of the hwpt pointed
> + *            by @hwpt_id.

"It should match the data_type given to the specified hwpt"

> + * @req_len: Length (in bytes) of a request entry in the request array
> + * @req_num: Input the number of cache invalidation requests in the array.
> + *           Output the number of requests successfully handled by kernel.
> + * @out_driver_error_code: Report a driver speicifc error code upon failure.
> + *                         It's optional, driver has a choice to fill it or
> + *                         not.

Being optional how does the user tell whether the code is filled or not?

> + *
> + * Invalidate the iommu cache for user-managed page table. Modifications
> on a
> + * user-managed page table should be followed by this operation to sync
> cache.
> + * Each ioctl can support one or more cache invalidation requests in the
> array
> + * that has a total size of @req_len * @req_num.
> + */
> +struct iommu_hwpt_invalidate {
> +	__u32 size;
> +	__u32 hwpt_id;
> +	__aligned_u64 reqs_uptr;
> +	__u32 req_type;
> +	__u32 req_len;
> +	__u32 req_num;
> +	__u32 out_driver_error_code;
> +};

Probably 'data_uptr', 'data_type', "entry_len' and 'entry_num" read slightly
better.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-20  8:09   ` Tian, Kevin
@ 2023-11-20  8:29     ` Yi Liu
  2023-11-20  8:34       ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-11-20  8:29 UTC (permalink / raw)
  To: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/11/20 16:09, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, November 17, 2023 9:07 PM
>> +
>> +	hwpt = iommufd_hw_pagetable_get_nested(ucmd, cmd->hwpt_id);
>> +	if (IS_ERR(hwpt))
>> +		return PTR_ERR(hwpt);
>> +
>> +	rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain,
>> &data_array,
>> +						      &cmd-
>>> out_driver_error_code);
>> +	cmd->req_num = data_array.entry_num;
>> +	if (iommufd_ucmd_respond(ucmd, sizeof(*cmd)))
>> +		return -EFAULT;
>> +	iommufd_put_object(&hwpt->obj);
> 
> the object is not put when ucmd response fails. It can be put right
> after calling @cache_invalidate_user()

yes.

>> @@ -309,6 +309,7 @@ union ucmd_buffer {
>>   	struct iommu_hwpt_alloc hwpt;
>>   	struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
>>   	struct iommu_hwpt_set_dirty_tracking set_dirty_tracking;
>> +	struct iommu_hwpt_invalidate cache;
> 
> In alphabetic order this should be put before "hwpt_set_dirty"

yes.

>>   	struct iommu_ioas_alloc alloc;
>>   	struct iommu_ioas_allow_iovas allow_iovas;
>>   	struct iommu_ioas_copy ioas_copy;
>> @@ -348,6 +349,8 @@ static const struct iommufd_ioctl_op
>> iommufd_ioctl_ops[] = {
>>   		 struct iommu_hwpt_get_dirty_bitmap, data),
>>   	IOCTL_OP(IOMMU_HWPT_SET_DIRTY_TRACKING,
>> iommufd_hwpt_set_dirty_tracking,
>>   		 struct iommu_hwpt_set_dirty_tracking, __reserved),
>> +	IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate,
>> +		 struct iommu_hwpt_invalidate, out_driver_error_code),
> 
> ditto
> 
>> +/**
>> + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
>> + * @size: sizeof(struct iommu_hwpt_invalidate)
>> + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
> 
> remove the first 'HWPT'

>> + * @reqs_uptr: User pointer to an array having @req_num of cache
>> invalidation
>> + *             requests. The request entries in the array are of fixed width
>> + *             @req_len, and contain a user data structure for invalidation
>> + *             request specific to the given hardware page table.
>> + * @req_type: One of enum iommu_hwpt_data_type, defining the data
>> type of all
>> + *            the entries in the invalidation request array. It should suit
>> + *            with the data_type passed per the allocation of the hwpt pointed
>> + *            by @hwpt_id.
> 
> "It should match the data_type given to the specified hwpt"

above comments are well received. :)

>> + * @req_len: Length (in bytes) of a request entry in the request array
>> + * @req_num: Input the number of cache invalidation requests in the array.
>> + *           Output the number of requests successfully handled by kernel.
>> + * @out_driver_error_code: Report a driver speicifc error code upon failure.
>> + *                         It's optional, driver has a choice to fill it or
>> + *                         not.
> 
> Being optional how does the user tell whether the code is filled or not?

seems like we need a flag for it. otherwise, a reserved special value is
required. or is it enough to just document it that this field is available
or not per the iommu_hw_info_type?

>> + *
>> + * Invalidate the iommu cache for user-managed page table. Modifications
>> on a
>> + * user-managed page table should be followed by this operation to sync
>> cache.
>> + * Each ioctl can support one or more cache invalidation requests in the
>> array
>> + * that has a total size of @req_len * @req_num.
>> + */
>> +struct iommu_hwpt_invalidate {
>> +	__u32 size;
>> +	__u32 hwpt_id;
>> +	__aligned_u64 reqs_uptr;
>> +	__u32 req_type;
>> +	__u32 req_len;
>> +	__u32 req_num;
>> +	__u32 out_driver_error_code;
>> +};
> 
> Probably 'data_uptr', 'data_type', "entry_len' and 'entry_num" read slightly
> better.

sure.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-20  8:29     ` Yi Liu
@ 2023-11-20  8:34       ` Tian, Kevin
  2023-11-20 17:36         ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-11-20  8:34 UTC (permalink / raw)
  To: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, November 20, 2023 4:30 PM
> 
> On 2023/11/20 16:09, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Friday, November 17, 2023 9:07 PM
> >> + * @req_len: Length (in bytes) of a request entry in the request array
> >> + * @req_num: Input the number of cache invalidation requests in the
> array.
> >> + *           Output the number of requests successfully handled by kernel.
> >> + * @out_driver_error_code: Report a driver speicifc error code upon
> failure.
> >> + *                         It's optional, driver has a choice to fill it or
> >> + *                         not.
> >
> > Being optional how does the user tell whether the code is filled or not?
> 
> seems like we need a flag for it. otherwise, a reserved special value is
> required. or is it enough to just document it that this field is available
> or not per the iommu_hw_info_type?
> 

No guarantee that a reserved special value applies to all vendors.

I'll just remove the optional part. the viommu driver knows what a valid
code is, if the driver fills it.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-20  8:34       ` Tian, Kevin
@ 2023-11-20 17:36         ` Nicolin Chen
  2023-11-21  2:50           ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-20 17:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Mon, Nov 20, 2023 at 08:34:58AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, November 20, 2023 4:30 PM
> >
> > On 2023/11/20 16:09, Tian, Kevin wrote:
> > >> From: Liu, Yi L <yi.l.liu@intel.com>
> > >> Sent: Friday, November 17, 2023 9:07 PM
> > >> + * @req_len: Length (in bytes) of a request entry in the request array
> > >> + * @req_num: Input the number of cache invalidation requests in the
> > array.
> > >> + *           Output the number of requests successfully handled by kernel.
> > >> + * @out_driver_error_code: Report a driver speicifc error code upon
> > failure.
> > >> + *                         It's optional, driver has a choice to fill it or
> > >> + *                         not.
> > >
> > > Being optional how does the user tell whether the code is filled or not?

Well, naming it "error_code" indicates zero means no error while
non-zero means something? An error return from this ioctl could
also tell the user space to look up for this driver error code,
if it ever cares.

> > seems like we need a flag for it. otherwise, a reserved special value is
> > required. or is it enough to just document it that this field is available
> > or not per the iommu_hw_info_type?
> >
> 
> No guarantee that a reserved special value applies to all vendors.
> 
> I'll just remove the optional part. the viommu driver knows what a valid
> code is, if the driver fills it.

Hmm, remove out_driver_error_code? Do you mean by reusing ioctl
error code to tell the user space what driver error code it gets?

Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-20 17:36         ` Nicolin Chen
@ 2023-11-21  2:50           ` Tian, Kevin
  2023-11-21  5:24             ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-11-21  2:50 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 21, 2023 1:37 AM
> 
> On Mon, Nov 20, 2023 at 08:34:58AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, November 20, 2023 4:30 PM
> > >
> > > On 2023/11/20 16:09, Tian, Kevin wrote:
> > > >> From: Liu, Yi L <yi.l.liu@intel.com>
> > > >> Sent: Friday, November 17, 2023 9:07 PM
> > > >> + * @req_len: Length (in bytes) of a request entry in the request array
> > > >> + * @req_num: Input the number of cache invalidation requests in the
> > > array.
> > > >> + *           Output the number of requests successfully handled by
> kernel.
> > > >> + * @out_driver_error_code: Report a driver speicifc error code upon
> > > failure.
> > > >> + *                         It's optional, driver has a choice to fill it or
> > > >> + *                         not.
> > > >
> > > > Being optional how does the user tell whether the code is filled or not?
> 
> Well, naming it "error_code" indicates zero means no error while
> non-zero means something? An error return from this ioctl could
> also tell the user space to look up for this driver error code,
> if it ever cares.

probably over-thinking but I'm not sure whether zero is guaranteed to
mean no error in all implementations...

> 
> > > seems like we need a flag for it. otherwise, a reserved special value is
> > > required. or is it enough to just document it that this field is available
> > > or not per the iommu_hw_info_type?
> > >
> >
> > No guarantee that a reserved special value applies to all vendors.
> >
> > I'll just remove the optional part. the viommu driver knows what a valid
> > code is, if the driver fills it.
> 
> Hmm, remove out_driver_error_code? Do you mean by reusing ioctl
> error code to tell the user space what driver error code it gets?
> 

No. I just meant to remove the last sentence "It's optional ..."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-21  2:50           ` Tian, Kevin
@ 2023-11-21  5:24             ` Nicolin Chen
  2023-11-24  2:36               ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-21  5:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Nov 21, 2023 at 02:50:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 21, 2023 1:37 AM
> >
> > On Mon, Nov 20, 2023 at 08:34:58AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Monday, November 20, 2023 4:30 PM
> > > >
> > > > On 2023/11/20 16:09, Tian, Kevin wrote:
> > > > >> From: Liu, Yi L <yi.l.liu@intel.com>
> > > > >> Sent: Friday, November 17, 2023 9:07 PM
> > > > >> + * @req_len: Length (in bytes) of a request entry in the request array
> > > > >> + * @req_num: Input the number of cache invalidation requests in the
> > > > array.
> > > > >> + *           Output the number of requests successfully handled by
> > kernel.
> > > > >> + * @out_driver_error_code: Report a driver speicifc error code upon
> > > > failure.
> > > > >> + *                         It's optional, driver has a choice to fill it or
> > > > >> + *                         not.
> > > > >
> > > > > Being optional how does the user tell whether the code is filled or not?
> >
> > Well, naming it "error_code" indicates zero means no error while
> > non-zero means something? An error return from this ioctl could
> > also tell the user space to look up for this driver error code,
> > if it ever cares.
> 
> probably over-thinking but I'm not sure whether zero is guaranteed to
> mean no error in all implementations...

Well, you are right. Usually HW conveniently raises a flag in a
register to indicate something wrong, yet it is probably unsafe
to say it definitely.

> > > > seems like we need a flag for it. otherwise, a reserved special value is
> > > > required. or is it enough to just document it that this field is available
> > > > or not per the iommu_hw_info_type?
> > > >
> > >
> > > No guarantee that a reserved special value applies to all vendors.
> > >
> > > I'll just remove the optional part. the viommu driver knows what a valid
> > > code is, if the driver fills it.
> >
> > Hmm, remove out_driver_error_code? Do you mean by reusing ioctl
> > error code to tell the user space what driver error code it gets?
> >
> 
> No. I just meant to remove the last sentence "It's optional ..."

OK. Would it force all IOMMU drivers to report something upon
an error/failure?

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-21  5:24             ` Nicolin Chen
@ 2023-11-24  2:36               ` Tian, Kevin
  2023-11-27 19:53                 ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-11-24  2:36 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 21, 2023 1:24 PM
> 
> On Tue, Nov 21, 2023 at 02:50:05AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, November 21, 2023 1:37 AM
> > >
> > > On Mon, Nov 20, 2023 at 08:34:58AM +0000, Tian, Kevin wrote:
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Monday, November 20, 2023 4:30 PM
> > > > >
> > > > > On 2023/11/20 16:09, Tian, Kevin wrote:
> > > > > >> From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > >> Sent: Friday, November 17, 2023 9:07 PM
> > > > > >> + * @req_len: Length (in bytes) of a request entry in the request
> array
> > > > > >> + * @req_num: Input the number of cache invalidation requests in
> the
> > > > > array.
> > > > > >> + *           Output the number of requests successfully handled by
> > > kernel.
> > > > > >> + * @out_driver_error_code: Report a driver speicifc error code
> upon
> > > > > failure.
> > > > > >> + *                         It's optional, driver has a choice to fill it or
> > > > > >> + *                         not.
> > > > > >
> > > > > > Being optional how does the user tell whether the code is filled or
> not?
> > >
> > > Well, naming it "error_code" indicates zero means no error while
> > > non-zero means something? An error return from this ioctl could
> > > also tell the user space to look up for this driver error code,
> > > if it ever cares.
> >
> > probably over-thinking but I'm not sure whether zero is guaranteed to
> > mean no error in all implementations...
> 
> Well, you are right. Usually HW conveniently raises a flag in a
> register to indicate something wrong, yet it is probably unsafe
> to say it definitely.
> 

this reminds me one open. What about an implementation having
a hierarchical error code layout e.g. one main error register with
each bit representing an error category then multiple error code
registers each for one error category? In this case probably
a single out_driver_error_code cannot carry that raw information.

Instead the iommu driver may need to define a customized error
code convention in uapi header which is converted from the
raw error information.

From this angle should we simply say that the error code definition
must be included in the uapi header? If raw error information can
be carried by this field then this hw can simply say that the error
code format is same as the hw spec defines.

With that explicit information then the viommu can easily tell
whether error code is filled or not based on its own convention.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-24  2:36               ` Tian, Kevin
@ 2023-11-27 19:53                 ` Nicolin Chen
  2023-11-28  6:01                   ` Yi Liu
  2023-11-28  8:03                   ` Tian, Kevin
  0 siblings, 2 replies; 93+ messages in thread
From: Nicolin Chen @ 2023-11-27 19:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Fri, Nov 24, 2023 at 02:36:29AM +0000, Tian, Kevin wrote:

> > > > > > >> + * @out_driver_error_code: Report a driver speicifc error code
> > upon
> > > > > > failure.
> > > > > > >> + *                         It's optional, driver has a choice to fill it or
> > > > > > >> + *                         not.
> > > > > > >
> > > > > > > Being optional how does the user tell whether the code is filled or
> > not?
> > > >
> > > > Well, naming it "error_code" indicates zero means no error while
> > > > non-zero means something? An error return from this ioctl could
> > > > also tell the user space to look up for this driver error code,
> > > > if it ever cares.
> > >
> > > probably over-thinking but I'm not sure whether zero is guaranteed to
> > > mean no error in all implementations...
> >
> > Well, you are right. Usually HW conveniently raises a flag in a
> > register to indicate something wrong, yet it is probably unsafe
> > to say it definitely.
> >
> 
> this reminds me one open. What about an implementation having
> a hierarchical error code layout e.g. one main error register with
> each bit representing an error category then multiple error code
> registers each for one error category? In this case probably
> a single out_driver_error_code cannot carry that raw information.

Hmm, good point.

> Instead the iommu driver may need to define a customized error
> code convention in uapi header which is converted from the
> raw error information.
> 
> From this angle should we simply say that the error code definition
> must be included in the uapi header? If raw error information can
> be carried by this field then this hw can simply say that the error
> code format is same as the hw spec defines.
> 
> With that explicit information then the viommu can easily tell
> whether error code is filled or not based on its own convention.

That'd be to put this error_code field into the driver uAPI
structure right?

I also thought about making this out_driver_error_code per HW.
Yet, an error can be either per array or per entry/quest. The
array-related error should be reported in the array structure
that is a core uAPI, v.s. the per-HW entry structure. Though
we could still report an array error in the entry structure
at the first entry (or indexed by "array->entry_num")?

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-27 19:53                 ` Nicolin Chen
@ 2023-11-28  6:01                   ` Yi Liu
  2023-11-29  0:54                     ` Nicolin Chen
  2023-11-28  8:03                   ` Tian, Kevin
  1 sibling, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-11-28  6:01 UTC (permalink / raw)
  To: Nicolin Chen, Tian, Kevin
  Cc: joro@8bytes.org, alex.williamson@redhat.com, jgg@nvidia.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/11/28 03:53, Nicolin Chen wrote:
> On Fri, Nov 24, 2023 at 02:36:29AM +0000, Tian, Kevin wrote:
> 
>>>>>>>>> + * @out_driver_error_code: Report a driver speicifc error code
>>> upon
>>>>>>> failure.
>>>>>>>>> + *                         It's optional, driver has a choice to fill it or
>>>>>>>>> + *                         not.
>>>>>>>>
>>>>>>>> Being optional how does the user tell whether the code is filled or
>>> not?
>>>>>
>>>>> Well, naming it "error_code" indicates zero means no error while
>>>>> non-zero means something? An error return from this ioctl could
>>>>> also tell the user space to look up for this driver error code,
>>>>> if it ever cares.
>>>>
>>>> probably over-thinking but I'm not sure whether zero is guaranteed to
>>>> mean no error in all implementations...
>>>
>>> Well, you are right. Usually HW conveniently raises a flag in a
>>> register to indicate something wrong, yet it is probably unsafe
>>> to say it definitely.
>>>
>>
>> this reminds me one open. What about an implementation having
>> a hierarchical error code layout e.g. one main error register with
>> each bit representing an error category then multiple error code
>> registers each for one error category? In this case probably
>> a single out_driver_error_code cannot carry that raw information.
> 
> Hmm, good point.
> 
>> Instead the iommu driver may need to define a customized error
>> code convention in uapi header which is converted from the
>> raw error information.
>>
>>  From this angle should we simply say that the error code definition
>> must be included in the uapi header? If raw error information can
>> be carried by this field then this hw can simply say that the error
>> code format is same as the hw spec defines.
>>
>> With that explicit information then the viommu can easily tell
>> whether error code is filled or not based on its own convention.
> 
> That'd be to put this error_code field into the driver uAPI
> structure right?

looks to be. Then it would be convenient to reserve a code for
the case of no error (either no error happened or just not used)

> 
> I also thought about making this out_driver_error_code per HW.
> Yet, an error can be either per array or per entry/quest. The
> array-related error should be reported in the array structure
> that is a core uAPI, v.s. the per-HW entry structure. Though
> we could still report an array error in the entry structure
> at the first entry (or indexed by "array->entry_num")?

per-entry error code seems like to be a completion code. Each
entry in the array can have a corresponding code (0 for succ,
others for failure). do you already have such a need?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-28  6:01                   ` Yi Liu
@ 2023-11-29  0:54                     ` Nicolin Chen
  0 siblings, 0 replies; 93+ messages in thread
From: Nicolin Chen @ 2023-11-29  0:54 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Nov 28, 2023 at 02:01:59PM +0800, Yi Liu wrote:
> On 2023/11/28 03:53, Nicolin Chen wrote:
> > On Fri, Nov 24, 2023 at 02:36:29AM +0000, Tian, Kevin wrote:
> > 
> > > > > > > > > > + * @out_driver_error_code: Report a driver speicifc error code
> > > > upon
> > > > > > > > failure.
> > > > > > > > > > + *                         It's optional, driver has a choice to fill it or
> > > > > > > > > > + *                         not.
> > > > > > > > > 
> > > > > > > > > Being optional how does the user tell whether the code is filled or
> > > > not?
> > > > > > 
> > > > > > Well, naming it "error_code" indicates zero means no error while
> > > > > > non-zero means something? An error return from this ioctl could
> > > > > > also tell the user space to look up for this driver error code,
> > > > > > if it ever cares.
> > > > > 
> > > > > probably over-thinking but I'm not sure whether zero is guaranteed to
> > > > > mean no error in all implementations...
> > > > 
> > > > Well, you are right. Usually HW conveniently raises a flag in a
> > > > register to indicate something wrong, yet it is probably unsafe
> > > > to say it definitely.
> > > > 
> > > 
> > > this reminds me one open. What about an implementation having
> > > a hierarchical error code layout e.g. one main error register with
> > > each bit representing an error category then multiple error code
> > > registers each for one error category? In this case probably
> > > a single out_driver_error_code cannot carry that raw information.
> > 
> > Hmm, good point.
> > 
> > > Instead the iommu driver may need to define a customized error
> > > code convention in uapi header which is converted from the
> > > raw error information.
> > > 
> > >  From this angle should we simply say that the error code definition
> > > must be included in the uapi header? If raw error information can
> > > be carried by this field then this hw can simply say that the error
> > > code format is same as the hw spec defines.
> > > 
> > > With that explicit information then the viommu can easily tell
> > > whether error code is filled or not based on its own convention.
> > 
> > That'd be to put this error_code field into the driver uAPI
> > structure right?
> 
> looks to be. Then it would be convenient to reserve a code for
> the case of no error (either no error happened or just not used)
> 
> > 
> > I also thought about making this out_driver_error_code per HW.
> > Yet, an error can be either per array or per entry/quest. The
> > array-related error should be reported in the array structure
> > that is a core uAPI, v.s. the per-HW entry structure. Though
> > we could still report an array error in the entry structure
> > at the first entry (or indexed by "array->entry_num")?
> 
> per-entry error code seems like to be a completion code. Each
> entry in the array can have a corresponding code (0 for succ,
> others for failure). do you already have such a need?

Yes, SMMU can report a PREFETCH error if reading an queue/array
itself fails, and a ILLCMD error if the command is illegal.

We can move the error field to driver specific entry structure.
On the other hand, if something about the array fails, we just
return -EIO.

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-27 19:53                 ` Nicolin Chen
  2023-11-28  6:01                   ` Yi Liu
@ 2023-11-28  8:03                   ` Tian, Kevin
  2023-11-29  0:51                     ` Nicolin Chen
  1 sibling, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-11-28  8:03 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 28, 2023 3:53 AM
> 
> On Fri, Nov 24, 2023 at 02:36:29AM +0000, Tian, Kevin wrote:
> 
> > > > > > > >> + * @out_driver_error_code: Report a driver speicifc error code
> > > upon
> > > > > > > failure.
> > > > > > > >> + *                         It's optional, driver has a choice to fill it or
> > > > > > > >> + *                         not.
> > > > > > > >
> > > > > > > > Being optional how does the user tell whether the code is filled
> or
> > > not?
> > > > >
> > > > > Well, naming it "error_code" indicates zero means no error while
> > > > > non-zero means something? An error return from this ioctl could
> > > > > also tell the user space to look up for this driver error code,
> > > > > if it ever cares.
> > > >
> > > > probably over-thinking but I'm not sure whether zero is guaranteed to
> > > > mean no error in all implementations...
> > >
> > > Well, you are right. Usually HW conveniently raises a flag in a
> > > register to indicate something wrong, yet it is probably unsafe
> > > to say it definitely.
> > >
> >
> > this reminds me one open. What about an implementation having
> > a hierarchical error code layout e.g. one main error register with
> > each bit representing an error category then multiple error code
> > registers each for one error category? In this case probably
> > a single out_driver_error_code cannot carry that raw information.
> 
> Hmm, good point.
> 
> > Instead the iommu driver may need to define a customized error
> > code convention in uapi header which is converted from the
> > raw error information.
> >
> > From this angle should we simply say that the error code definition
> > must be included in the uapi header? If raw error information can
> > be carried by this field then this hw can simply say that the error
> > code format is same as the hw spec defines.
> >
> > With that explicit information then the viommu can easily tell
> > whether error code is filled or not based on its own convention.
> 
> That'd be to put this error_code field into the driver uAPI
> structure right?
> 
> I also thought about making this out_driver_error_code per HW.
> Yet, an error can be either per array or per entry/quest. The
> array-related error should be reported in the array structure
> that is a core uAPI, v.s. the per-HW entry structure. Though
> we could still report an array error in the entry structure
> at the first entry (or indexed by "array->entry_num")?
> 

why would there be an array error? array is just a software
entity containing actual HW invalidation cmds. If there is
any error with the array itself it should be reported via
ioctl errno.

Jason, how about your opinion? I didn't spot big issues
except this one. Hope it can make into 6.8.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-28  8:03                   ` Tian, Kevin
@ 2023-11-29  0:51                     ` Nicolin Chen
  2023-11-29  0:57                       ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-29  0:51 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Nov 28, 2023 at 08:03:35AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 28, 2023 3:53 AM
> >
> > On Fri, Nov 24, 2023 at 02:36:29AM +0000, Tian, Kevin wrote:
> >
> > > > > > > > >> + * @out_driver_error_code: Report a driver speicifc error code
> > > > upon
> > > > > > > > failure.
> > > > > > > > >> + *                         It's optional, driver has a choice to fill it or
> > > > > > > > >> + *                         not.
> > > > > > > > >
> > > > > > > > > Being optional how does the user tell whether the code is filled
> > or
> > > > not?
> > > > > >
> > > > > > Well, naming it "error_code" indicates zero means no error while
> > > > > > non-zero means something? An error return from this ioctl could
> > > > > > also tell the user space to look up for this driver error code,
> > > > > > if it ever cares.
> > > > >
> > > > > probably over-thinking but I'm not sure whether zero is guaranteed to
> > > > > mean no error in all implementations...
> > > >
> > > > Well, you are right. Usually HW conveniently raises a flag in a
> > > > register to indicate something wrong, yet it is probably unsafe
> > > > to say it definitely.
> > > >
> > >
> > > this reminds me one open. What about an implementation having
> > > a hierarchical error code layout e.g. one main error register with
> > > each bit representing an error category then multiple error code
> > > registers each for one error category? In this case probably
> > > a single out_driver_error_code cannot carry that raw information.
> >
> > Hmm, good point.
> >
> > > Instead the iommu driver may need to define a customized error
> > > code convention in uapi header which is converted from the
> > > raw error information.
> > >
> > > From this angle should we simply say that the error code definition
> > > must be included in the uapi header? If raw error information can
> > > be carried by this field then this hw can simply say that the error
> > > code format is same as the hw spec defines.
> > >
> > > With that explicit information then the viommu can easily tell
> > > whether error code is filled or not based on its own convention.
> >
> > That'd be to put this error_code field into the driver uAPI
> > structure right?
> >
> > I also thought about making this out_driver_error_code per HW.
> > Yet, an error can be either per array or per entry/quest. The
> > array-related error should be reported in the array structure
> > that is a core uAPI, v.s. the per-HW entry structure. Though
> > we could still report an array error in the entry structure
> > at the first entry (or indexed by "array->entry_num")?
> >
> 
> why would there be an array error? array is just a software
> entity containing actual HW invalidation cmds. If there is
> any error with the array itself it should be reported via
> ioctl errno.

User array reading is a software operation, but kernel array
reading is a hardware operation that can raise an error when
the memory location to the array is incorrect or so.

With that being said, I think errno (-EIO) could do the job,
as you suggested too.

Thanks
Nic

> Jason, how about your opinion? I didn't spot big issues
> except this one. Hope it can make into 6.8.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-29  0:51                     ` Nicolin Chen
@ 2023-11-29  0:57                       ` Jason Gunthorpe
  2023-11-29  1:09                         ` Nicolin Chen
  2023-12-01  3:51                         ` Yi Liu
  0 siblings, 2 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-11-29  0:57 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
> > > I also thought about making this out_driver_error_code per HW.
> > > Yet, an error can be either per array or per entry/quest. The
> > > array-related error should be reported in the array structure
> > > that is a core uAPI, v.s. the per-HW entry structure. Though
> > > we could still report an array error in the entry structure
> > > at the first entry (or indexed by "array->entry_num")?
> > >
> > 
> > why would there be an array error? array is just a software
> > entity containing actual HW invalidation cmds. If there is
> > any error with the array itself it should be reported via
> > ioctl errno.
> 
> User array reading is a software operation, but kernel array
> reading is a hardware operation that can raise an error when
> the memory location to the array is incorrect or so.

Well, we shouldn't get into a situation like that.. By the time the HW
got the address it should be valid.

> With that being said, I think errno (-EIO) could do the job,
> as you suggested too.

Do we have any idea what HW failures can be generated by the commands
this will execture? IIRC I don't remember seeing any smmu specific
codes related to invalid invalidation? Everything is a valid input?

Can vt-d fail single commands? What about AMD?

> > Jason, how about your opinion? I didn't spot big issues
> > except this one. Hope it can make into 6.8.

Yes, lets try

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-29  0:57                       ` Jason Gunthorpe
@ 2023-11-29  1:09                         ` Nicolin Chen
  2023-11-29 19:58                           ` Jason Gunthorpe
  2023-12-01  3:51                         ` Yi Liu
  1 sibling, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-29  1:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Nov 28, 2023 at 08:57:15PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
> > > > I also thought about making this out_driver_error_code per HW.
> > > > Yet, an error can be either per array or per entry/quest. The
> > > > array-related error should be reported in the array structure
> > > > that is a core uAPI, v.s. the per-HW entry structure. Though
> > > > we could still report an array error in the entry structure
> > > > at the first entry (or indexed by "array->entry_num")?
> > > >
> > > 
> > > why would there be an array error? array is just a software
> > > entity containing actual HW invalidation cmds. If there is
> > > any error with the array itself it should be reported via
> > > ioctl errno.
> > 
> > User array reading is a software operation, but kernel array
> > reading is a hardware operation that can raise an error when
> > the memory location to the array is incorrect or so.
> 
> Well, we shouldn't get into a situation like that.. By the time the HW
> got the address it should be valid.

Oh, that's true. I was trying to say that out_driver_error_code
was to mimic such a queue validation for user space if an error
happens to the array.

> > With that being said, I think errno (-EIO) could do the job,
> > as you suggested too.
> 
> Do we have any idea what HW failures can be generated by the commands
> this will execture? IIRC I don't remember seeing any smmu specific
> codes related to invalid invalidation? Everything is a valid input?

"7.1 Command queue errors" has the info.

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-29  1:09                         ` Nicolin Chen
@ 2023-11-29 19:58                           ` Jason Gunthorpe
  2023-11-29 22:07                             ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-11-29 19:58 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Nov 28, 2023 at 05:09:07PM -0800, Nicolin Chen wrote:

> > > With that being said, I think errno (-EIO) could do the job,
> > > as you suggested too.
> > 
> > Do we have any idea what HW failures can be generated by the commands
> > this will execture? IIRC I don't remember seeing any smmu specific
> > codes related to invalid invalidation? Everything is a valid input?
> 
> "7.1 Command queue errors" has the info.

Hmm CERROR_ATC_INV_SYNC needs to be forwarded to the guest somehow

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-29 19:58                           ` Jason Gunthorpe
@ 2023-11-29 22:07                             ` Nicolin Chen
  2023-11-30  0:08                               ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-29 22:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Wed, Nov 29, 2023 at 03:58:04PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 05:09:07PM -0800, Nicolin Chen wrote:
> 
> > > > With that being said, I think errno (-EIO) could do the job,
> > > > as you suggested too.
> > > 
> > > Do we have any idea what HW failures can be generated by the commands
> > > this will execture? IIRC I don't remember seeing any smmu specific
> > > codes related to invalid invalidation? Everything is a valid input?
> > 
> > "7.1 Command queue errors" has the info.
> 
> Hmm CERROR_ATC_INV_SYNC needs to be forwarded to the guest somehow

Oh, for sure. That's typically triggered with an asynchronous
timeout from the eventq, so we'd need the io page fault series
as you previously remarked. Though I also wonder if an invalid
vSID that doesn't link to a pSID should be CERROR_ATC_INV_SYNC
also v.s. CERROR_ILL.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-29 22:07                             ` Nicolin Chen
@ 2023-11-30  0:08                               ` Jason Gunthorpe
  2023-11-30 20:41                                 ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-11-30  0:08 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Wed, Nov 29, 2023 at 02:07:58PM -0800, Nicolin Chen wrote:
> On Wed, Nov 29, 2023 at 03:58:04PM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 28, 2023 at 05:09:07PM -0800, Nicolin Chen wrote:
> > 
> > > > > With that being said, I think errno (-EIO) could do the job,
> > > > > as you suggested too.
> > > > 
> > > > Do we have any idea what HW failures can be generated by the commands
> > > > this will execture? IIRC I don't remember seeing any smmu specific
> > > > codes related to invalid invalidation? Everything is a valid input?
> > > 
> > > "7.1 Command queue errors" has the info.
> > 
> > Hmm CERROR_ATC_INV_SYNC needs to be forwarded to the guest somehow
> 
> Oh, for sure. That's typically triggered with an asynchronous
> timeout from the eventq, so we'd need the io page fault series
> as you previously remarked. Though I also wonder if an invalid
> vSID that doesn't link to a pSID should be CERROR_ATC_INV_SYNC
> also v.s. CERROR_ILL.

Yes, something like that make sense

Presumably sync becomes emulated and turns into something generated
when the ioctl returns. So userspace would have to read the event FD
before returning to be correct?

Maybe the kernel can somehow return a flag to indicate the event fd
has data in it?

If yes then all errors would flow through the event fd?

Would Intel be basically the same too?

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-30  0:08                               ` Jason Gunthorpe
@ 2023-11-30 20:41                                 ` Nicolin Chen
  2023-12-01  0:45                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-30 20:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Wed, Nov 29, 2023 at 08:08:16PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 29, 2023 at 02:07:58PM -0800, Nicolin Chen wrote:
> > On Wed, Nov 29, 2023 at 03:58:04PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 28, 2023 at 05:09:07PM -0800, Nicolin Chen wrote:
> > > 
> > > > > > With that being said, I think errno (-EIO) could do the job,
> > > > > > as you suggested too.
> > > > > 
> > > > > Do we have any idea what HW failures can be generated by the commands
> > > > > this will execture? IIRC I don't remember seeing any smmu specific
> > > > > codes related to invalid invalidation? Everything is a valid input?
> > > > 
> > > > "7.1 Command queue errors" has the info.
> > > 
> > > Hmm CERROR_ATC_INV_SYNC needs to be forwarded to the guest somehow
> > 
> > Oh, for sure. That's typically triggered with an asynchronous
> > timeout from the eventq, so we'd need the io page fault series
> > as you previously remarked. Though I also wonder if an invalid
> > vSID that doesn't link to a pSID should be CERROR_ATC_INV_SYNC
> > also v.s. CERROR_ILL.
> 
> Yes, something like that make sense
> 
> Presumably sync becomes emulated and turns into something generated
> when the ioctl returns.

CMD_SYNC right? Yes. When the ioctl returns, VMM simply moves the
CONS index next to CMD_SYNC upon a success, or stops the index at
the faulty command upon a failure.

> So userspace would have to read the event FD
> before returning to be correct?
> 
> Maybe the kernel can somehow return a flag to indicate the event fd
> has data in it?
> 
> If yes then all errors would flow through the event fd?

I think it'd be nicer to return an immediate error to stop guest
CMDQ to raise a fault there accordingly, similar to returning a
-EIO for a bad STE in your SMMU part-3 series.

If the "return a flag" is an errno of the ioctl, it could work by
reading from a separate memory that belongs to the event fd. Yet,
in this case, an eventfd signal (assuming there is one to trigger
VMM's fault handler) becomes unnecessary, since the invalidation
ioctl is already handling it?

Thanks
Nic

> Would Intel be basically the same too?
> 
> Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-30 20:41                                 ` Nicolin Chen
@ 2023-12-01  0:45                                   ` Jason Gunthorpe
  2023-12-01  4:29                                     ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-01  0:45 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Thu, Nov 30, 2023 at 12:41:20PM -0800, Nicolin Chen wrote:

> > So userspace would have to read the event FD
> > before returning to be correct?
> > 
> > Maybe the kernel can somehow return a flag to indicate the event fd
> > has data in it?
> > 
> > If yes then all errors would flow through the event fd?
> 
> I think it'd be nicer to return an immediate error to stop guest
> CMDQ to raise a fault there accordingly, similar to returning a
> -EIO for a bad STE in your SMMU part-3 series.
> 
> If the "return a flag" is an errno of the ioctl, it could work by
> reading from a separate memory that belongs to the event fd. Yet,
> in this case, an eventfd signal (assuming there is one to trigger
> VMM's fault handler) becomes unnecessary, since the invalidation
> ioctl is already handling it?

My concern is how does all this fit together and do we push the right
things to the right places in the right order when an error occurs.

I did not study the spec carefully to see what exactly is supposed to
happen here, and I don't see things in Linux that make me think it
particularly cares..

ie Linux doesn't seem like it will know that an async event was even
triggered while processing the sync to generate an EIO. It looks like
it just gets ETIMEDOUT? Presumably we should be checking the event
queue to detect a pushed error?

It is worth understanding if the spec has language that requires
certain order so we can try to follow it.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  0:45                                   ` Jason Gunthorpe
@ 2023-12-01  4:29                                     ` Nicolin Chen
  2023-12-01 12:55                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-01  4:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Thu, Nov 30, 2023 at 08:45:23PM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 30, 2023 at 12:41:20PM -0800, Nicolin Chen wrote:
> 
> > > So userspace would have to read the event FD
> > > before returning to be correct?
> > > 
> > > Maybe the kernel can somehow return a flag to indicate the event fd
> > > has data in it?
> > > 
> > > If yes then all errors would flow through the event fd?
> > 
> > I think it'd be nicer to return an immediate error to stop guest
> > CMDQ to raise a fault there accordingly, similar to returning a
> > -EIO for a bad STE in your SMMU part-3 series.
> > 
> > If the "return a flag" is an errno of the ioctl, it could work by
> > reading from a separate memory that belongs to the event fd. Yet,
> > in this case, an eventfd signal (assuming there is one to trigger
> > VMM's fault handler) becomes unnecessary, since the invalidation
> > ioctl is already handling it?
> 
> My concern is how does all this fit together and do we push the right
> things to the right places in the right order when an error occurs.
> 
> I did not study the spec carefully to see what exactly is supposed to
> happen here, and I don't see things in Linux that make me think it
> particularly cares..
> 
> ie Linux doesn't seem like it will know that an async event was even
> triggered while processing the sync to generate an EIO. It looks like
> it just gets ETIMEDOUT? Presumably we should be checking the event
> queue to detect a pushed error?
> 
> It is worth understanding if the spec has language that requires
> certain order so we can try to follow it.

Oh, I replied one misinformation previously. Actually eventq
doesn't report a CERROR. The global error interrupt does.

7.1 has that sequence: 1) CMDQ stops 2) Log current index
to the CONS register 3) Log error code to the CONS register
4) Set bit-0 "CMDQ error" of GERROR register to rise an irq.

FWIW, both gerror and cmdq are global. So we can't know if
the error is for which master or domain. So, the only way
is to get errno from the arm_smmu_cmdq_issue_cmd_with_sync
call in our user invalidate function, where we can then get
the error code. But this feels very much synchronous, since
both the error code and faulty CONS index could be simply
returned without an async eventfd.

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  4:29                                     ` Nicolin Chen
@ 2023-12-01 12:55                                       ` Jason Gunthorpe
  2023-12-01 19:58                                         ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-01 12:55 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Thu, Nov 30, 2023 at 08:29:34PM -0800, Nicolin Chen wrote:
> On Thu, Nov 30, 2023 at 08:45:23PM -0400, Jason Gunthorpe wrote:
> > On Thu, Nov 30, 2023 at 12:41:20PM -0800, Nicolin Chen wrote:
> > 
> > > > So userspace would have to read the event FD
> > > > before returning to be correct?
> > > > 
> > > > Maybe the kernel can somehow return a flag to indicate the event fd
> > > > has data in it?
> > > > 
> > > > If yes then all errors would flow through the event fd?
> > > 
> > > I think it'd be nicer to return an immediate error to stop guest
> > > CMDQ to raise a fault there accordingly, similar to returning a
> > > -EIO for a bad STE in your SMMU part-3 series.
> > > 
> > > If the "return a flag" is an errno of the ioctl, it could work by
> > > reading from a separate memory that belongs to the event fd. Yet,
> > > in this case, an eventfd signal (assuming there is one to trigger
> > > VMM's fault handler) becomes unnecessary, since the invalidation
> > > ioctl is already handling it?
> > 
> > My concern is how does all this fit together and do we push the right
> > things to the right places in the right order when an error occurs.
> > 
> > I did not study the spec carefully to see what exactly is supposed to
> > happen here, and I don't see things in Linux that make me think it
> > particularly cares..
> > 
> > ie Linux doesn't seem like it will know that an async event was even
> > triggered while processing the sync to generate an EIO. It looks like
> > it just gets ETIMEDOUT? Presumably we should be checking the event
> > queue to detect a pushed error?
> > 
> > It is worth understanding if the spec has language that requires
> > certain order so we can try to follow it.
> 
> Oh, I replied one misinformation previously. Actually eventq
> doesn't report a CERROR. The global error interrupt does.
> 
> 7.1 has that sequence: 1) CMDQ stops 2) Log current index
> to the CONS register 3) Log error code to the CONS register
> 4) Set bit-0 "CMDQ error" of GERROR register to rise an irq.

Which triggers some async mechanism that seems to restart the command
queue and convert the error into a printk.

It seems there is not a simple way to realize this error back to
userspace since we can't block the global command queue and we proceed
to complete commands that the real HW would not have completed.

To actually emulate this the gerror handler would have to capture all
the necessary registers, return them back to the thread doing
invalidate_user and all of that would have to return back to userspace
to go into the virtual version of all the same registers.

Yes, it can be synchronous it seems, but we don't have any
infrastructure in the driver to do this.

Given this is pretty niche maybe we just don't support error
forwarding and simply ensure it could be added to the uapi later?

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01 12:55                                       ` Jason Gunthorpe
@ 2023-12-01 19:58                                         ` Nicolin Chen
  2023-12-01 20:43                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-01 19:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Fri, Dec 01, 2023 at 08:55:38AM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 30, 2023 at 08:29:34PM -0800, Nicolin Chen wrote:
> > On Thu, Nov 30, 2023 at 08:45:23PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Nov 30, 2023 at 12:41:20PM -0800, Nicolin Chen wrote:
> > > 
> > > > > So userspace would have to read the event FD
> > > > > before returning to be correct?
> > > > > 
> > > > > Maybe the kernel can somehow return a flag to indicate the event fd
> > > > > has data in it?
> > > > > 
> > > > > If yes then all errors would flow through the event fd?
> > > > 
> > > > I think it'd be nicer to return an immediate error to stop guest
> > > > CMDQ to raise a fault there accordingly, similar to returning a
> > > > -EIO for a bad STE in your SMMU part-3 series.
> > > > 
> > > > If the "return a flag" is an errno of the ioctl, it could work by
> > > > reading from a separate memory that belongs to the event fd. Yet,
> > > > in this case, an eventfd signal (assuming there is one to trigger
> > > > VMM's fault handler) becomes unnecessary, since the invalidation
> > > > ioctl is already handling it?
> > > 
> > > My concern is how does all this fit together and do we push the right
> > > things to the right places in the right order when an error occurs.
> > > 
> > > I did not study the spec carefully to see what exactly is supposed to
> > > happen here, and I don't see things in Linux that make me think it
> > > particularly cares..
> > > 
> > > ie Linux doesn't seem like it will know that an async event was even
> > > triggered while processing the sync to generate an EIO. It looks like
> > > it just gets ETIMEDOUT? Presumably we should be checking the event
> > > queue to detect a pushed error?
> > > 
> > > It is worth understanding if the spec has language that requires
> > > certain order so we can try to follow it.
> > 
> > Oh, I replied one misinformation previously. Actually eventq
> > doesn't report a CERROR. The global error interrupt does.
> > 
> > 7.1 has that sequence: 1) CMDQ stops 2) Log current index
> > to the CONS register 3) Log error code to the CONS register
> > 4) Set bit-0 "CMDQ error" of GERROR register to rise an irq.
> 
> Which triggers some async mechanism that seems to restart the command
> queue and convert the error into a printk.

Yes. For CERROR_ILL, it replaces the commands with another SYNC.

> It seems there is not a simple way to realize this error back to
> userspace since we can't block the global command queue and we proceed
> to complete commands that the real HW would not have completed.
> 
> To actually emulate this the gerror handler would have to capture all
> the necessary registers, return them back to the thread doing
> invalidate_user and all of that would have to return back to userspace
> to go into the virtual version of all the same registers.
> 
> Yes, it can be synchronous it seems, but we don't have any
> infrastructure in the driver to do this.
>
> Given this is pretty niche maybe we just don't support error
> forwarding and simply ensure it could be added to the uapi later?

If arm_smmu_cmdq_issue_cmdlist in arm_smmu_cache_invalidate_user
fails with ETIMEOUT, we polls the CONS register to get the error
code. This can cover CERROR_ABT and CERROR_ATC_INV.

As you remarked that we can't block the global CMDQ, so we have
to let a real CERROR_ILL go. Yet, we can make sure commands to
be fully sanitized before being issued, as we should immediately
reject faulty commands anyway, for errors such as unsupported op
codes, unzero-ed reserved fields, and unlinked vSIDs. This can
at least largely reduce the probability of a real CERROR_ILL.

So, combining these two, we can still have a basic synchronous
way by returning an errno to the invalidate ioctl? I see Kevin
replied something similar too.

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01 19:58                                         ` Nicolin Chen
@ 2023-12-01 20:43                                           ` Jason Gunthorpe
  2023-12-01 22:12                                             ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-01 20:43 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Fri, Dec 01, 2023 at 11:58:07AM -0800, Nicolin Chen wrote:
> > It seems there is not a simple way to realize this error back to
> > userspace since we can't block the global command queue and we proceed
> > to complete commands that the real HW would not have completed.
> > 
> > To actually emulate this the gerror handler would have to capture all
> > the necessary registers, return them back to the thread doing
> > invalidate_user and all of that would have to return back to userspace
> > to go into the virtual version of all the same registers.
> > 
> > Yes, it can be synchronous it seems, but we don't have any
> > infrastructure in the driver to do this.
> >
> > Given this is pretty niche maybe we just don't support error
> > forwarding and simply ensure it could be added to the uapi later?
> 
> If arm_smmu_cmdq_issue_cmdlist in arm_smmu_cache_invalidate_user
> fails with ETIMEOUT, we polls the CONS register to get the error
> code. This can cover CERROR_ABT and CERROR_ATC_INV.

Why is timeout linked to these two? Or rather, it doesn't have to be
linked like that. Any gerror is effectively synchronous because it
halts the queue and allows SW time to inspect which command failed and
record the gerror flags. So each and every command can get an error
indication.

Restarting the queue is done by putting sync in there to effectively
nop the failed command and we hope for the best and let it rip.

> As you remarked that we can't block the global CMDQ, so we have
> to let a real CERROR_ILL go. Yet, we can make sure commands to
> be fully sanitized before being issued, as we should immediately
> reject faulty commands anyway, for errors such as unsupported op
> codes, unzero-ed reserved fields, and unlinked vSIDs. This can
> at least largely reduce the probability of a real CERROR_ILL.

I'm more a little more concerend with ATC_INV as a malfunctioning
device can trigger this..

> So, combining these two, we can still have a basic synchronous
> way by returning an errno to the invalidate ioctl? I see Kevin
> replied something similar too.

It isn't enough information, you don't know which gerror bits to set
and you don't know what cons index to stick to indicate the error
triggering command with just a simple errno.

It does need to return a bunch of data to get it all right.

Though again, there is no driver infrastructure to do all this and it
doesn't seem that important so maybe we can just ensure there is a
future extension possiblity and have userspace understand an errno
means to generate CERROR_ILL on the last command in the batch?

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01 20:43                                           ` Jason Gunthorpe
@ 2023-12-01 22:12                                             ` Nicolin Chen
  2023-12-04 14:48                                               ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-01 22:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Fri, Dec 01, 2023 at 04:43:40PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2023 at 11:58:07AM -0800, Nicolin Chen wrote:
> > > It seems there is not a simple way to realize this error back to
> > > userspace since we can't block the global command queue and we proceed
> > > to complete commands that the real HW would not have completed.
> > > 
> > > To actually emulate this the gerror handler would have to capture all
> > > the necessary registers, return them back to the thread doing
> > > invalidate_user and all of that would have to return back to userspace
> > > to go into the virtual version of all the same registers.
> > > 
> > > Yes, it can be synchronous it seems, but we don't have any
> > > infrastructure in the driver to do this.
> > >
> > > Given this is pretty niche maybe we just don't support error
> > > forwarding and simply ensure it could be added to the uapi later?
> > 
> > If arm_smmu_cmdq_issue_cmdlist in arm_smmu_cache_invalidate_user
> > fails with ETIMEOUT, we polls the CONS register to get the error
> > code. This can cover CERROR_ABT and CERROR_ATC_INV.
> 
> Why is timeout linked to these two? Or rather, it doesn't have to be
> linked like that. Any gerror is effectively synchronous because it
> halts the queue and allows SW time to inspect which command failed and
> record the gerror flags. So each and every command can get an error
> indication.
> 
> Restarting the queue is done by putting sync in there to effectively
> nop the failed command and we hope for the best and let it rip.

I see that SMMU driver only restarts the queue when dealing with
CERROR_ILL. So only CERROR_ABT or CERROR_ATC_INV would result in
-ETIMEOUT.

> > As you remarked that we can't block the global CMDQ, so we have
> > to let a real CERROR_ILL go. Yet, we can make sure commands to
> > be fully sanitized before being issued, as we should immediately
> > reject faulty commands anyway, for errors such as unsupported op
> > codes, unzero-ed reserved fields, and unlinked vSIDs. This can
> > at least largely reduce the probability of a real CERROR_ILL.
> 
> I'm more a little more concerend with ATC_INV as a malfunctioning
> device can trigger this..

How about making sure that the invalidate handler always issues
one CMD_ATC_INV at a time, so each arm_smmu_cmdq_issue_cmdlist()
call has a chance to timeout? Then, we can simply know which one
in the user array fails.

> > So, combining these two, we can still have a basic synchronous
> > way by returning an errno to the invalidate ioctl? I see Kevin
> > replied something similar too.
> 
> It isn't enough information, you don't know which gerror bits to set
> and you don't know what cons index to stick to indicate the error
> triggering command with just a simple errno.
>
> It does need to return a bunch of data to get it all right.

The array structure returns req_num to indicate the index. This
works, even if the command consumption stops in the middle:
 * @req_num: Input the number of cache invalidation requests in the array.
 *           Output the number of requests successfully handled by kernel.

So we only need an error code of CERROR_ABT/ILL/ATC_INV.

Or am I missing some point here?

> Though again, there is no driver infrastructure to do all this and it
> doesn't seem that important so maybe we can just ensure there is a
> future extension possiblity and have userspace understand an errno
> means to generate CERROR_ILL on the last command in the batch?

Yea, it certainly can be a plan.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01 22:12                                             ` Nicolin Chen
@ 2023-12-04 14:48                                               ` Jason Gunthorpe
  2023-12-05 17:33                                                 ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-04 14:48 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Fri, Dec 01, 2023 at 02:12:28PM -0800, Nicolin Chen wrote:
> > Why is timeout linked to these two? Or rather, it doesn't have to be
> > linked like that. Any gerror is effectively synchronous because it
> > halts the queue and allows SW time to inspect which command failed and
> > record the gerror flags. So each and every command can get an error
> > indication.
> > 
> > Restarting the queue is done by putting sync in there to effectively
> > nop the failed command and we hope for the best and let it rip.
> 
> I see that SMMU driver only restarts the queue when dealing with
> CERROR_ILL. So only CERROR_ABT or CERROR_ATC_INV would result in
> -ETIMEOUT.

I'm not sure that is the best thing to do. ABT is basically the
machine caught fire, so sure there is no recovery for that.

But ATC_INV could be recovered and should ideally be canceled then
forwarded to the VM.
 
> > > As you remarked that we can't block the global CMDQ, so we have
> > > to let a real CERROR_ILL go. Yet, we can make sure commands to
> > > be fully sanitized before being issued, as we should immediately
> > > reject faulty commands anyway, for errors such as unsupported op
> > > codes, unzero-ed reserved fields, and unlinked vSIDs. This can
> > > at least largely reduce the probability of a real CERROR_ILL.
> > 
> > I'm more a little more concerend with ATC_INV as a malfunctioning
> > device can trigger this..
> 
> How about making sure that the invalidate handler always issues
> one CMD_ATC_INV at a time, so each arm_smmu_cmdq_issue_cmdlist()
> call has a chance to timeout? Then, we can simply know which one
> in the user array fails.

That sounds slow

> > > So, combining these two, we can still have a basic synchronous
> > > way by returning an errno to the invalidate ioctl? I see Kevin
> > > replied something similar too.
> > 
> > It isn't enough information, you don't know which gerror bits to set
> > and you don't know what cons index to stick to indicate the error
> > triggering command with just a simple errno.
> >
> > It does need to return a bunch of data to get it all right.
> 
> The array structure returns req_num to indicate the index. This
> works, even if the command consumption stops in the middle:
>  * @req_num: Input the number of cache invalidation requests in the array.
>  *           Output the number of requests successfully handled by kernel.
> 
> So we only need an error code of CERROR_ABT/ILL/ATC_INV.

Yes

> Or am I missing some point here?

It sounds Ok, we just have to understand what userspace should be
doing and how much of this the kernel should implement.

It seems to me that the error code should return the gerror and the
req_num should indicate the halted cons. The vmm should relay both
into the virtual registers.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-04 14:48                                               ` Jason Gunthorpe
@ 2023-12-05 17:33                                                 ` Nicolin Chen
  2023-12-06 12:48                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-05 17:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Mon, Dec 04, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
 
> > Or am I missing some point here?
> 
> It sounds Ok, we just have to understand what userspace should be
> doing and how much of this the kernel should implement.
> 
> It seems to me that the error code should return the gerror and the
> req_num should indicate the halted cons. The vmm should relay both
> into the virtual registers.

I see your concern. I will take a closer look and see if we can
add to the initial version of arm_smmu_cache_invalidate_user().
Otherwise, we can add later.

Btw, VT-d seems to want the error_code and reports in the VT-d
specific invalidate entry structure, as Kevin and Yi had that
discussion in the other side of the thread.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-05 17:33                                                 ` Nicolin Chen
@ 2023-12-06 12:48                                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-06 12:48 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Dec 05, 2023 at 09:33:30AM -0800, Nicolin Chen wrote:
> On Mon, Dec 04, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
>  
> > > Or am I missing some point here?
> > 
> > It sounds Ok, we just have to understand what userspace should be
> > doing and how much of this the kernel should implement.
> > 
> > It seems to me that the error code should return the gerror and the
> > req_num should indicate the halted cons. The vmm should relay both
> > into the virtual registers.
> 
> I see your concern. I will take a closer look and see if we can
> add to the initial version of arm_smmu_cache_invalidate_user().
> Otherwise, we can add later.
> 
> Btw, VT-d seems to want the error_code and reports in the VT-d
> specific invalidate entry structure, as Kevin and Yi had that
> discussion in the other side of the thread.

It could be fine to shift it into the driver data blob. It isn't
really generic in the end.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-29  0:57                       ` Jason Gunthorpe
  2023-11-29  1:09                         ` Nicolin Chen
@ 2023-12-01  3:51                         ` Yi Liu
  2023-12-01  4:50                           ` Nicolin Chen
  1 sibling, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-01  3:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/11/29 08:57, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
>>>> I also thought about making this out_driver_error_code per HW.
>>>> Yet, an error can be either per array or per entry/quest. The
>>>> array-related error should be reported in the array structure
>>>> that is a core uAPI, v.s. the per-HW entry structure. Though
>>>> we could still report an array error in the entry structure
>>>> at the first entry (or indexed by "array->entry_num")?
>>>>
>>>
>>> why would there be an array error? array is just a software
>>> entity containing actual HW invalidation cmds. If there is
>>> any error with the array itself it should be reported via
>>> ioctl errno.
>>
>> User array reading is a software operation, but kernel array
>> reading is a hardware operation that can raise an error when
>> the memory location to the array is incorrect or so.
> 
> Well, we shouldn't get into a situation like that.. By the time the HW
> got the address it should be valid.
> 
>> With that being said, I think errno (-EIO) could do the job,
>> as you suggested too.
> 
> Do we have any idea what HW failures can be generated by the commands
> this will execture? IIRC I don't remember seeing any smmu specific
> codes related to invalid invalidation? Everything is a valid input?
> 
> Can vt-d fail single commands? What about AMD?

Intel VT-d side, after each invalidation request, there is a wait 
descriptor which either provide an interrupt or an address for the
hw to notify software the request before the wait descriptor has been
completed. While, if there is error happened on the invalidation request,
a flag (IQE, ICE, ITE) would be set in the Fault Status Register, and some
detailed information would be recorded in the Invalidation Queue Error
Record Register. So an invalidation request may be failed with some error
reported. If no error, will return completion via the wait descriptor. Is
this what you mean by "fail a single command"?

>>> Jason, how about your opinion? I didn't spot big issues
>>> except this one. Hope it can make into 6.8.
> 
> Yes, lets try
> 
> Jason

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  3:51                         ` Yi Liu
@ 2023-12-01  4:50                           ` Nicolin Chen
  2023-12-01  5:19                             ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-01  4:50 UTC (permalink / raw)
  To: Yi Liu
  Cc: Jason Gunthorpe, Tian, Kevin, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Fri, Dec 01, 2023 at 11:51:26AM +0800, Yi Liu wrote:
> On 2023/11/29 08:57, Jason Gunthorpe wrote:
> > On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
> > > > > I also thought about making this out_driver_error_code per HW.
> > > > > Yet, an error can be either per array or per entry/quest. The
> > > > > array-related error should be reported in the array structure
> > > > > that is a core uAPI, v.s. the per-HW entry structure. Though
> > > > > we could still report an array error in the entry structure
> > > > > at the first entry (or indexed by "array->entry_num")?
> > > > > 
> > > > 
> > > > why would there be an array error? array is just a software
> > > > entity containing actual HW invalidation cmds. If there is
> > > > any error with the array itself it should be reported via
> > > > ioctl errno.
> > > 
> > > User array reading is a software operation, but kernel array
> > > reading is a hardware operation that can raise an error when
> > > the memory location to the array is incorrect or so.
> > 
> > Well, we shouldn't get into a situation like that.. By the time the HW
> > got the address it should be valid.
> > 
> > > With that being said, I think errno (-EIO) could do the job,
> > > as you suggested too.
> > 
> > Do we have any idea what HW failures can be generated by the commands
> > this will execture? IIRC I don't remember seeing any smmu specific
> > codes related to invalid invalidation? Everything is a valid input?
> > 
> > Can vt-d fail single commands? What about AMD?
> 
> Intel VT-d side, after each invalidation request, there is a wait
> descriptor which either provide an interrupt or an address for the
> hw to notify software the request before the wait descriptor has been
> completed. While, if there is error happened on the invalidation request,
> a flag (IQE, ICE, ITE) would be set in the Fault Status Register, and some
> detailed information would be recorded in the Invalidation Queue Error
> Record Register. So an invalidation request may be failed with some error
> reported. If no error, will return completion via the wait descriptor. Is
> this what you mean by "fail a single command"?

I see the current VT-d series marking those as "REVISIT". How
will it report an error to the user space from those register?

Are they global status registers so that it might be difficult
to direct the error to the nested domain for an event fd?

Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  4:50                           ` Nicolin Chen
@ 2023-12-01  5:19                             ` Tian, Kevin
  2023-12-01  7:05                               ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-12-01  5:19 UTC (permalink / raw)
  To: Nicolin Chen, Liu, Yi L
  Cc: Jason Gunthorpe, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, December 1, 2023 12:50 PM
> 
> On Fri, Dec 01, 2023 at 11:51:26AM +0800, Yi Liu wrote:
> > On 2023/11/29 08:57, Jason Gunthorpe wrote:
> > > On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
> > > > > > I also thought about making this out_driver_error_code per HW.
> > > > > > Yet, an error can be either per array or per entry/quest. The
> > > > > > array-related error should be reported in the array structure
> > > > > > that is a core uAPI, v.s. the per-HW entry structure. Though
> > > > > > we could still report an array error in the entry structure
> > > > > > at the first entry (or indexed by "array->entry_num")?
> > > > > >
> > > > >
> > > > > why would there be an array error? array is just a software
> > > > > entity containing actual HW invalidation cmds. If there is
> > > > > any error with the array itself it should be reported via
> > > > > ioctl errno.
> > > >
> > > > User array reading is a software operation, but kernel array
> > > > reading is a hardware operation that can raise an error when
> > > > the memory location to the array is incorrect or so.
> > >
> > > Well, we shouldn't get into a situation like that.. By the time the HW
> > > got the address it should be valid.
> > >
> > > > With that being said, I think errno (-EIO) could do the job,
> > > > as you suggested too.
> > >
> > > Do we have any idea what HW failures can be generated by the
> commands
> > > this will execture? IIRC I don't remember seeing any smmu specific
> > > codes related to invalid invalidation? Everything is a valid input?
> > >
> > > Can vt-d fail single commands? What about AMD?
> >
> > Intel VT-d side, after each invalidation request, there is a wait
> > descriptor which either provide an interrupt or an address for the
> > hw to notify software the request before the wait descriptor has been
> > completed. While, if there is error happened on the invalidation request,
> > a flag (IQE, ICE, ITE) would be set in the Fault Status Register, and some
> > detailed information would be recorded in the Invalidation Queue Error
> > Record Register. So an invalidation request may be failed with some error
> > reported. If no error, will return completion via the wait descriptor. Is
> > this what you mean by "fail a single command"?
> 
> I see the current VT-d series marking those as "REVISIT". How
> will it report an error to the user space from those register?
> 
> Are they global status registers so that it might be difficult
> to direct the error to the nested domain for an event fd?
> 

They are global registers but invalidation queue is also the global
resource. intel-iommu driver polls the status register after queueing
new invalidation descriptors. The submission is serialized.

If the error is related to a descriptor itself (e.g. format issue) then
the head register points to the problematic descriptor so software
can direct it to the related domain.

If the error is related to device tlb invalidation (e.g. timeout) there
is no way to associate the error with a specific descriptor by current
spec. But intel-iommu driver batches descriptors per domain so
we can still direct the error to the nested domain.

But I don't see the need of doing it via eventfd.

The poll semantics in intel-iommu driver is essentially a sync model.
vt-d spec does allow software to optionally enable notification upon
those errors but it's not used so far.

With that I still prefer to having driver-specific error code defined
in the entry. If ARM is an event-driven model then we can define
that field at least in vtd specific data structure.

btw given vtd doesn't use native format in uAPI it doesn't make
sense to forward descriptor formatting errors back to userspace.
Those, if happen, are driver's own problem. intel-iommu driver
should verify the uAPI structure and return -EINVAL or proper
errno to userspace purely in software.

With that Yi please just define error codes for device tlb related
errors for vtd.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  5:19                             ` Tian, Kevin
@ 2023-12-01  7:05                               ` Yi Liu
  2023-12-01  7:10                                 ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-01  7:05 UTC (permalink / raw)
  To: Tian, Kevin, Nicolin Chen
  Cc: Jason Gunthorpe, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/12/1 13:19, Tian, Kevin wrote:
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Friday, December 1, 2023 12:50 PM
>>
>> On Fri, Dec 01, 2023 at 11:51:26AM +0800, Yi Liu wrote:
>>> On 2023/11/29 08:57, Jason Gunthorpe wrote:
>>>> On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
>>>>>>> I also thought about making this out_driver_error_code per HW.
>>>>>>> Yet, an error can be either per array or per entry/quest. The
>>>>>>> array-related error should be reported in the array structure
>>>>>>> that is a core uAPI, v.s. the per-HW entry structure. Though
>>>>>>> we could still report an array error in the entry structure
>>>>>>> at the first entry (or indexed by "array->entry_num")?
>>>>>>>
>>>>>>
>>>>>> why would there be an array error? array is just a software
>>>>>> entity containing actual HW invalidation cmds. If there is
>>>>>> any error with the array itself it should be reported via
>>>>>> ioctl errno.
>>>>>
>>>>> User array reading is a software operation, but kernel array
>>>>> reading is a hardware operation that can raise an error when
>>>>> the memory location to the array is incorrect or so.
>>>>
>>>> Well, we shouldn't get into a situation like that.. By the time the HW
>>>> got the address it should be valid.
>>>>
>>>>> With that being said, I think errno (-EIO) could do the job,
>>>>> as you suggested too.
>>>>
>>>> Do we have any idea what HW failures can be generated by the
>> commands
>>>> this will execture? IIRC I don't remember seeing any smmu specific
>>>> codes related to invalid invalidation? Everything is a valid input?
>>>>
>>>> Can vt-d fail single commands? What about AMD?
>>>
>>> Intel VT-d side, after each invalidation request, there is a wait
>>> descriptor which either provide an interrupt or an address for the
>>> hw to notify software the request before the wait descriptor has been
>>> completed. While, if there is error happened on the invalidation request,
>>> a flag (IQE, ICE, ITE) would be set in the Fault Status Register, and some
>>> detailed information would be recorded in the Invalidation Queue Error
>>> Record Register. So an invalidation request may be failed with some error
>>> reported. If no error, will return completion via the wait descriptor. Is
>>> this what you mean by "fail a single command"?
>>
>> I see the current VT-d series marking those as "REVISIT". How
>> will it report an error to the user space from those register?
>>
>> Are they global status registers so that it might be difficult
>> to direct the error to the nested domain for an event fd?
>>
> 
> They are global registers but invalidation queue is also the global
> resource. intel-iommu driver polls the status register after queueing
> new invalidation descriptors. The submission is serialized.
> 
> If the error is related to a descriptor itself (e.g. format issue) then
> the head register points to the problematic descriptor so software
> can direct it to the related domain.
> 
> If the error is related to device tlb invalidation (e.g. timeout) there
> is no way to associate the error with a specific descriptor by current
> spec. But intel-iommu driver batches descriptors per domain so
> we can still direct the error to the nested domain.
> 
> But I don't see the need of doing it via eventfd.
> 
> The poll semantics in intel-iommu driver is essentially a sync model.
> vt-d spec does allow software to optionally enable notification upon
> those errors but it's not used so far.
> 
> With that I still prefer to having driver-specific error code defined
> in the entry. If ARM is an event-driven model then we can define
> that field at least in vtd specific data structure.
> 
> btw given vtd doesn't use native format in uAPI it doesn't make
> sense to forward descriptor formatting errors back to userspace.
> Those, if happen, are driver's own problem. intel-iommu driver
> should verify the uAPI structure and return -EINVAL or proper
> errno to userspace purely in software.
> 
> With that Yi please just define error codes for device tlb related
> errors for vtd.

hmmm, this sounds like customized error code. is it? So far, VT-d
spec has two errors (ICE and ITE). ITE is valuable to let userspace
know. For ICE, looks like no much value. Intel iommu driver should
be responsible to submit a valid device-tlb invalidation to device.
is it?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  7:05                               ` Yi Liu
@ 2023-12-01  7:10                                 ` Tian, Kevin
  2023-12-01  9:08                                   ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-12-01  7:10 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen
  Cc: Jason Gunthorpe, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, December 1, 2023 3:05 PM
> 
> On 2023/12/1 13:19, Tian, Kevin wrote:
> >> From: Nicolin Chen <nicolinc@nvidia.com>
> >> Sent: Friday, December 1, 2023 12:50 PM
> >>
> >> On Fri, Dec 01, 2023 at 11:51:26AM +0800, Yi Liu wrote:
> >>> On 2023/11/29 08:57, Jason Gunthorpe wrote:
> >>>> On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
> >>>>>>> I also thought about making this out_driver_error_code per HW.
> >>>>>>> Yet, an error can be either per array or per entry/quest. The
> >>>>>>> array-related error should be reported in the array structure
> >>>>>>> that is a core uAPI, v.s. the per-HW entry structure. Though
> >>>>>>> we could still report an array error in the entry structure
> >>>>>>> at the first entry (or indexed by "array->entry_num")?
> >>>>>>>
> >>>>>>
> >>>>>> why would there be an array error? array is just a software
> >>>>>> entity containing actual HW invalidation cmds. If there is
> >>>>>> any error with the array itself it should be reported via
> >>>>>> ioctl errno.
> >>>>>
> >>>>> User array reading is a software operation, but kernel array
> >>>>> reading is a hardware operation that can raise an error when
> >>>>> the memory location to the array is incorrect or so.
> >>>>
> >>>> Well, we shouldn't get into a situation like that.. By the time the HW
> >>>> got the address it should be valid.
> >>>>
> >>>>> With that being said, I think errno (-EIO) could do the job,
> >>>>> as you suggested too.
> >>>>
> >>>> Do we have any idea what HW failures can be generated by the
> >> commands
> >>>> this will execture? IIRC I don't remember seeing any smmu specific
> >>>> codes related to invalid invalidation? Everything is a valid input?
> >>>>
> >>>> Can vt-d fail single commands? What about AMD?
> >>>
> >>> Intel VT-d side, after each invalidation request, there is a wait
> >>> descriptor which either provide an interrupt or an address for the
> >>> hw to notify software the request before the wait descriptor has been
> >>> completed. While, if there is error happened on the invalidation request,
> >>> a flag (IQE, ICE, ITE) would be set in the Fault Status Register, and some
> >>> detailed information would be recorded in the Invalidation Queue Error
> >>> Record Register. So an invalidation request may be failed with some
> error
> >>> reported. If no error, will return completion via the wait descriptor. Is
> >>> this what you mean by "fail a single command"?
> >>
> >> I see the current VT-d series marking those as "REVISIT". How
> >> will it report an error to the user space from those register?
> >>
> >> Are they global status registers so that it might be difficult
> >> to direct the error to the nested domain for an event fd?
> >>
> >
> > They are global registers but invalidation queue is also the global
> > resource. intel-iommu driver polls the status register after queueing
> > new invalidation descriptors. The submission is serialized.
> >
> > If the error is related to a descriptor itself (e.g. format issue) then
> > the head register points to the problematic descriptor so software
> > can direct it to the related domain.
> >
> > If the error is related to device tlb invalidation (e.g. timeout) there
> > is no way to associate the error with a specific descriptor by current
> > spec. But intel-iommu driver batches descriptors per domain so
> > we can still direct the error to the nested domain.
> >
> > But I don't see the need of doing it via eventfd.
> >
> > The poll semantics in intel-iommu driver is essentially a sync model.
> > vt-d spec does allow software to optionally enable notification upon
> > those errors but it's not used so far.
> >
> > With that I still prefer to having driver-specific error code defined
> > in the entry. If ARM is an event-driven model then we can define
> > that field at least in vtd specific data structure.
> >
> > btw given vtd doesn't use native format in uAPI it doesn't make
> > sense to forward descriptor formatting errors back to userspace.
> > Those, if happen, are driver's own problem. intel-iommu driver
> > should verify the uAPI structure and return -EINVAL or proper
> > errno to userspace purely in software.
> >
> > With that Yi please just define error codes for device tlb related
> > errors for vtd.
> 
> hmmm, this sounds like customized error code. is it? So far, VT-d

yes. there is no need to replicate hardware registers/bits if most
of them are irrelevant to userspace.

> spec has two errors (ICE and ITE). ITE is valuable to let userspace
> know. For ICE, looks like no much value. Intel iommu driver should
> be responsible to submit a valid device-tlb invalidation to device.

it's an invalid completion message from the device which could be
caused by various reasons (not exactly due to the invalidation 
request by iommu driver). so it still makes sense to forward.

> is it?
> 
> --
> Regards,
> Yi Liu 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-01  7:10                                 ` Tian, Kevin
@ 2023-12-01  9:08                                   ` Yi Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Yi Liu @ 2023-12-01  9:08 UTC (permalink / raw)
  To: Tian, Kevin, Nicolin Chen
  Cc: Jason Gunthorpe, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/12/1 15:10, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, December 1, 2023 3:05 PM
>>
>> On 2023/12/1 13:19, Tian, Kevin wrote:
>>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>> Sent: Friday, December 1, 2023 12:50 PM
>>>>
>>>> On Fri, Dec 01, 2023 at 11:51:26AM +0800, Yi Liu wrote:
>>>>> On 2023/11/29 08:57, Jason Gunthorpe wrote:
>>>>>> On Tue, Nov 28, 2023 at 04:51:21PM -0800, Nicolin Chen wrote:
>>>>>>>>> I also thought about making this out_driver_error_code per HW.
>>>>>>>>> Yet, an error can be either per array or per entry/quest. The
>>>>>>>>> array-related error should be reported in the array structure
>>>>>>>>> that is a core uAPI, v.s. the per-HW entry structure. Though
>>>>>>>>> we could still report an array error in the entry structure
>>>>>>>>> at the first entry (or indexed by "array->entry_num")?
>>>>>>>>>
>>>>>>>>
>>>>>>>> why would there be an array error? array is just a software
>>>>>>>> entity containing actual HW invalidation cmds. If there is
>>>>>>>> any error with the array itself it should be reported via
>>>>>>>> ioctl errno.
>>>>>>>
>>>>>>> User array reading is a software operation, but kernel array
>>>>>>> reading is a hardware operation that can raise an error when
>>>>>>> the memory location to the array is incorrect or so.
>>>>>>
>>>>>> Well, we shouldn't get into a situation like that.. By the time the HW
>>>>>> got the address it should be valid.
>>>>>>
>>>>>>> With that being said, I think errno (-EIO) could do the job,
>>>>>>> as you suggested too.
>>>>>>
>>>>>> Do we have any idea what HW failures can be generated by the
>>>> commands
>>>>>> this will execture? IIRC I don't remember seeing any smmu specific
>>>>>> codes related to invalid invalidation? Everything is a valid input?
>>>>>>
>>>>>> Can vt-d fail single commands? What about AMD?
>>>>>
>>>>> Intel VT-d side, after each invalidation request, there is a wait
>>>>> descriptor which either provide an interrupt or an address for the
>>>>> hw to notify software the request before the wait descriptor has been
>>>>> completed. While, if there is error happened on the invalidation request,
>>>>> a flag (IQE, ICE, ITE) would be set in the Fault Status Register, and some
>>>>> detailed information would be recorded in the Invalidation Queue Error
>>>>> Record Register. So an invalidation request may be failed with some
>> error
>>>>> reported. If no error, will return completion via the wait descriptor. Is
>>>>> this what you mean by "fail a single command"?
>>>>
>>>> I see the current VT-d series marking those as "REVISIT". How
>>>> will it report an error to the user space from those register?
>>>>
>>>> Are they global status registers so that it might be difficult
>>>> to direct the error to the nested domain for an event fd?
>>>>
>>>
>>> They are global registers but invalidation queue is also the global
>>> resource. intel-iommu driver polls the status register after queueing
>>> new invalidation descriptors. The submission is serialized.
>>>
>>> If the error is related to a descriptor itself (e.g. format issue) then
>>> the head register points to the problematic descriptor so software
>>> can direct it to the related domain.
>>>
>>> If the error is related to device tlb invalidation (e.g. timeout) there
>>> is no way to associate the error with a specific descriptor by current
>>> spec. But intel-iommu driver batches descriptors per domain so
>>> we can still direct the error to the nested domain.
>>>
>>> But I don't see the need of doing it via eventfd.
>>>
>>> The poll semantics in intel-iommu driver is essentially a sync model.
>>> vt-d spec does allow software to optionally enable notification upon
>>> those errors but it's not used so far.
>>>
>>> With that I still prefer to having driver-specific error code defined
>>> in the entry. If ARM is an event-driven model then we can define
>>> that field at least in vtd specific data structure.
>>>
>>> btw given vtd doesn't use native format in uAPI it doesn't make
>>> sense to forward descriptor formatting errors back to userspace.
>>> Those, if happen, are driver's own problem. intel-iommu driver
>>> should verify the uAPI structure and return -EINVAL or proper
>>> errno to userspace purely in software.
>>>
>>> With that Yi please just define error codes for device tlb related
>>> errors for vtd.
>>
>> hmmm, this sounds like customized error code. is it? So far, VT-d
> 
> yes. there is no need to replicate hardware registers/bits if most
> of them are irrelevant to userspace.
> 
>> spec has two errors (ICE and ITE). ITE is valuable to let userspace
>> know. For ICE, looks like no much value. Intel iommu driver should
>> be responsible to submit a valid device-tlb invalidation to device.
> 
> it's an invalid completion message from the device which could be
> caused by various reasons (not exactly due to the invalidation
> request by iommu driver). so it still makes sense to forward.

ok. so we may need to define a field to forward the detailed info to
user as well. This data is error-code specific. @Nic, are we aligned
that the error_code field and error data reporting should be moved
to the driver-specific part since it is different between vendors?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
  2023-11-20  8:09   ` Tian, Kevin
@ 2023-11-21  5:02   ` Baolu Lu
  2023-11-21  5:19     ` Nicolin Chen
  2023-12-06 18:33   ` Jason Gunthorpe
  2023-12-07  6:59   ` Yi Liu
  3 siblings, 1 reply; 93+ messages in thread
From: Baolu Lu @ 2023-11-21  5:02 UTC (permalink / raw)
  To: Yi Liu, joro, alex.williamson, jgg, kevin.tian, robin.murphy
  Cc: baolu.lu, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit, iommu,
	linux-kernel, linux-kselftest, zhenzhong.duan, joao.m.martins,
	xin.zeng, yan.y.zhao

On 11/17/23 9:07 PM, Yi Liu wrote:
> In nested translation, the stage-1 page table is user-managed but cached
> by the IOMMU hardware, so an update on present page table entries in the
> stage-1 page table should be followed with a cache invalidation.
> 
> Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
> It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
> support multiple invalidation requests in one ioctl.
> 
> Check cache_invalidate_user op in the iommufd_hw_pagetable_alloc_nested,
> since all nested domains need that.
> 
> Co-developed-by: Nicolin Chen<nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
> Signed-off-by: Yi Liu<yi.l.liu@intel.com>
> ---
>   drivers/iommu/iommufd/hw_pagetable.c    | 35 +++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  9 +++++++
>   drivers/iommu/iommufd/main.c            |  3 +++
>   include/uapi/linux/iommufd.h            | 35 +++++++++++++++++++++++++
>   4 files changed, 82 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 2abbeafdbd22..367459d92f69 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -238,6 +238,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
>   		rc = -EINVAL;
>   		goto out_abort;
>   	}
> +	/* Driver is buggy by missing cache_invalidate_user in domain_ops */
> +	if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user)) {
> +		rc = -EINVAL;
> +		goto out_abort;
> +	}
>   	return hwpt_nested;

The WARN message here may cause kernel regression when users bisect
issues. Till this patch, there are no drivers support the
cache_invalidation_user callback yet.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-21  5:02   ` Baolu Lu
@ 2023-11-21  5:19     ` Nicolin Chen
  2023-11-28  5:54       ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-21  5:19 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Yi Liu, joro, alex.williamson, jgg, kevin.tian, robin.murphy,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, zhenzhong.duan,
	joao.m.martins, xin.zeng, yan.y.zhao

On Tue, Nov 21, 2023 at 01:02:49PM +0800, Baolu Lu wrote:
> On 11/17/23 9:07 PM, Yi Liu wrote:
> > In nested translation, the stage-1 page table is user-managed but cached
> > by the IOMMU hardware, so an update on present page table entries in the
> > stage-1 page table should be followed with a cache invalidation.
> > 
> > Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
> > It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
> > support multiple invalidation requests in one ioctl.
> > 
> > Check cache_invalidate_user op in the iommufd_hw_pagetable_alloc_nested,
> > since all nested domains need that.
> > 
> > Co-developed-by: Nicolin Chen<nicolinc@nvidia.com>
> > Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
> > Signed-off-by: Yi Liu<yi.l.liu@intel.com>
> > ---
> >   drivers/iommu/iommufd/hw_pagetable.c    | 35 +++++++++++++++++++++++++
> >   drivers/iommu/iommufd/iommufd_private.h |  9 +++++++
> >   drivers/iommu/iommufd/main.c            |  3 +++
> >   include/uapi/linux/iommufd.h            | 35 +++++++++++++++++++++++++
> >   4 files changed, 82 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> > index 2abbeafdbd22..367459d92f69 100644
> > --- a/drivers/iommu/iommufd/hw_pagetable.c
> > +++ b/drivers/iommu/iommufd/hw_pagetable.c
> > @@ -238,6 +238,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> >               rc = -EINVAL;
> >               goto out_abort;
> >       }
> > +     /* Driver is buggy by missing cache_invalidate_user in domain_ops */
> > +     if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user)) {
> > +             rc = -EINVAL;
> > +             goto out_abort;
> > +     }
> >       return hwpt_nested;
> 
> The WARN message here may cause kernel regression when users bisect
> issues. Till this patch, there are no drivers support the
> cache_invalidation_user callback yet.

Ah, this is an unintended consequence from our uAPI bisect to
merge the nesting alloc first...

Would removing the WARN_ON_ONCE be okay? Although having this
WARN is actually the point here...

Thanks
Nic

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-21  5:19     ` Nicolin Chen
@ 2023-11-28  5:54       ` Yi Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Yi Liu @ 2023-11-28  5:54 UTC (permalink / raw)
  To: Nicolin Chen, Baolu Lu
  Cc: joro, alex.williamson, jgg, kevin.tian, robin.murphy, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, zhenzhong.duan,
	joao.m.martins, xin.zeng, yan.y.zhao

On 2023/11/21 13:19, Nicolin Chen wrote:
> On Tue, Nov 21, 2023 at 01:02:49PM +0800, Baolu Lu wrote:
>> On 11/17/23 9:07 PM, Yi Liu wrote:
>>> In nested translation, the stage-1 page table is user-managed but cached
>>> by the IOMMU hardware, so an update on present page table entries in the
>>> stage-1 page table should be followed with a cache invalidation.
>>>
>>> Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
>>> It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
>>> support multiple invalidation requests in one ioctl.
>>>
>>> Check cache_invalidate_user op in the iommufd_hw_pagetable_alloc_nested,
>>> since all nested domains need that.
>>>
>>> Co-developed-by: Nicolin Chen<nicolinc@nvidia.com>
>>> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
>>> Signed-off-by: Yi Liu<yi.l.liu@intel.com>
>>> ---
>>>    drivers/iommu/iommufd/hw_pagetable.c    | 35 +++++++++++++++++++++++++
>>>    drivers/iommu/iommufd/iommufd_private.h |  9 +++++++
>>>    drivers/iommu/iommufd/main.c            |  3 +++
>>>    include/uapi/linux/iommufd.h            | 35 +++++++++++++++++++++++++
>>>    4 files changed, 82 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
>>> index 2abbeafdbd22..367459d92f69 100644
>>> --- a/drivers/iommu/iommufd/hw_pagetable.c
>>> +++ b/drivers/iommu/iommufd/hw_pagetable.c
>>> @@ -238,6 +238,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
>>>                rc = -EINVAL;
>>>                goto out_abort;
>>>        }
>>> +     /* Driver is buggy by missing cache_invalidate_user in domain_ops */
>>> +     if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user)) {
>>> +             rc = -EINVAL;
>>> +             goto out_abort;
>>> +     }
>>>        return hwpt_nested;
>>
>> The WARN message here may cause kernel regression when users bisect
>> issues. Till this patch, there are no drivers support the
>> cache_invalidation_user callback yet.
> 
> Ah, this is an unintended consequence from our uAPI bisect to
> merge the nesting alloc first...
> 
> Would removing the WARN_ON_ONCE be okay? Although having this
> WARN is actually the point here...

seems like we may need to remove it. how about your opinion, @Jason?

> Thanks
> Nic

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
  2023-11-20  8:09   ` Tian, Kevin
  2023-11-21  5:02   ` Baolu Lu
@ 2023-12-06 18:33   ` Jason Gunthorpe
  2023-12-07  6:59   ` Yi Liu
  3 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-06 18:33 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On Fri, Nov 17, 2023 at 05:07:13AM -0800, Yi Liu wrote:
> +/**
> + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
> + * @size: sizeof(struct iommu_hwpt_invalidate)
> + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
> + * @reqs_uptr: User pointer to an array having @req_num of cache invalidation
> + *             requests. The request entries in the array are of fixed width
> + *             @req_len, and contain a user data structure for invalidation
> + *             request specific to the given hardware page table.
> + * @req_type: One of enum iommu_hwpt_data_type, defining the data type of all
> + *            the entries in the invalidation request array. It should suit
> + *            with the data_type passed per the allocation of the hwpt pointed
> + *            by @hwpt_id.
> + * @req_len: Length (in bytes) of a request entry in the request array
> + * @req_num: Input the number of cache invalidation requests in the array.
> + *           Output the number of requests successfully handled by kernel.
> + * @out_driver_error_code: Report a driver speicifc error code upon failure.
> + *                         It's optional, driver has a choice to fill it or
> + *                         not.

"specific"

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
                     ` (2 preceding siblings ...)
  2023-12-06 18:33   ` Jason Gunthorpe
@ 2023-12-07  6:59   ` Yi Liu
  2023-12-07  9:04     ` Tian, Kevin
  3 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-07  6:59 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On 2023/11/17 21:07, Yi Liu wrote:
> In nested translation, the stage-1 page table is user-managed but cached
> by the IOMMU hardware, so an update on present page table entries in the
> stage-1 page table should be followed with a cache invalidation.
> 
> Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
> It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
> support multiple invalidation requests in one ioctl.
> 
> Check cache_invalidate_user op in the iommufd_hw_pagetable_alloc_nested,
> since all nested domains need that.
> 
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>   drivers/iommu/iommufd/hw_pagetable.c    | 35 +++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  9 +++++++
>   drivers/iommu/iommufd/main.c            |  3 +++
>   include/uapi/linux/iommufd.h            | 35 +++++++++++++++++++++++++
>   4 files changed, 82 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 2abbeafdbd22..367459d92f69 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -238,6 +238,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
>   		rc = -EINVAL;
>   		goto out_abort;
>   	}
> +	/* Driver is buggy by missing cache_invalidate_user in domain_ops */
> +	if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user)) {
> +		rc = -EINVAL;
> +		goto out_abort;
> +	}
>   	return hwpt_nested;
>   
>   out_abort:
> @@ -370,4 +375,34 @@ int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd)
>   
>   	iommufd_put_object(&hwpt_paging->common.obj);
>   	return rc;
> +};
> +
> +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
> +	struct iommu_user_data_array data_array = {
> +		.type = cmd->req_type,
> +		.uptr = u64_to_user_ptr(cmd->reqs_uptr),
> +		.entry_len = cmd->req_len,
> +		.entry_num = cmd->req_num,
> +	};
> +	struct iommufd_hw_pagetable *hwpt;
> +	int rc = 0;
> +
> +	if (cmd->req_type == IOMMU_HWPT_DATA_NONE)
> +		return -EINVAL;
> +	if (!cmd->reqs_uptr || !cmd->req_len || !cmd->req_num)
> +		return -EINVAL;
> +
> +	hwpt = iommufd_hw_pagetable_get_nested(ucmd, cmd->hwpt_id);
> +	if (IS_ERR(hwpt))
> +		return PTR_ERR(hwpt);
> +
> +	rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain, &data_array,
> +						      &cmd->out_driver_error_code);
> +	cmd->req_num = data_array.entry_num;
> +	if (iommufd_ucmd_respond(ucmd, sizeof(*cmd)))
> +		return -EFAULT;
> +	iommufd_put_object(&hwpt->obj);
> +	return rc;
>   }
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index a74cfefffbc6..160521800d9b 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -301,6 +301,7 @@ void iommufd_hwpt_paging_abort(struct iommufd_object *obj);
>   void iommufd_hwpt_nested_destroy(struct iommufd_object *obj);
>   void iommufd_hwpt_nested_abort(struct iommufd_object *obj);
>   int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd);
> +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
>   
>   static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
>   					    struct iommufd_hw_pagetable *hwpt)
> @@ -318,6 +319,14 @@ static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
>   	refcount_dec(&hwpt->obj.users);
>   }
>   
> +static inline struct iommufd_hw_pagetable *
> +iommufd_hw_pagetable_get_nested(struct iommufd_ucmd *ucmd, u32 id)
> +{
> +	return container_of(iommufd_get_object(ucmd->ictx, id,
> +					       IOMMUFD_OBJ_HWPT_NESTED),
> +			    struct iommufd_hw_pagetable, obj);
> +}
> +
>   struct iommufd_group {
>   	struct kref ref;
>   	struct mutex lock;
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 45b9d40773b1..6edef860f91c 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -309,6 +309,7 @@ union ucmd_buffer {
>   	struct iommu_hwpt_alloc hwpt;
>   	struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
>   	struct iommu_hwpt_set_dirty_tracking set_dirty_tracking;
> +	struct iommu_hwpt_invalidate cache;
>   	struct iommu_ioas_alloc alloc;
>   	struct iommu_ioas_allow_iovas allow_iovas;
>   	struct iommu_ioas_copy ioas_copy;
> @@ -348,6 +349,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>   		 struct iommu_hwpt_get_dirty_bitmap, data),
>   	IOCTL_OP(IOMMU_HWPT_SET_DIRTY_TRACKING, iommufd_hwpt_set_dirty_tracking,
>   		 struct iommu_hwpt_set_dirty_tracking, __reserved),
> +	IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate,
> +		 struct iommu_hwpt_invalidate, out_driver_error_code),
>   	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
>   		 struct iommu_ioas_alloc, out_ioas_id),
>   	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 0b2bc6252e2c..7f92cecc87d7 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -49,6 +49,7 @@ enum {
>   	IOMMUFD_CMD_GET_HW_INFO,
>   	IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING,
>   	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
> +	IOMMUFD_CMD_HWPT_INVALIDATE,
>   };
>   
>   /**
> @@ -613,4 +614,38 @@ struct iommu_hwpt_get_dirty_bitmap {
>   #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
>   					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
>   
> +/**
> + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
> + * @size: sizeof(struct iommu_hwpt_invalidate)
> + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
> + * @reqs_uptr: User pointer to an array having @req_num of cache invalidation
> + *             requests. The request entries in the array are of fixed width
> + *             @req_len, and contain a user data structure for invalidation
> + *             request specific to the given hardware page table.
> + * @req_type: One of enum iommu_hwpt_data_type, defining the data type of all
> + *            the entries in the invalidation request array. It should suit
> + *            with the data_type passed per the allocation of the hwpt pointed
> + *            by @hwpt_id.

@Jason and Kevin,

Here a check with you two. I had a conversation with Nic on the definition
of req_type here. It was added to support potential multiple kinds of cache
invalidation data types for a invalidating cache for a single hwpt type[1].
But we defined it as reusing the hwpt_data_type. In this way, it is not
able to support the potential case in[1]. is it? Shall we define a separate
enum for invalidation data types? And how can we let user know the
available invalidation data types for a hwpt type? Any idea?

[1] https://lore.kernel.org/linux-iommu/20231018163720.GA3952@nvidia.com/


> + * @req_len: Length (in bytes) of a request entry in the request array
> + * @req_num: Input the number of cache invalidation requests in the array.
> + *           Output the number of requests successfully handled by kernel.
> + * @out_driver_error_code: Report a driver speicifc error code upon failure.
> + *                         It's optional, driver has a choice to fill it or
> + *                         not.
> + *
> + * Invalidate the iommu cache for user-managed page table. Modifications on a
> + * user-managed page table should be followed by this operation to sync cache.
> + * Each ioctl can support one or more cache invalidation requests in the array
> + * that has a total size of @req_len * @req_num.
> + */
> +struct iommu_hwpt_invalidate {
> +	__u32 size;
> +	__u32 hwpt_id;
> +	__aligned_u64 reqs_uptr;
> +	__u32 req_type;
> +	__u32 req_len;
> +	__u32 req_num;
> +	__u32 out_driver_error_code;
> +};
> +#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
>   #endif

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-07  6:59   ` Yi Liu
@ 2023-12-07  9:04     ` Tian, Kevin
  2023-12-07 14:42       ` Jason Gunthorpe
  2023-12-11  7:49       ` Yi Liu
  0 siblings, 2 replies; 93+ messages in thread
From: Tian, Kevin @ 2023-12-07  9:04 UTC (permalink / raw)
  To: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 7, 2023 2:59 PM
> 
> On 2023/11/17 21:07, Yi Liu wrote:
> > @@ -613,4 +614,38 @@ struct iommu_hwpt_get_dirty_bitmap {
> >   #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
> >
> 	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
> >
> > +/**
> > + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
> > + * @size: sizeof(struct iommu_hwpt_invalidate)
> > + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
> > + * @reqs_uptr: User pointer to an array having @req_num of cache
> invalidation
> > + *             requests. The request entries in the array are of fixed width
> > + *             @req_len, and contain a user data structure for invalidation
> > + *             request specific to the given hardware page table.
> > + * @req_type: One of enum iommu_hwpt_data_type, defining the data
> type of all
> > + *            the entries in the invalidation request array. It should suit
> > + *            with the data_type passed per the allocation of the hwpt pointed
> > + *            by @hwpt_id.
> 
> @Jason and Kevin,
> 
> Here a check with you two. I had a conversation with Nic on the definition
> of req_type here. It was added to support potential multiple kinds of cache
> invalidation data types for a invalidating cache for a single hwpt type[1].
> But we defined it as reusing the hwpt_data_type. In this way, it is not
> able to support the potential case in[1]. is it? Shall we define a separate
> enum for invalidation data types? And how can we let user know the
> available invalidation data types for a hwpt type? Any idea?
> 
> [1] https://lore.kernel.org/linux-
> iommu/20231018163720.GA3952@nvidia.com/
> 

From that thread Jason mentioned to make the invalidation format
part of domain allocation. If that is the direction to go then there
won't be multiple invalidation formats per hwpt. The user should
create multiple hwpt's per invalidation format (though mixing
formats in one virtual platform is very unlikely)?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-07  9:04     ` Tian, Kevin
@ 2023-12-07 14:42       ` Jason Gunthorpe
  2023-12-11  7:53         ` Yi Liu
  2023-12-11  7:49       ` Yi Liu
  1 sibling, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-07 14:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Thu, Dec 07, 2023 at 09:04:00AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, December 7, 2023 2:59 PM
> > 
> > On 2023/11/17 21:07, Yi Liu wrote:
> > > @@ -613,4 +614,38 @@ struct iommu_hwpt_get_dirty_bitmap {
> > >   #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
> > >
> > 	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
> > >
> > > +/**
> > > + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
> > > + * @size: sizeof(struct iommu_hwpt_invalidate)
> > > + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
> > > + * @reqs_uptr: User pointer to an array having @req_num of cache
> > invalidation
> > > + *             requests. The request entries in the array are of fixed width
> > > + *             @req_len, and contain a user data structure for invalidation
> > > + *             request specific to the given hardware page table.
> > > + * @req_type: One of enum iommu_hwpt_data_type, defining the data
> > type of all
> > > + *            the entries in the invalidation request array. It should suit
> > > + *            with the data_type passed per the allocation of the hwpt pointed
> > > + *            by @hwpt_id.
> > 
> > @Jason and Kevin,
> > 
> > Here a check with you two. I had a conversation with Nic on the definition
> > of req_type here. It was added to support potential multiple kinds of cache
> > invalidation data types for a invalidating cache for a single hwpt type[1].
> > But we defined it as reusing the hwpt_data_type. In this way, it is not
> > able to support the potential case in[1]. is it? Shall we define a separate
> > enum for invalidation data types? And how can we let user know the
> > available invalidation data types for a hwpt type? Any idea?
> > 
> > [1] https://lore.kernel.org/linux-
> > iommu/20231018163720.GA3952@nvidia.com/
> > 
> 
> From that thread Jason mentioned to make the invalidation format
> part of domain allocation. If that is the direction to go then there
> won't be multiple invalidation formats per hwpt. The user should
> create multiple hwpt's per invalidation format (though mixing
> formats in one virtual platform is very unlikely)?

I think we could do either, but I have a vauge cleanness preference
that the enums are just different? That would follow a pretty typical
pattern for a structure tag to reflect the content of the structure.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-07 14:42       ` Jason Gunthorpe
@ 2023-12-11  7:53         ` Yi Liu
  2023-12-11 13:21           ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-11  7:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: joro@8bytes.org, alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/12/7 22:42, Jason Gunthorpe wrote:
> On Thu, Dec 07, 2023 at 09:04:00AM +0000, Tian, Kevin wrote:
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Thursday, December 7, 2023 2:59 PM
>>>
>>> On 2023/11/17 21:07, Yi Liu wrote:
>>>> @@ -613,4 +614,38 @@ struct iommu_hwpt_get_dirty_bitmap {
>>>>    #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
>>>>
>>> 	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
>>>>
>>>> +/**
>>>> + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
>>>> + * @size: sizeof(struct iommu_hwpt_invalidate)
>>>> + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
>>>> + * @reqs_uptr: User pointer to an array having @req_num of cache
>>> invalidation
>>>> + *             requests. The request entries in the array are of fixed width
>>>> + *             @req_len, and contain a user data structure for invalidation
>>>> + *             request specific to the given hardware page table.
>>>> + * @req_type: One of enum iommu_hwpt_data_type, defining the data
>>> type of all
>>>> + *            the entries in the invalidation request array. It should suit
>>>> + *            with the data_type passed per the allocation of the hwpt pointed
>>>> + *            by @hwpt_id.
>>>
>>> @Jason and Kevin,
>>>
>>> Here a check with you two. I had a conversation with Nic on the definition
>>> of req_type here. It was added to support potential multiple kinds of cache
>>> invalidation data types for a invalidating cache for a single hwpt type[1].
>>> But we defined it as reusing the hwpt_data_type. In this way, it is not
>>> able to support the potential case in[1]. is it? Shall we define a separate
>>> enum for invalidation data types? And how can we let user know the
>>> available invalidation data types for a hwpt type? Any idea?
>>>
>>> [1] https://lore.kernel.org/linux-
>>> iommu/20231018163720.GA3952@nvidia.com/
>>>
>>
>>  From that thread Jason mentioned to make the invalidation format
>> part of domain allocation. If that is the direction to go then there
>> won't be multiple invalidation formats per hwpt. The user should
>> create multiple hwpt's per invalidation format (though mixing
>> formats in one virtual platform is very unlikely)?
> 
> I think we could do either, but I have a vauge cleanness preference
> that the enums are just different? That would follow a pretty typical
> pattern for a structure tag to reflect the content of the structure.

Is this still following the direction to make the invalidation format
part of domain allocation?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-11  7:53         ` Yi Liu
@ 2023-12-11 13:21           ` Jason Gunthorpe
  2023-12-12 13:45             ` Liu, Yi L
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 13:21 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Mon, Dec 11, 2023 at 03:53:40PM +0800, Yi Liu wrote:

> > >  From that thread Jason mentioned to make the invalidation format
> > > part of domain allocation. If that is the direction to go then there
> > > won't be multiple invalidation formats per hwpt. The user should
> > > create multiple hwpt's per invalidation format (though mixing
> > > formats in one virtual platform is very unlikely)?
> > 
> > I think we could do either, but I have a vauge cleanness preference
> > that the enums are just different? That would follow a pretty typical
> > pattern for a structure tag to reflect the content of the structure.
> 
> Is this still following the direction to make the invalidation format
> part of domain allocation?

I think you should make it seperate

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-11 13:21           ` Jason Gunthorpe
@ 2023-12-12 13:45             ` Liu, Yi L
  2023-12-12 14:40               ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Liu, Yi L @ 2023-12-12 13:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, December 11, 2023 9:22 PM
> 
> On Mon, Dec 11, 2023 at 03:53:40PM +0800, Yi Liu wrote:
> 
> > > >  From that thread Jason mentioned to make the invalidation format
> > > > part of domain allocation. If that is the direction to go then there
> > > > won't be multiple invalidation formats per hwpt. The user should
> > > > create multiple hwpt's per invalidation format (though mixing
> > > > formats in one virtual platform is very unlikely)?
> > >
> > > I think we could do either, but I have a vauge cleanness preference
> > > that the enums are just different? That would follow a pretty typical
> > > pattern for a structure tag to reflect the content of the structure.
> >
> > Is this still following the direction to make the invalidation format
> > part of domain allocation?
> 
> I think you should make it seperate

Sure. I'll add a separate enum for cache invalidation format. Just to
see if it is good to pass down the invalidation format in the hwpt
alloc path? So iommu driver can check if the invalidation format
can be used for the hwpt to be allocated.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-12 13:45             ` Liu, Yi L
@ 2023-12-12 14:40               ` Jason Gunthorpe
  2023-12-13 13:47                 ` Liu, Yi L
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-12 14:40 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Tue, Dec 12, 2023 at 01:45:16PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, December 11, 2023 9:22 PM
> > 
> > On Mon, Dec 11, 2023 at 03:53:40PM +0800, Yi Liu wrote:
> > 
> > > > >  From that thread Jason mentioned to make the invalidation format
> > > > > part of domain allocation. If that is the direction to go then there
> > > > > won't be multiple invalidation formats per hwpt. The user should
> > > > > create multiple hwpt's per invalidation format (though mixing
> > > > > formats in one virtual platform is very unlikely)?
> > > >
> > > > I think we could do either, but I have a vauge cleanness preference
> > > > that the enums are just different? That would follow a pretty typical
> > > > pattern for a structure tag to reflect the content of the structure.
> > >
> > > Is this still following the direction to make the invalidation format
> > > part of domain allocation?
> > 
> > I think you should make it seperate
> 
> Sure. I'll add a separate enum for cache invalidation format. Just to
> see if it is good to pass down the invalidation format in the hwpt
> alloc path? So iommu driver can check if the invalidation format
> can be used for the hwpt to be allocated.

I would skip it in the invalidation. If necessary drivers can push a 0
length invalidation to do 'try and fail' discovery of supported types.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-12 14:40               ` Jason Gunthorpe
@ 2023-12-13 13:47                 ` Liu, Yi L
  2023-12-13 14:11                   ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Liu, Yi L @ 2023-12-13 13:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, December 12, 2023 10:40 PM
> 
> On Tue, Dec 12, 2023 at 01:45:16PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Monday, December 11, 2023 9:22 PM
> > >
> > > On Mon, Dec 11, 2023 at 03:53:40PM +0800, Yi Liu wrote:
> > >
> > > > > >  From that thread Jason mentioned to make the invalidation format
> > > > > > part of domain allocation. If that is the direction to go then there
> > > > > > won't be multiple invalidation formats per hwpt. The user should
> > > > > > create multiple hwpt's per invalidation format (though mixing
> > > > > > formats in one virtual platform is very unlikely)?
> > > > >
> > > > > I think we could do either, but I have a vauge cleanness preference
> > > > > that the enums are just different? That would follow a pretty typical
> > > > > pattern for a structure tag to reflect the content of the structure.
> > > >
> > > > Is this still following the direction to make the invalidation format
> > > > part of domain allocation?
> > >
> > > I think you should make it seperate
> >
> > Sure. I'll add a separate enum for cache invalidation format. Just to
> > see if it is good to pass down the invalidation format in the hwpt
> > alloc path? So iommu driver can check if the invalidation format
> > can be used for the hwpt to be allocated.
> 
> I would skip it in the invalidation. If necessary drivers can push a 0
> length invalidation to do 'try and fail' discovery of supported types.

I think you mean keep passing the req_type in cache invalidation path
instead of the way I proposed. For the 'try and fail' discovery, we should
allow a zero-length array, is it? If the req_type is supported by the iommu
driver, then return successful, otherwise fail the ioctl. Is it?

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-13 13:47                 ` Liu, Yi L
@ 2023-12-13 14:11                   ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-13 14:11 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	robin.murphy@arm.com, baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Wed, Dec 13, 2023 at 01:47:54PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, December 12, 2023 10:40 PM
> > 
> > On Tue, Dec 12, 2023 at 01:45:16PM +0000, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Monday, December 11, 2023 9:22 PM
> > > >
> > > > On Mon, Dec 11, 2023 at 03:53:40PM +0800, Yi Liu wrote:
> > > >
> > > > > > >  From that thread Jason mentioned to make the invalidation format
> > > > > > > part of domain allocation. If that is the direction to go then there
> > > > > > > won't be multiple invalidation formats per hwpt. The user should
> > > > > > > create multiple hwpt's per invalidation format (though mixing
> > > > > > > formats in one virtual platform is very unlikely)?
> > > > > >
> > > > > > I think we could do either, but I have a vauge cleanness preference
> > > > > > that the enums are just different? That would follow a pretty typical
> > > > > > pattern for a structure tag to reflect the content of the structure.
> > > > >
> > > > > Is this still following the direction to make the invalidation format
> > > > > part of domain allocation?
> > > >
> > > > I think you should make it seperate
> > >
> > > Sure. I'll add a separate enum for cache invalidation format. Just to
> > > see if it is good to pass down the invalidation format in the hwpt
> > > alloc path? So iommu driver can check if the invalidation format
> > > can be used for the hwpt to be allocated.
> > 
> > I would skip it in the invalidation. If necessary drivers can push a 0
> > length invalidation to do 'try and fail' discovery of supported types.
> 
> I think you mean keep passing the req_type in cache invalidation path
> instead of the way I proposed. For the 'try and fail' discovery, we should
> allow a zero-length array, is it? If the req_type is supported by the iommu
> driver, then return successful, otherwise fail the ioctl. Is it?

Yes

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-12-07  9:04     ` Tian, Kevin
  2023-12-07 14:42       ` Jason Gunthorpe
@ 2023-12-11  7:49       ` Yi Liu
  1 sibling, 0 replies; 93+ messages in thread
From: Yi Liu @ 2023-12-11  7:49 UTC (permalink / raw)
  To: Tian, Kevin, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On 2023/12/7 17:04, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, December 7, 2023 2:59 PM
>>
>> On 2023/11/17 21:07, Yi Liu wrote:
>>> @@ -613,4 +614,38 @@ struct iommu_hwpt_get_dirty_bitmap {
>>>    #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
>>>
>> 	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
>>>
>>> +/**
>>> + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
>>> + * @size: sizeof(struct iommu_hwpt_invalidate)
>>> + * @hwpt_id: HWPT ID of a nested HWPT for cache invalidation
>>> + * @reqs_uptr: User pointer to an array having @req_num of cache
>> invalidation
>>> + *             requests. The request entries in the array are of fixed width
>>> + *             @req_len, and contain a user data structure for invalidation
>>> + *             request specific to the given hardware page table.
>>> + * @req_type: One of enum iommu_hwpt_data_type, defining the data
>> type of all
>>> + *            the entries in the invalidation request array. It should suit
>>> + *            with the data_type passed per the allocation of the hwpt pointed
>>> + *            by @hwpt_id.
>>
>> @Jason and Kevin,
>>
>> Here a check with you two. I had a conversation with Nic on the definition
>> of req_type here. It was added to support potential multiple kinds of cache
>> invalidation data types for a invalidating cache for a single hwpt type[1].
>> But we defined it as reusing the hwpt_data_type. In this way, it is not
>> able to support the potential case in[1]. is it? Shall we define a separate
>> enum for invalidation data types? And how can we let user know the
>> available invalidation data types for a hwpt type? Any idea?
>>
>> [1] https://lore.kernel.org/linux-
>> iommu/20231018163720.GA3952@nvidia.com/
>>
> 
>  From that thread Jason mentioned to make the invalidation format
> part of domain allocation. If that is the direction to go then there
> won't be multiple invalidation formats per hwpt. The user should
> create multiple hwpt's per invalidation format (though mixing
> formats in one virtual platform is very unlikely)?

yes, following that, the hwpt data type would be a kind of a combination
of page table type and cache invalidation type. In this way, the req_type
may also be dropped as the hwpt allocation path can record the type and
reuse it in the cache invalidation path.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
  2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
  2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
@ 2023-11-17 13:07 ` Yi Liu
  2023-11-20  8:17   ` Tian, Kevin
  2024-01-08  8:37   ` Binbin Wu
  2023-11-17 13:07 ` [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support Yi Liu
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

From: Nicolin Chen <nicolinc@nvidia.com>

Wrap up the data type/pointer/num sanity and __iommu_copy_struct_from_user
call for iommu drivers to copy driver specific data at a specific location
in the struct iommu_user_data_array.

And expect it to be used in cache_invalidate_user ops for example.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/iommu.h | 54 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0c1ff7fe4fa1..832fdf193c57 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -342,6 +342,60 @@ static inline int __iommu_copy_struct_from_user(
 				      sizeof(*kdst),                      \
 				      offsetofend(typeof(*kdst), min_last))
 
+/**
+ * __iommu_copy_struct_from_user_array - Copy iommu driver specific user space
+ *                                       data from an iommu_user_data_array
+ * @dst_data: Pointer to an iommu driver specific user data that is defined in
+ *            include/uapi/linux/iommufd.h
+ * @src_array: Pointer to a struct iommu_user_data_array for a user space array
+ * @data_type: The data type of the @dst_data. Must match with @src_array.type
+ * @index: Index to offset the location in the array to copy user data from
+ * @data_len: Length of current user data structure, i.e. sizeof(struct _dst)
+ * @min_len: Initial length of user data structure for backward compatibility.
+ *           This should be offsetofend using the last member in the user data
+ *           struct that was initially added to include/uapi/linux/iommufd.h
+ */
+static inline int
+__iommu_copy_struct_from_user_array(void *dst_data,
+				    const struct iommu_user_data_array *src_array,
+				    unsigned int data_type, unsigned int index,
+				    size_t data_len, size_t min_len)
+{
+	struct iommu_user_data src_data;
+
+	if (src_array->type != data_type)
+		return -EINVAL;
+	if (WARN_ON(!src_array || index >= src_array->entry_num))
+		return -EINVAL;
+	if (!src_array->entry_num)
+		return -EINVAL;
+	src_data.uptr = src_array->uptr + src_array->entry_len * index;
+	src_data.len = src_array->entry_len;
+	src_data.type = src_array->type;
+
+	return __iommu_copy_struct_from_user(dst_data, &src_data, data_type,
+					     data_len, min_len);
+}
+
+/**
+ * iommu_copy_struct_from_user_array - Copy iommu driver specific user space
+ *                                     data from an iommu_user_data_array
+ * @kdst: Pointer to an iommu driver specific user data that is defined in
+ *        include/uapi/linux/iommufd.h
+ * @user_array: Pointer to a struct iommu_user_data_array for a user space array
+ * @data_type: The data type of the @kdst. Must match with @user_array->type
+ * @index: Index to offset the location in the array to copy user data from
+ * @min_last: The last memember of the data structure @kdst points in the
+ *            initial version.
+ * Return 0 for success, otherwise -error.
+ */
+#define iommu_copy_struct_from_user_array(kdst, user_array, data_type,   \
+					  index, min_last)               \
+	__iommu_copy_struct_from_user_array(kdst, user_array, data_type, \
+					    index, sizeof(*kdst),        \
+					    offsetofend(typeof(*kdst),   \
+							min_last))
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper
  2023-11-17 13:07 ` [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper Yi Liu
@ 2023-11-20  8:17   ` Tian, Kevin
  2023-11-20 17:25     ` Nicolin Chen
  2024-01-08  8:37   ` Binbin Wu
  1 sibling, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-11-20  8:17 UTC (permalink / raw)
  To: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
  Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, November 17, 2023 9:07 PM
> 
> +/**
> + * __iommu_copy_struct_from_user_array - Copy iommu driver specific

__iommu_copy_entry_from_user_array?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper
  2023-11-20  8:17   ` Tian, Kevin
@ 2023-11-20 17:25     ` Nicolin Chen
  2023-11-21  2:48       ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-11-20 17:25 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

On Mon, Nov 20, 2023 at 08:17:47AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, November 17, 2023 9:07 PM
> >
> > +/**
> > + * __iommu_copy_struct_from_user_array - Copy iommu driver specific
> 
> __iommu_copy_entry_from_user_array?

I think "struct" and "entry" are interchangeable. Yet, aligning
with the {__}iommu_copy_struct_from_user seems to be nicer?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper
  2023-11-20 17:25     ` Nicolin Chen
@ 2023-11-21  2:48       ` Tian, Kevin
  0 siblings, 0 replies; 93+ messages in thread
From: Tian, Kevin @ 2023-11-21  2:48 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, joro@8bytes.org, alex.williamson@redhat.com,
	jgg@nvidia.com, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	cohuck@redhat.com, eric.auger@redhat.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 21, 2023 1:25 AM
> 
> On Mon, Nov 20, 2023 at 08:17:47AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, November 17, 2023 9:07 PM
> > >
> > > +/**
> > > + * __iommu_copy_struct_from_user_array - Copy iommu driver specific
> >
> > __iommu_copy_entry_from_user_array?
> 
> I think "struct" and "entry" are interchangeable. Yet, aligning
> with the {__}iommu_copy_struct_from_user seems to be nicer?

ok

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper
  2023-11-17 13:07 ` [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper Yi Liu
  2023-11-20  8:17   ` Tian, Kevin
@ 2024-01-08  8:37   ` Binbin Wu
  1 sibling, 0 replies; 93+ messages in thread
From: Binbin Wu @ 2024-01-08  8:37 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu, joro,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao



On 11/17/2023 9:07 PM, Yi Liu wrote:
> +
> +/**
> + * iommu_copy_struct_from_user_array - Copy iommu driver specific user space
> + *                                     data from an iommu_user_data_array
> + * @kdst: Pointer to an iommu driver specific user data that is defined in
> + *        include/uapi/linux/iommufd.h
> + * @user_array: Pointer to a struct iommu_user_data_array for a user space array
> + * @data_type: The data type of the @kdst. Must match with @user_array->type
> + * @index: Index to offset the location in the array to copy user data from
> + * @min_last: The last memember of the data structure @kdst points in the

s/memember/member/

> + *            initial version.
> + * Return 0 for success, otherwise -error.
> + */
> +#define iommu_copy_struct_from_user_array(kdst, user_array, data_type,   \
> +					  index, min_last)               \
> +	__iommu_copy_struct_from_user_array(kdst, user_array, data_type, \
> +					    index, sizeof(*kdst),        \
> +					    offsetofend(typeof(*kdst),   \
> +							min_last))
> +
>   /**
>    * struct iommu_ops - iommu ops and capabilities
>    * @capable: check capability


^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
                   ` (2 preceding siblings ...)
  2023-11-17 13:07 ` [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper Yi Liu
@ 2023-11-17 13:07 ` Yi Liu
  2023-12-06 18:16   ` Jason Gunthorpe
  2023-11-17 13:07 ` [PATCH v6 5/6] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op Yi Liu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

From: Nicolin Chen <nicolinc@nvidia.com>

Add mock_domain_cache_invalidate_user() data structure to support user
space selftest program to cover user cache invalidation pathway.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd_test.h | 17 +++++++++++
 drivers/iommu/iommufd/selftest.c     | 43 ++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 7910fbe1962d..0000f58dcda3 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -148,4 +148,21 @@ struct iommu_hwpt_selftest {
 	__u32 iotlb;
 };
 
+/**
+ * struct iommu_hwpt_invalidate_selftest
+ *
+ * @flags: invalidate flags
+ * @iotlb_id: invalidate iotlb entry index
+ *
+ * If IOMMU_TEST_INVALIDATE_ALL is set in @flags, @iotlb_id will be ignored
+ */
+struct iommu_hwpt_invalidate_selftest {
+#define IOMMU_TEST_INVALIDATE_ALL	(1ULL << 0)
+	__u32 flags;
+	__u32 iotlb_id;
+};
+
+#define IOMMU_TEST_INVALIDATE_ERR_FETCH	0xdeadbeee
+#define IOMMU_TEST_INVALIDATE_ERR_REQ	0xdeadbeef
+
 #endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 5d93434003d8..2a9b970ca84e 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -473,9 +473,52 @@ static void mock_domain_free_nested(struct iommu_domain *domain)
 	kfree(mock_nested);
 }
 
+static int
+mock_domain_cache_invalidate_user(struct iommu_domain *domain,
+				  struct iommu_user_data_array *array,
+				  u32 *error_code)
+{
+	struct mock_iommu_domain_nested *mock_nested =
+		container_of(domain, struct mock_iommu_domain_nested, domain);
+	struct iommu_hwpt_invalidate_selftest inv;
+	int rc = 0;
+	int i, j;
+
+	if (domain->type != IOMMU_DOMAIN_NESTED)
+		return -EINVAL;
+
+	for (i = 0; i < array->entry_num; i++) {
+		rc = iommu_copy_struct_from_user_array(&inv, array,
+						       IOMMU_HWPT_DATA_SELFTEST,
+						       i, iotlb_id);
+		if (rc) {
+			*error_code = IOMMU_TEST_INVALIDATE_ERR_FETCH;
+			goto err;
+		}
+		/* Invalidate all mock iotlb entries and ignore iotlb_id */
+		if (inv.flags & IOMMU_TEST_INVALIDATE_ALL) {
+			for (j = 0; j < MOCK_NESTED_DOMAIN_IOTLB_NUM; j++)
+				mock_nested->iotlb[j] = 0;
+			continue;
+		}
+		/* Treat out-of-boundry iotlb_id as a request error */
+		if (inv.iotlb_id > MOCK_NESTED_DOMAIN_IOTLB_ID_MAX) {
+			*error_code = IOMMU_TEST_INVALIDATE_ERR_REQ;
+			rc = -EINVAL;
+			goto err;
+		}
+		mock_nested->iotlb[inv.iotlb_id] = 0;
+	}
+
+err:
+	array->entry_num = i;
+	return rc;
+}
+
 static struct iommu_domain_ops domain_nested_ops = {
 	.free = mock_domain_free_nested,
 	.attach_dev = mock_domain_nop_attach,
+	.cache_invalidate_user = mock_domain_cache_invalidate_user,
 };
 
 static inline struct iommufd_hw_pagetable *
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support
  2023-11-17 13:07 ` [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support Yi Liu
@ 2023-12-06 18:16   ` Jason Gunthorpe
  2023-12-11 11:21     ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-06 18:16 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On Fri, Nov 17, 2023 at 05:07:15AM -0800, Yi Liu wrote:
> +static int
> +mock_domain_cache_invalidate_user(struct iommu_domain *domain,
> +				  struct iommu_user_data_array *array,
> +				  u32 *error_code)
> +{
> +	struct mock_iommu_domain_nested *mock_nested =
> +		container_of(domain, struct mock_iommu_domain_nested, domain);
> +	struct iommu_hwpt_invalidate_selftest inv;
> +	int rc = 0;
> +	int i, j;
> +
> +	if (domain->type != IOMMU_DOMAIN_NESTED)
> +		return -EINVAL;

The core code already checked this, and it is only present on
domain_nested_ops so it is checked twice already..

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support
  2023-12-06 18:16   ` Jason Gunthorpe
@ 2023-12-11 11:21     ` Yi Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Yi Liu @ 2023-12-11 11:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On 2023/12/7 02:16, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:15AM -0800, Yi Liu wrote:
>> +static int
>> +mock_domain_cache_invalidate_user(struct iommu_domain *domain,
>> +				  struct iommu_user_data_array *array,
>> +				  u32 *error_code)
>> +{
>> +	struct mock_iommu_domain_nested *mock_nested =
>> +		container_of(domain, struct mock_iommu_domain_nested, domain);
>> +	struct iommu_hwpt_invalidate_selftest inv;
>> +	int rc = 0;
>> +	int i, j;
>> +
>> +	if (domain->type != IOMMU_DOMAIN_NESTED)
>> +		return -EINVAL;
> 
> The core code already checked this, and it is only present on
> domain_nested_ops so it is checked twice already..

would drop it. thanks.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH v6 5/6] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
                   ` (3 preceding siblings ...)
  2023-11-17 13:07 ` [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support Yi Liu
@ 2023-11-17 13:07 ` Yi Liu
  2023-11-17 13:07 ` [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

From: Nicolin Chen <nicolinc@nvidia.com>

Allow to test whether IOTLB has been invalidated or not.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd_test.h          |  5 ++++
 drivers/iommu/iommufd/selftest.c              | 26 +++++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       |  4 +++
 tools/testing/selftests/iommu/iommufd_utils.h | 24 +++++++++++++++++
 4 files changed, 59 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 0000f58dcda3..35d6207a96dd 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -21,6 +21,7 @@ enum {
 	IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
 	IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
 	IOMMU_TEST_OP_DIRTY,
+	IOMMU_TEST_OP_MD_CHECK_IOTLB,
 };
 
 enum {
@@ -121,6 +122,10 @@ struct iommu_test_cmd {
 			__aligned_u64 uptr;
 			__aligned_u64 out_nr_dirty;
 		} dirty;
+		struct {
+			__u32 id;
+			__u32 iotlb;
+		} check_iotlb;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 2a9b970ca84e..f2d4599f701b 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -836,6 +836,28 @@ static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
 	return 0;
 }
 
+static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd,
+				       u32 mockpt_id, unsigned int iotlb_id,
+				       u32 iotlb)
+{
+	struct mock_iommu_domain_nested *mock_nested;
+	struct iommufd_hw_pagetable *hwpt;
+	int rc = 0;
+
+	hwpt = get_md_pagetable_nested(ucmd, mockpt_id, &mock_nested);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	mock_nested = container_of(hwpt->domain,
+				   struct mock_iommu_domain_nested, domain);
+
+	if (iotlb_id > MOCK_NESTED_DOMAIN_IOTLB_ID_MAX ||
+	    mock_nested->iotlb[iotlb_id] != iotlb)
+		rc = -EINVAL;
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 struct selftest_access {
 	struct iommufd_access *access;
 	struct file *file;
@@ -1317,6 +1339,10 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 		return iommufd_test_md_check_refs(
 			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
 			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_MD_CHECK_IOTLB:
+		return iommufd_test_md_check_iotlb(ucmd, cmd->id,
+						   cmd->check_iotlb.id,
+						   cmd->check_iotlb.iotlb);
 	case IOMMU_TEST_OP_CREATE_ACCESS:
 		return iommufd_test_create_access(ucmd, cmd->id,
 						  cmd->create_access.flags);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 6ed328c863c4..c8763b880a16 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -330,6 +330,10 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
 					   &nested_hwpt_id[1],
 					   IOMMU_HWPT_DATA_SELFTEST, &data,
 					   sizeof(data));
+		test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[0],
+					      IOMMU_TEST_IOTLB_DEFAULT);
+		test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[1],
+					      IOMMU_TEST_IOTLB_DEFAULT);
 
 		/* Negative test: a nested hwpt on top of a nested hwpt */
 		test_err_hwpt_alloc_nested(EINVAL, self->device_id,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 050e9751321c..69561ba9e7aa 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -195,6 +195,30 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
 		     _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \
 					  hwpt_id, data_type, data, data_len))
 
+#define test_cmd_hwpt_check_iotlb(hwpt_id, iotlb_id, expected)                 \
+	({                                                                     \
+		struct iommu_test_cmd test_cmd = {                             \
+			.size = sizeof(test_cmd),                              \
+			.op = IOMMU_TEST_OP_MD_CHECK_IOTLB,                    \
+			.id = hwpt_id,                                         \
+			.check_iotlb = {                                       \
+				.id = iotlb_id,                                \
+				.iotlb = expected,                             \
+			},                                                     \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_IOTLB), \
+				&test_cmd));                                   \
+	})
+
+#define test_cmd_hwpt_check_iotlb_all(hwpt_id, expected)                       \
+	({                                                                     \
+		int i;                                                         \
+		for (i = 0; i < MOCK_NESTED_DOMAIN_IOTLB_NUM; i++)             \
+			test_cmd_hwpt_check_iotlb(hwpt_id, i, expected);       \
+	})
+
 static int _test_cmd_access_replace_ioas(int fd, __u32 access_id,
 					 unsigned int ioas_id)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
                   ` (4 preceding siblings ...)
  2023-11-17 13:07 ` [PATCH v6 5/6] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op Yi Liu
@ 2023-11-17 13:07 ` Yi Liu
  2023-12-06 18:19   ` Jason Gunthorpe
  2023-12-09  1:47 ` [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Jason Gunthorpe
       [not found] ` <CGME20231217215720eucas1p2a590aca62ce8eb5ba81df6bc8b1a785d@eucas1p2.samsung.com>
  7 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

From: Nicolin Chen <nicolinc@nvidia.com>

Add test cases for the IOMMU_HWPT_INVALIDATE ioctl and verify it by using
the new IOMMU_TEST_OP_MD_CHECK_IOTLB.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 tools/testing/selftests/iommu/iommufd.c       | 71 +++++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 39 ++++++++++
 2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index c8763b880a16..2781d5bc6309 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -116,6 +116,7 @@ TEST_F(iommufd, cmd_length)
 	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id);
 	TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved);
 	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved);
+	TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE, out_driver_error_code);
 	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id);
 	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES,
 		    out_iova_alignment);
@@ -271,7 +272,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
 	struct iommu_hwpt_selftest data = {
 		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
 	};
+	struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {0};
 	uint32_t nested_hwpt_id[2] = {};
+	uint32_t num_inv, driver_error;
 	uint32_t parent_hwpt_id = 0;
 	uint32_t parent_hwpt_id_not_work = 0;
 	uint32_t test_hwpt_id = 0;
@@ -344,6 +347,74 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
 		EXPECT_ERRNO(EBUSY,
 			     _test_ioctl_destroy(self->fd, parent_hwpt_id));
 
+		/* hwpt_invalidate only supports a user-managed hwpt (nested) */
+		num_inv = 1;
+		test_err_hwpt_invalidate(ENOENT, parent_hwpt_id, inv_reqs,
+					 IOMMU_HWPT_DATA_SELFTEST,
+					 sizeof(*inv_reqs), &num_inv, NULL);
+		/* Negative test: wrong data type */
+		num_inv = 1;
+		test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+					 IOMMU_HWPT_DATA_NONE,
+					 sizeof(*inv_reqs), &num_inv,
+					 &driver_error);
+		/* Negative test: structure size sanity */
+		num_inv = 1;
+		test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+					 IOMMU_HWPT_DATA_SELFTEST,
+					 sizeof(*inv_reqs) + 1, &num_inv,
+					 &driver_error);
+		assert(driver_error == IOMMU_TEST_INVALIDATE_ERR_FETCH);
+
+		num_inv = 1;
+		test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+					 IOMMU_HWPT_DATA_SELFTEST,
+					 1, &num_inv, &driver_error);
+		assert(driver_error == IOMMU_TEST_INVALIDATE_ERR_FETCH);
+
+		/* Invalidate the 1st iotlb entry but fail the 2nd request */
+		num_inv = 2;
+		inv_reqs[0].iotlb_id = 0;
+		inv_reqs[1].iotlb_id = MOCK_NESTED_DOMAIN_IOTLB_ID_MAX + 1;
+		test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+					 IOMMU_HWPT_DATA_SELFTEST,
+					 sizeof(*inv_reqs), &num_inv,
+					 &driver_error);
+		assert(num_inv == 1);
+		assert(driver_error == IOMMU_TEST_INVALIDATE_ERR_REQ);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 0, 0);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 1,
+					  IOMMU_TEST_IOTLB_DEFAULT);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 2,
+					  IOMMU_TEST_IOTLB_DEFAULT);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 3,
+					  IOMMU_TEST_IOTLB_DEFAULT);
+
+		/* Invalidate the 2nd iotlb entry and verify */
+		num_inv = 1;
+		inv_reqs[0].iotlb_id = 1;
+		test_cmd_hwpt_invalidate(nested_hwpt_id[0], inv_reqs,
+					 sizeof(*inv_reqs), &num_inv);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 0, 0);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 1, 0);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 2,
+					  IOMMU_TEST_IOTLB_DEFAULT);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 3,
+					  IOMMU_TEST_IOTLB_DEFAULT);
+		/* Invalidate the 3rd and 4th iotlb entries and verify */
+		num_inv = 2;
+		inv_reqs[0].iotlb_id = 2;
+		inv_reqs[1].iotlb_id = 3;
+		test_cmd_hwpt_invalidate(nested_hwpt_id[0], inv_reqs,
+					 sizeof(*inv_reqs), &num_inv);
+		test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[0], 0);
+		/* Invalidate all iotlb entries for nested_hwpt_id[1] and verify */
+		num_inv = 1;
+		inv_reqs[0].flags = IOMMU_TEST_INVALIDATE_ALL;
+		test_cmd_hwpt_invalidate(nested_hwpt_id[1], inv_reqs,
+					 sizeof(*inv_reqs), &num_inv);
+		test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[1], 0);
+
 		/* Attach device to nested_hwpt_id[0] that then will be busy */
 		test_cmd_mock_domain_replace(self->stdev_id, nested_hwpt_id[0]);
 		EXPECT_ERRNO(EBUSY,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 69561ba9e7aa..d4a169a2278e 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -219,6 +219,45 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
 			test_cmd_hwpt_check_iotlb(hwpt_id, i, expected);       \
 	})
 
+static int _test_cmd_hwpt_invalidate(int fd, __u32 hwpt_id, void *reqs,
+				     uint32_t req_type, uint32_t lreq,
+				     uint32_t *nreqs, uint32_t *driver_error)
+{
+	struct iommu_hwpt_invalidate cmd = {
+		.size = sizeof(cmd),
+		.hwpt_id = hwpt_id,
+		.req_type = req_type,
+		.reqs_uptr = (uint64_t)reqs,
+		.req_len = lreq,
+		.req_num = *nreqs,
+	};
+	int rc = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cmd);
+	*nreqs = cmd.req_num;
+	if (driver_error)
+		*driver_error = cmd.out_driver_error_code;
+	return rc;
+}
+
+#define test_cmd_hwpt_invalidate(hwpt_id, reqs, lreq, nreqs)                  \
+	({                                                                    \
+		uint32_t error, num = *nreqs;                                 \
+		ASSERT_EQ(0,                                                  \
+			  _test_cmd_hwpt_invalidate(self->fd, hwpt_id, reqs,  \
+						    IOMMU_HWPT_DATA_SELFTEST, \
+						    lreq, nreqs, &error));    \
+		assert(num == *nreqs);                                        \
+		assert(error == 0);                                           \
+	})
+#define test_err_hwpt_invalidate(_errno, hwpt_id, reqs, req_type, lreq,    \
+				 nreqs, driver_error)                      \
+	({                                                                 \
+		EXPECT_ERRNO(_errno,                                       \
+			     _test_cmd_hwpt_invalidate(self->fd, hwpt_id,  \
+						       reqs, req_type,     \
+						       lreq, nreqs,        \
+						       driver_error));     \
+	})
+
 static int _test_cmd_access_replace_ioas(int fd, __u32 access_id,
 					 unsigned int ioas_id)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
  2023-11-17 13:07 ` [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
@ 2023-12-06 18:19   ` Jason Gunthorpe
  2023-12-11 11:28     ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-06 18:19 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On Fri, Nov 17, 2023 at 05:07:17AM -0800, Yi Liu wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> Add test cases for the IOMMU_HWPT_INVALIDATE ioctl and verify it by using
> the new IOMMU_TEST_OP_MD_CHECK_IOTLB.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  tools/testing/selftests/iommu/iommufd.c       | 71 +++++++++++++++++++
>  tools/testing/selftests/iommu/iommufd_utils.h | 39 ++++++++++
>  2 files changed, 110 insertions(+)
> 
> diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
> index c8763b880a16..2781d5bc6309 100644
> --- a/tools/testing/selftests/iommu/iommufd.c
> +++ b/tools/testing/selftests/iommu/iommufd.c
> @@ -116,6 +116,7 @@ TEST_F(iommufd, cmd_length)
>  	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id);
>  	TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved);
>  	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved);
> +	TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE, out_driver_error_code);
>  	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id);
>  	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES,
>  		    out_iova_alignment);
> @@ -271,7 +272,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
>  	struct iommu_hwpt_selftest data = {
>  		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
>  	};
> +	struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {0};

Don't use {0}

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
  2023-12-06 18:19   ` Jason Gunthorpe
@ 2023-12-11 11:28     ` Yi Liu
  2023-12-11 13:06       ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-11 11:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On 2023/12/7 02:19, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:17AM -0800, Yi Liu wrote:
>> From: Nicolin Chen <nicolinc@nvidia.com>
>>
>> Add test cases for the IOMMU_HWPT_INVALIDATE ioctl and verify it by using
>> the new IOMMU_TEST_OP_MD_CHECK_IOTLB.
>>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   tools/testing/selftests/iommu/iommufd.c       | 71 +++++++++++++++++++
>>   tools/testing/selftests/iommu/iommufd_utils.h | 39 ++++++++++
>>   2 files changed, 110 insertions(+)
>>
>> diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
>> index c8763b880a16..2781d5bc6309 100644
>> --- a/tools/testing/selftests/iommu/iommufd.c
>> +++ b/tools/testing/selftests/iommu/iommufd.c
>> @@ -116,6 +116,7 @@ TEST_F(iommufd, cmd_length)
>>   	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id);
>>   	TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved);
>>   	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved);
>> +	TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE, out_driver_error_code);
>>   	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id);
>>   	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES,
>>   		    out_iova_alignment);
>> @@ -271,7 +272,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
>>   	struct iommu_hwpt_selftest data = {
>>   		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
>>   	};
>> +	struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {0};
> 
> Don't use {0}

sure. I'll use memset then.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
  2023-12-11 11:28     ` Yi Liu
@ 2023-12-11 13:06       ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 13:06 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On Mon, Dec 11, 2023 at 07:28:42PM +0800, Yi Liu wrote:
> On 2023/12/7 02:19, Jason Gunthorpe wrote:
> > On Fri, Nov 17, 2023 at 05:07:17AM -0800, Yi Liu wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > 
> > > Add test cases for the IOMMU_HWPT_INVALIDATE ioctl and verify it by using
> > > the new IOMMU_TEST_OP_MD_CHECK_IOTLB.
> > > 
> > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > ---
> > >   tools/testing/selftests/iommu/iommufd.c       | 71 +++++++++++++++++++
> > >   tools/testing/selftests/iommu/iommufd_utils.h | 39 ++++++++++
> > >   2 files changed, 110 insertions(+)
> > > 
> > > diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
> > > index c8763b880a16..2781d5bc6309 100644
> > > --- a/tools/testing/selftests/iommu/iommufd.c
> > > +++ b/tools/testing/selftests/iommu/iommufd.c
> > > @@ -116,6 +116,7 @@ TEST_F(iommufd, cmd_length)
> > >   	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id);
> > >   	TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved);
> > >   	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved);
> > > +	TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE, out_driver_error_code);
> > >   	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id);
> > >   	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES,
> > >   		    out_iova_alignment);
> > > @@ -271,7 +272,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
> > >   	struct iommu_hwpt_selftest data = {
> > >   		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
> > >   	};
> > > +	struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {0};
> > 
> > Don't use {0}
> 
> sure. I'll use memset then.

Just use = {}

Jason 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
                   ` (5 preceding siblings ...)
  2023-11-17 13:07 ` [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
@ 2023-12-09  1:47 ` Jason Gunthorpe
  2023-12-11  2:29   ` Tian, Kevin
                     ` (3 more replies)
       [not found] ` <CGME20231217215720eucas1p2a590aca62ce8eb5ba81df6bc8b1a785d@eucas1p2.samsung.com>
  7 siblings, 4 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-09  1:47 UTC (permalink / raw)
  To: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:

> Take Intel VT-d as an example, the stage-1 translation table is I/O page
> table. As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform the stage-1
> address translation. Along with it, modifications to present mappings in the
> guest I/O page table should be followed with an IOTLB invalidation.

I've been looking at what the three HW's need for invalidation, it is
a bit messy.. Here is my thinking. Please let me know if I got it right

What is the starting point of the guest memory walks:
 Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
 AMD: GCR3 table (a table of PASIDs) indexed by RID
 ARM: CD table (a table of PASIDs) indexed by RID

What key does the physical HW use for invalidation:
 Intel: Domain-ID (stored in hypervisor, per PASID), PASID
 AMD: Domain-ID (stored in hypervisor, per RID), PASID
 ARM: VMID (stored in hypervisor, per RID), ASID (stored in guest)

Why key does the VM use for invalidation:
 Intel: vDomain-ID (per PASID), PASID
 AMD: vDomain-ID (per RID), PASID
 ARM: ASID

What is in a Nested domain:
 Intel: A single IO page table refereed to by a PASID entry
        Each vDomain-ID,PASID allocates a unique nesting domain
 AMD: A GCR3 table pointer
      Nesting domains are created for every unique GCR3 pointer.
      vDomain-ID can possibly refer to multiple Nesting domains :(
 ARM: A CD table pointer
      Nesting domains are created for every unique CD table top pointer.

How does the hypervisor compute it's cache tag:
 Intel: Each time a nesting domain is attached to a (RID,PASID) the
        hypervisor driver will try to find a (DID,PASID) that is
	already used by this domain, or allocate a new DID.
 AMD: When creating the Nesting Domain the vDomain-ID should be passed
      in. The hypervisor driver will allocate a unique pDomain-ID for
      each vDomain-ID (hand wave). Several Nesting Domains will share
      the same p/vDomain-ID.
 ARM: The VMID is uniquely assigned to the Nesting Parent when it
      is allocated, in some configurations it has to be shared with
      KVM's VMID so allocating the Nesting Parent will require a KVM FD.

Will ATC be forwarded or synthesized:
 Intel: The (vDomain-ID,PASID) is a unique nesting domain so
        the hypervisor knows exactly which RIDs this nesting domain is
	linked to and can generate an ATC invalidation. Plan is to
	supress/discard the ATC invalidations from the VM and generate
	them in the hypervisor.
 AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
      tables. We know which maximal set of RIDs it represents, but not
      the actual set. I expect AMD will forward the ATC invalidation
      to avoid over invalidation.
 ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
      the ASID is contained in. ARM must forward the ATC invalidation
      from the guest.

What iommufd object should receive the IOTLB invalidation command list:
 Intel: The Nesting domain. The command list has to be broken up per
        (vDomain-ID,PASID) and that batch delivered to the single
	nesting domain. Kernel ignores vDomain-ID/PASID and just
	invalidates whatever the nesting domain is actually attached to
 AMD: Any Nesting Domain in the vDomain-ID group. The command list has
      to be broken up per (vDomain-ID). Kernel replaces
      vDomain-ID with pDomain-ID from the nesting domain and executes
      the invalidation.
 ARM: The Nesting Parent domain. Kernel forces the VMID from the
      Nesting Parent and executes the invalidation.

In all cases the VM issues an ATC invalidation with (vRID, PASID) as
the tag. The VMM must translate vRID -> dev_id -> pRID

For a pure SW flow the vRID can be mapped to the dev_id and the ATC
invalidation delivered to the device object (eg IOMMUFD_DEV_INVALIDATE)

Finally, we have the HW driven invalidation DMA queues that can be
directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
this case the HW is directly processing invalidation commands without
a hypervisor trap.

To make this work the iommu needs to be programmed with:
 AMD: A vDomain-ID -> pDomain-ID table
      A vRID -> pRID table
      This is all bound to some "virtual function"
 ARM: A vRID -> pRID table
      The vCMDQ is bound to a VM_ID, so to the Nesting Parent

For AMD, as above, I suggest the vDomain-ID be passed when creating
the nesting domain.

The AMD "virtual function".. It is probably best to create a new iommufd
object for this and it can be passed in to a few places

The vRID->pRID table should be some mostly common
IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
function ID and ARM will need to pass in the Nesting Parent ID.

For the HW path some function will create the command queue and
DMA/mmap it. Taking in the virtual function/nesting parent as the
handle to associate it with.

For a SW path:
 AMD: All invalidations can be delivered to the virtual function
      and the kernel can use the vDomainID/vRID tables to translate
      them fully
 ARM: All invalidations can be delivered to the nesting parent

In many ways the nesting parent/virtual function are very similar
things. Perhaps ARM should also create a virtual function object which
is just welded to the nesting parent for API consistency.

So.. In short.. Invalidation is a PITA. The idea is the same but
annoying little details interfere with actually having a compltely
common API here. IMHO the uAPI in this series is fine. It will support
Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
setup to allow that the target domain object can be any HWPT.

ARM will be able to do IOTLB invalidation using this API.

IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
(and just force the stream ID, userspace must direct the vRID to the
correct dev_id).

Then in yet another series we can tackle the entire "virtual function"
vRID/pRID translation stuff when the mmapable queue thing is
introduced.

Thus next steps:
 - Respin this and lets focus on Intel only (this will be tough for
   the holidays, but if it is available I will try)
 - Get an ARM patch that just does IOTLB invalidation and add it to my
   part 3
 - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
   implementation of it
 - Reorganize the AMD RFC broadly along these lines and lets see it
   freshened up in the next months as well. I would like to see the
   AMD support structured to implement the SW paths in first steps and
   later add in the "virtual function" acceleration stuff. The latter
   is going to be complex.
 
Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* RE: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-09  1:47 ` [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Jason Gunthorpe
@ 2023-12-11  2:29   ` Tian, Kevin
  2023-12-11 12:36     ` Yi Liu
  2023-12-11 12:35   ` Yi Liu
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2023-12-11  2:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L, Giani, Dhaval, Vasant Hegde,
	Suravee Suthikulpanit
  Cc: joro@8bytes.org, alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Duan, Zhenzhong, joao.m.martins@oracle.com, Zeng, Xin,
	Zhao, Yan Y

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, December 9, 2023 9:47 AM
> 
> What is in a Nested domain:
>  Intel: A single IO page table refereed to by a PASID entry
>         Each vDomain-ID,PASID allocates a unique nesting domain
>  AMD: A GCR3 table pointer
>       Nesting domains are created for every unique GCR3 pointer.
>       vDomain-ID can possibly refer to multiple Nesting domains :(
>  ARM: A CD table pointer
>       Nesting domains are created for every unique CD table top pointer.

this AMD/ARM difference is not very clear to me.

How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
lead to cache tag conflict when a same PASID entry in multiple GCR3 tables
points to different I/O page tables?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11  2:29   ` Tian, Kevin
@ 2023-12-11 12:36     ` Yi Liu
  2023-12-11 13:05       ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-11 12:36 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Giani, Dhaval, Vasant Hegde,
	Suravee Suthikulpanit
  Cc: joro@8bytes.org, alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Duan, Zhenzhong,
	joao.m.martins@oracle.com, Zeng, Xin, Zhao, Yan Y

On 2023/12/11 10:29, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Saturday, December 9, 2023 9:47 AM
>>
>> What is in a Nested domain:
>>   Intel: A single IO page table refereed to by a PASID entry
>>          Each vDomain-ID,PASID allocates a unique nesting domain
>>   AMD: A GCR3 table pointer
>>        Nesting domains are created for every unique GCR3 pointer.
>>        vDomain-ID can possibly refer to multiple Nesting domains :(
>>   ARM: A CD table pointer
>>        Nesting domains are created for every unique CD table top pointer.
> 
> this AMD/ARM difference is not very clear to me.
> 
> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> lead to cache tag conflict when a same PASID entry in multiple GCR3 tables
> points to different I/O page tables?

Perhaps due to only one DomainID in the DTE table indexed by BDF? Actually,
the vDomainID will not be used to tag cache, the host DomainId would be
used instead. @Jason?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 12:36     ` Yi Liu
@ 2023-12-11 13:05       ` Jason Gunthorpe
  2023-12-11 15:34         ` Suthikulpanit, Suravee
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 13:05 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit,
	joro@8bytes.org, alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Duan, Zhenzhong,
	joao.m.martins@oracle.com, Zeng, Xin, Zhao, Yan Y

On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
> On 2023/12/11 10:29, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, December 9, 2023 9:47 AM
> > > 
> > > What is in a Nested domain:
> > >   Intel: A single IO page table refereed to by a PASID entry
> > >          Each vDomain-ID,PASID allocates a unique nesting domain
> > >   AMD: A GCR3 table pointer
> > >        Nesting domains are created for every unique GCR3 pointer.
> > >        vDomain-ID can possibly refer to multiple Nesting domains :(
> > >   ARM: A CD table pointer
> > >        Nesting domains are created for every unique CD table top pointer.
> > 
> > this AMD/ARM difference is not very clear to me.
> > 
> > How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> > lead to cache tag conflict when a same PASID entry in multiple GCR3 tables
> > points to different I/O page tables?
> 
> Perhaps due to only one DomainID in the DTE table indexed by BDF? Actually,
> the vDomainID will not be used to tag cache, the host DomainId would be
> used instead. @Jason?

The DomainID comes from the DTE table which is indexed by the RID, and
the DTE entry points to the GCR3 table. So the VM certainly can setup
a DTE table with multiple entires having the same vDomainID but
pointing to different GCR3's. So the VMM has to do *something* with
this.

Most likely this is not a useful thing to do. However what should the
VMM do when it sees this? Block a random DTE or push the duplication
down to real HW would be my options. I'd probably try to do the latter
just on the basis of better emulation.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 13:05       ` Jason Gunthorpe
@ 2023-12-11 15:34         ` Suthikulpanit, Suravee
  2023-12-11 16:06           ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Suthikulpanit, Suravee @ 2023-12-11 15:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Yi Liu
  Cc: Tian, Kevin, Giani, Dhaval, Vasant Hegde, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Duan, Zhenzhong,
	joao.m.martins@oracle.com, Zeng, Xin, Zhao, Yan Y



On 12/11/2023 8:05 PM, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
>> On 2023/12/11 10:29, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Saturday, December 9, 2023 9:47 AM
>>>>
>>>> What is in a Nested domain:
>>>>    Intel: A single IO page table refereed to by a PASID entry
>>>>           Each vDomain-ID,PASID allocates a unique nesting domain
>>>>    AMD: A GCR3 table pointer
>>>>         Nesting domains are created for every unique GCR3 pointer.
>>>>         vDomain-ID can possibly refer to multiple Nesting domains :(
>>>>    ARM: A CD table pointer
>>>>         Nesting domains are created for every unique CD table top pointer.
>>>
>>> this AMD/ARM difference is not very clear to me.
>>>
>>> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
>>> lead to cache tag conflict when a same PASID entry in multiple GCR3 tables
>>> points to different I/O page tables?
>>
>> Perhaps due to only one DomainID in the DTE table indexed by BDF? Actually,
>> the vDomainID will not be used to tag cache, the host DomainId would be
>> used instead. @Jason?
> 
> The DomainID comes from the DTE table which is indexed by the RID, and
> the DTE entry points to the GCR3 table. So the VM certainly can setup
> a DTE table with multiple entires having the same vDomainID but
> pointing to different GCR3's. So the VMM has to do *something* with
> this.
> 
> Most likely this is not a useful thing to do. However what should the
> VMM do when it sees this? Block a random DTE or push the duplication
> down to real HW would be my options. I'd probably try to do the latter
> just on the basis of better emulation.
> 
> Jason

For AMD, the hardware uses host DomainID (hDomainId) and PASID to tag 
the IOMMU TLB.

The VM can setup vDomainID independently from device (RID) and 
hDomainID. The vDomainId->hDomainId mapping would be managed by the host 
IOMMU driver (since this is also needed by the HW when enabling the 
HW-vIOMMU support a.k.a virtual function).

Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
One issue with this is when we have nested translation where we could 
end up with multiple devices (RIDs) sharing same PASID and the same 
hDomainID.

For example:

   - Host view
     Device1 (RID 1) w/ hDomainId 1
     Device2 (RID 2) w/ hDomainId 1
   - Guest view
     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0

We should be able to workaround this by changing the way we assign 
hDomainId to be per-device for VFIO pass-through devices although 
sharing the same v1 (stage-2) page table. This would look like.

   - Host view
     Device1 (RID 1) w/ hDomainId 1
     Device2 (RID 2) w/ hDomainId 2
   - Guest view
     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0

This should avoid the IOMMU TLB conflict. However, the invalidation 
would need to be done for both DomainId 1 and 2 when updating the v1 
(stage-2) page table.

Thanks,
Suravee


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 15:34         ` Suthikulpanit, Suravee
@ 2023-12-11 16:06           ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 16:06 UTC (permalink / raw)
  To: Suthikulpanit, Suravee
  Cc: Yi Liu, Tian, Kevin, Giani, Dhaval, Vasant Hegde, joro@8bytes.org,
	alex.williamson@redhat.com, robin.murphy@arm.com,
	baolu.lu@linux.intel.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Duan, Zhenzhong,
	joao.m.martins@oracle.com, Zeng, Xin, Zhao, Yan Y

On Mon, Dec 11, 2023 at 10:34:09PM +0700, Suthikulpanit, Suravee wrote:

> Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
> One issue with this is when we have nested translation where we could end up
> with multiple devices (RIDs) sharing same PASID and the same hDomainID.

Which means you also create multiple GCR3 tables since those are
(soon) per-device and we end up with the situation I described for a
functional legitimate reason :( It is just wasting memory by
duplicating GCR3 tables.

> For example:
> 
>   - Host view
>     Device1 (RID 1) w/ hDomainId 1
>     Device2 (RID 2) w/ hDomainId 1

So.. Groups are another ugly mess that we may have to do something
more robust about.

The group infrastructure assumes that all devices in the group have
the same translation. This is not how the VM communicates, each member
of the group gets to have its own DTE and there are legitimate cases
where the DTEs will be different (even if just temporarily)

How to mesh this is not yet solved (most likely we need to allow group
members to have temporarily different translation). But in the long
run the group should definately not be providing the cache tag, the
driver has to be smarter than this.

I think we talked about this before.. For the AMD driver the v1 page
table should store the domainid in the iommu_domain and that value
should be used everywhere

For modes with a GCR3 table the best you can do is to de-duplicate the
GCR3 tables and assign identical GCR3 tables to identical domain ids.
Ie all devices in a group will eventually share GCR3 tables so they
can converge on the same domain id.

>   - Guest view
>     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
> 
> We should be able to workaround this by changing the way we assign hDomainId
> to be per-device for VFIO pass-through devices although sharing the same v1
> (stage-2) page table. This would look like.

As I said, this doesn't quite work since the VM could do other
things. The kernel must be aware of the vDomainID and must select an
appropriate hDomainID with that knowledge in mind, otherwise
multi-device-groups in guests are fully broken.

>   - Guest view
>     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
> 
> This should avoid the IOMMU TLB conflict. However, the invalidation would
> need to be done for both DomainId 1 and 2 when updating the v1 (stage-2)
> page table.

Which is the key problem, if the VM thinks it has only one vDomainID
the VMM can't split that into two hDomainID's and expect the viommu
acceleration will work - so we shouldn't try to make it work in SW
either, IMHO.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-09  1:47 ` [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Jason Gunthorpe
  2023-12-11  2:29   ` Tian, Kevin
@ 2023-12-11 12:35   ` Yi Liu
  2023-12-11 13:20     ` Jason Gunthorpe
  2023-12-11 17:35   ` Suthikulpanit, Suravee
  2023-12-11 21:27   ` Nicolin Chen
  3 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-11 12:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Giani, Dhaval, Vasant Hegde,
	Suravee Suthikulpanit
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu, iommu,
	linux-kernel, linux-kselftest, zhenzhong.duan, joao.m.martins,
	xin.zeng, yan.y.zhao

On 2023/12/9 09:47, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> 
>> Take Intel VT-d as an example, the stage-1 translation table is I/O page
>> table. As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform the stage-1
>> address translation. Along with it, modifications to present mappings in the
>> guest I/O page table should be followed with an IOTLB invalidation.
> 
> I've been looking at what the three HW's need for invalidation, it is
> a bit messy.. Here is my thinking. Please let me know if I got it right
> 
> What is the starting point of the guest memory walks:
>   Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
>   AMD: GCR3 table (a table of PASIDs) indexed by RID
>   ARM: CD table (a table of PASIDs) indexed by RID
> 
> What key does the physical HW use for invalidation:
>   Intel: Domain-ID (stored in hypervisor, per PASID), PASID
>   AMD: Domain-ID (stored in hypervisor, per RID), PASID
>   ARM: VMID (stored in hypervisor, per RID), ASID (stored in guest)
> 
> Why key does the VM use for invalidation:
>   Intel: vDomain-ID (per PASID), PASID
>   AMD: vDomain-ID (per RID), PASID
>   ARM: ASID
> 
> What is in a Nested domain:
>   Intel: A single IO page table refereed to by a PASID entry
>          Each vDomain-ID,PASID allocates a unique nesting domain
>   AMD: A GCR3 table pointer
>        Nesting domains are created for every unique GCR3 pointer.
>        vDomain-ID can possibly refer to multiple Nesting domains :(

Per section '2.2.6.3 Guest CR3 Table' in below spec, DTE has domainID,
and it points to a guest CR3 Table (it should be a set of guest CRs3)
which is indexed by PASID. So it looks like the PASID is per-DommainID.
HW should use domainID+PASID to tag the cache, otherwise there would be
cache conflict...

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

>   ARM: A CD table pointer
>        Nesting domains are created for every unique CD table top pointer.
> 
> How does the hypervisor compute it's cache tag:
>   Intel: Each time a nesting domain is attached to a (RID,PASID) the
>          hypervisor driver will try to find a (DID,PASID) that is
> 	already used by this domain, or allocate a new DID.
>   AMD: When creating the Nesting Domain the vDomain-ID should be passed
>        in. The hypervisor driver will allocate a unique pDomain-ID for
>        each vDomain-ID (hand wave). Several Nesting Domains will share
>        the same p/vDomain-ID.
>   ARM: The VMID is uniquely assigned to the Nesting Parent when it
>        is allocated, in some configurations it has to be shared with
>        KVM's VMID so allocating the Nesting Parent will require a KVM FD.
> 
> Will ATC be forwarded or synthesized:
>   Intel: The (vDomain-ID,PASID) is a unique nesting domain so
>          the hypervisor knows exactly which RIDs this nesting domain is
> 	linked to and can generate an ATC invalidation. Plan is to
> 	supress/discard the ATC invalidations from the VM and generate
> 	them in the hypervisor.
>   AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
>        tables. We know which maximal set of RIDs it represents, but not
>        the actual set. I expect AMD will forward the ATC invalidation
>        to avoid over invalidation.
>   ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
>        the ASID is contained in. ARM must forward the ATC invalidation
>        from the guest.
> 
> What iommufd object should receive the IOTLB invalidation command list:
>   Intel: The Nesting domain. The command list has to be broken up per
>          (vDomain-ID,PASID) and that batch delivered to the single
> 	nesting domain. Kernel ignores vDomain-ID/PASID and just
> 	invalidates whatever the nesting domain is actually attached to

this is what we are doing in current series[1]. is it? Guest needs to
issue invalidation request with vDomain-ID and PASID (if it is enabled),
and affected pages for sure. But in hypervisor side, use vDomainID and
PASID info to figure out the target HWPT, then invoke a cache invalidation
on the HWPT with the invalidation range is enough. Kernel can figure out
what device/pasid this HWPT has been attached and invalidate the caches.

[1] 
https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l.liu@intel.com/

>   AMD: Any Nesting Domain in the vDomain-ID group. The command list has
>        to be broken up per (vDomain-ID). Kernel replaces
>        vDomain-ID with pDomain-ID from the nesting domain and executes
>        the invalidation.
>   ARM: The Nesting Parent domain. Kernel forces the VMID from the
>        Nesting Parent and executes the invalidation.
> 
> In all cases the VM issues an ATC invalidation with (vRID, PASID) as
> the tag. The VMM must translate vRID -> dev_id -> pRID
> 
> For a pure SW flow the vRID can be mapped to the dev_id and the ATC
> invalidation delivered to the device object (eg IOMMUFD_DEV_INVALIDATE)
> 
> Finally, we have the HW driven invalidation DMA queues that can be
> directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
> this case the HW is directly processing invalidation commands without
> a hypervisor trap.
> 
> To make this work the iommu needs to be programmed with:
>   AMD: A vDomain-ID -> pDomain-ID table
>        A vRID -> pRID table
>        This is all bound to some "virtual function"
>   ARM: A vRID -> pRID table
>        The vCMDQ is bound to a VM_ID, so to the Nesting Parent
> 
> For AMD, as above, I suggest the vDomain-ID be passed when creating
> the nesting domain.
> 
> The AMD "virtual function".. It is probably best to create a new iommufd
> object for this and it can be passed in to a few places
> 
> The vRID->pRID table should be some mostly common
> IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> function ID and ARM will need to pass in the Nesting Parent ID.
> 
> For the HW path some function will create the command queue and
> DMA/mmap it. Taking in the virtual function/nesting parent as the
> handle to associate it with.
> 
> For a SW path:
>   AMD: All invalidations can be delivered to the virtual function
>        and the kernel can use the vDomainID/vRID tables to translate
>        them fully
>   ARM: All invalidations can be delivered to the nesting parent
> 
> In many ways the nesting parent/virtual function are very similar
> things. Perhaps ARM should also create a virtual function object which
> is just welded to the nesting parent for API consistency.
> 
> So.. In short.. Invalidation is a PITA. The idea is the same but
> annoying little details interfere with actually having a compltely
> common API here. IMHO the uAPI in this series is fine. It will support
> Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> setup to allow that the target domain object can be any HWPT.

This HWPT is still nested domain. Is it? But it can represent a guest I/O
page table (VT-d), guest CD table (ARM), guest CR3 Table (AMD, it seems to
be a set of guest CR3 table pointers). May ARM and AMD guys keep me honest
here.

The Intel guest I/O page table case may be the simplest as userspace only
needs to provide the HWPT ID and the affected ranges for invalidating. As
mentioned above, kernel will find out the attached device/pasid and
invalidating cache with the device/pasid. For ARM and AMD case, extra
information is needed. Am I getting you correct?

> 
> ARM will be able to do IOTLB invalidation using this API.
> 
> IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> (and just force the stream ID, userspace must direct the vRID to the
> correct dev_id).
> 
> Then in yet another series we can tackle the entire "virtual function"
> vRID/pRID translation stuff when the mmapable queue thing is
> introduced.
> 
> Thus next steps:
>   - Respin this and lets focus on Intel only (this will be tough for
>     the holidays, but if it is available I will try)

I've respinned the iommufd cache invalidation part with the change to
report error_code/error_data per invalidation entry. yet still busy on
making Intel VTd part to report the error_code. Besides, I didn't see
other respin needed for Intel VT-d invalidation. If I missed thing, please
do let me know.:)

>   - Get an ARM patch that just does IOTLB invalidation and add it to my
>     part 3
>   - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
>     implementation of it
>   - Reorganize the AMD RFC broadly along these lines and lets see it
>     freshened up in the next months as well. I would like to see the
>     AMD support structured to implement the SW paths in first steps and
>     later add in the "virtual function" acceleration stuff. The latter
>     is going to be complex.
>   
> Jason

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 12:35   ` Yi Liu
@ 2023-12-11 13:20     ` Jason Gunthorpe
  2023-12-11 20:11       ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 13:20 UTC (permalink / raw)
  To: Yi Liu
  Cc: Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu, iommu,
	linux-kernel, linux-kselftest, zhenzhong.duan, joao.m.martins,
	xin.zeng, yan.y.zhao

On Mon, Dec 11, 2023 at 08:35:09PM +0800, Yi Liu wrote:

> > What iommufd object should receive the IOTLB invalidation command list:
> >   Intel: The Nesting domain. The command list has to be broken up per
> >          (vDomain-ID,PASID) and that batch delivered to the single
> > 	nesting domain. Kernel ignores vDomain-ID/PASID and just
> > 	invalidates whatever the nesting domain is actually attached to
> 
> this is what we are doing in current series[1]. is it?

I think so

> > So.. In short.. Invalidation is a PITA. The idea is the same but
> > annoying little details interfere with actually having a compltely
> > common API here. IMHO the uAPI in this series is fine. It will support
> > Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> > setup to allow that the target domain object can be any HWPT.
> 
> This HWPT is still nested domain. Is it? But it can represent a guest I/O
> page table (VT-d), guest CD table (ARM), guest CR3 Table (AMD, it seems to
> be a set of guest CR3 table pointers). May ARM and AMD guys keep me honest
> here.

I was thinking ARM would not want to use a nested domain because
really the invalidation is global to the entire nesting parent.

But, there is an issue with that - the nesting parent could be
attached to multiple iommu instances but we only want to invalidate a
single instance. 

However, specifying the nesting domain instead, and restricting the
nesting domain to be single-instance would provide the kernel enough
information without providing weird new APIs. So maybe it is best to
just make everything use the nesting domain even if it is somewhat
semantically weird on arm.

> The Intel guest I/O page table case may be the simplest as userspace only
> needs to provide the HWPT ID and the affected ranges for invalidating. As
> mentioned above, kernel will find out the attached device/pasid and
> invalidating cache with the device/pasid. For ARM and AMD case, extra
> information is needed. Am I getting you correct?

Yes
 
> > Thus next steps:
> >   - Respin this and lets focus on Intel only (this will be tough for
> >     the holidays, but if it is available I will try)
> 
> I've respinned the iommufd cache invalidation part with the change to
> report error_code/error_data per invalidation entry. yet still busy on
> making Intel VTd part to report the error_code. Besides, I didn't see
> other respin needed for Intel VT-d invalidation. If I missed thing, please
> do let me know.:)

Lets see

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 13:20     ` Jason Gunthorpe
@ 2023-12-11 20:11       ` Nicolin Chen
  2023-12-11 21:48         ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-11 20:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Mon, Dec 11, 2023 at 09:20:41AM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 08:35:09PM +0800, Yi Liu wrote:
> > > So.. In short.. Invalidation is a PITA. The idea is the same but
> > > annoying little details interfere with actually having a compltely
> > > common API here. IMHO the uAPI in this series is fine. It will support
> > > Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> > > setup to allow that the target domain object can be any HWPT.
> > 
> > This HWPT is still nested domain. Is it? But it can represent a guest I/O
> > page table (VT-d), guest CD table (ARM), guest CR3 Table (AMD, it seems to
> > be a set of guest CR3 table pointers). May ARM and AMD guys keep me honest
> > here.
> 
> I was thinking ARM would not want to use a nested domain because
> really the invalidation is global to the entire nesting parent.
> 
> But, there is an issue with that - the nesting parent could be
> attached to multiple iommu instances but we only want to invalidate a
> single instance. 

I am still not sure about attaching an S2 domain to multiple
SMMUs. An S2 domain is created per SMMU, and we have such a
rejection in arm_smmu_attach_dev():
	} else if (smmu_domain->smmu != smmu)
		ret = -EINVAL;

I understand that it would be probably ideal to share the S2
iopt among the SMMUs. But in the driver the objects (domain)
holding a shared S2 iopt must be different to allocate their
own VMIDs, right?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 20:11       ` Nicolin Chen
@ 2023-12-11 21:48         ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 21:48 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Mon, Dec 11, 2023 at 12:11:25PM -0800, Nicolin Chen wrote:
> On Mon, Dec 11, 2023 at 09:20:41AM -0400, Jason Gunthorpe wrote:
> > On Mon, Dec 11, 2023 at 08:35:09PM +0800, Yi Liu wrote:
> > > > So.. In short.. Invalidation is a PITA. The idea is the same but
> > > > annoying little details interfere with actually having a compltely
> > > > common API here. IMHO the uAPI in this series is fine. It will support
> > > > Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> > > > setup to allow that the target domain object can be any HWPT.
> > > 
> > > This HWPT is still nested domain. Is it? But it can represent a guest I/O
> > > page table (VT-d), guest CD table (ARM), guest CR3 Table (AMD, it seems to
> > > be a set of guest CR3 table pointers). May ARM and AMD guys keep me honest
> > > here.
> > 
> > I was thinking ARM would not want to use a nested domain because
> > really the invalidation is global to the entire nesting parent.
> > 
> > But, there is an issue with that - the nesting parent could be
> > attached to multiple iommu instances but we only want to invalidate a
> > single instance. 
> 
> I am still not sure about attaching an S2 domain to multiple
> SMMUs. An S2 domain is created per SMMU, and we have such a
> rejection in arm_smmu_attach_dev():
> 	} else if (smmu_domain->smmu != smmu)
> 		ret = -EINVAL;

I intend to remove that eventually

> I understand that it would be probably ideal to share the S2
> iopt among the SMMUs. But in the driver the objects (domain)
> holding a shared S2 iopt must be different to allocate their
> own VMIDs, right?

No, the vmid will be moved into the struct arm_smmu_master_domain

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-09  1:47 ` [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Jason Gunthorpe
  2023-12-11  2:29   ` Tian, Kevin
  2023-12-11 12:35   ` Yi Liu
@ 2023-12-11 17:35   ` Suthikulpanit, Suravee
  2023-12-11 17:45     ` Jason Gunthorpe
  2023-12-11 21:27   ` Nicolin Chen
  3 siblings, 1 reply; 93+ messages in thread
From: Suthikulpanit, Suravee @ 2023-12-11 17:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Yi Liu, Giani, Dhaval, Vasant Hegde
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu, iommu,
	linux-kernel, linux-kselftest, zhenzhong.duan, joao.m.martins,
	xin.zeng, yan.y.zhao



On 12/9/2023 8:47 AM, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> 
>> Take Intel VT-d as an example, the stage-1 translation table is I/O page
>> table. As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform the stage-1
>> address translation. Along with it, modifications to present mappings in the
>> guest I/O page table should be followed with an IOTLB invalidation.
> 
> I've been looking at what the three HW's need for invalidation, it is
> a bit messy.. Here is my thinking. Please let me know if I got it right
> 
> What is the starting point of the guest memory walks:
>   Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
>   AMD: GCR3 table (a table of PASIDs) indexed by RID

GCR3 table is indexed by PASID.
Device Table (DTE) is indexted by DeviceID (RID)

> ...
> Will ATC be forwarded or synthesized:
>   Intel: The (vDomain-ID,PASID) is a unique nesting domain so
>          the hypervisor knows exactly which RIDs this nesting domain is
> 	linked to and can generate an ATC invalidation. Plan is to
> 	supress/discard the ATC invalidations from the VM and generate
> 	them in the hypervisor.
>   AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
>        tables. We know which maximal set of RIDs it represents, but not
>        the actual set. I expect AMD will forward the ATC invalidation
>        to avoid over invalidation.

Not sure I understand your description here.

For the AMD IOMMU INVALIDE_IOMMU_PAGES (i.e. invalidate the IOMMU TLB), 
the hypervisor needs to map gDomainId->hDomainId and issue the command 
on behalf of the VM along with the PASID and GVA (or GVA range) provided 
by the guest.

For the AMD IOMMU INVALIDE_IOTLB_PAGES (i.e. invalidate the ATC on the 
device), the hypervisor needs to map gDeviceId->hDeviceId and issue the 
command on behalf of the VM along with the PASID and GVA (or GVA range) 
provided by the guest.

>   ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
>        the ASID is contained in. ARM must forward the ATC invalidation
>        from the guest.
> 
> What iommufd object should receive the IOTLB invalidation command list:
>   Intel: The Nesting domain. The command list has to be broken up per
>          (vDomain-ID,PASID) and that batch delivered to the single
> 	nesting domain. Kernel ignores vDomain-ID/PASID and just
> 	invalidates whatever the nesting domain is actually attached to
>   AMD: Any Nesting Domain in the vDomain-ID group. The command list has
>        to be broken up per (vDomain-ID). Kernel replaces
>        vDomain-ID with pDomain-ID from the nesting domain and executes
>        the invalidation.
>   ARM: The Nesting Parent domain. Kernel forces the VMID from the
>        Nesting Parent and executes the invalidation.
> 
> In all cases the VM issues an ATC invalidation with (vRID, PASID) as
> the tag. The VMM must translate vRID -> dev_id -> pRID
> 
> For a pure SW flow the vRID can be mapped to the dev_id and the ATC
> invalidation delivered to the device object (eg IOMMUFD_DEV_INVALIDATE)
> 
> Finally, we have the HW driven invalidation DMA queues that can be
> directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
> this case the HW is directly processing invalidation commands without
> a hypervisor trap.
> 
> To make this work the iommu needs to be programmed with:
>   AMD: A vDomain-ID -> pDomain-ID table
>        A vRID -> pRID table
>        This is all bound to some "virtual function"

By "virtual function", I assume you are referring to the AMD vIOMMU 
instance in the guest?

>   ARM: A vRID -> pRID table
>        The vCMDQ is bound to a VM_ID, so to the Nesting Parent
> 
> For AMD, as above, I suggest the vDomain-ID be passed when creating
> the nesting domain
Sure, we can do this part.

> The AMD "virtual function".. It is probably best to create a new iommufd
> object for this and it can be passed in to a few places

Something like IOMMUFD_OBJ_VIOMMU? Then operation would include 
something like:
   * Init
   * Destroy
   * ...

> The vRID->pRID table should be some mostly common
> IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> function ID and ARM will need to pass in the Nesting Parent ID.

Ok.

> ...
> Thus next steps:
>   - Respin this and lets focus on Intel only (this will be tough for
>     the holidays, but if it is available I will try)
>   - Get an ARM patch that just does IOTLB invalidation and add it to my
>     part 3
>   - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
>     implementation of it
>   - Reorganize the AMD RFC broadly along these lines and lets see it
>     freshened up in the next months as well. I would like to see the
>     AMD support structured to implement the SW paths in first steps and
>     later add in the "virtual function" acceleration stuff. The latter
>     is going to be complex.

Working on refining the part 1 to add HW info reporting and nested 
translation (minus the invalidation stuff). Should be sending out soon.

Suravee

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 17:35   ` Suthikulpanit, Suravee
@ 2023-12-11 17:45     ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 17:45 UTC (permalink / raw)
  To: Suthikulpanit, Suravee
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, joro, alex.williamson,
	kevin.tian, robin.murphy, baolu.lu, cohuck, eric.auger, nicolinc,
	kvm, mjrosato, chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Tue, Dec 12, 2023 at 12:35:26AM +0700, Suthikulpanit, Suravee wrote:
> 
> 
> On 12/9/2023 8:47 AM, Jason Gunthorpe wrote:
> > On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> > 
> > > Take Intel VT-d as an example, the stage-1 translation table is I/O page
> > > table. As the below diagram shows, guest I/O page table pointer in GPA
> > > (guest physical address) is passed to host and be used to perform the stage-1
> > > address translation. Along with it, modifications to present mappings in the
> > > guest I/O page table should be followed with an IOTLB invalidation.
> > 
> > I've been looking at what the three HW's need for invalidation, it is
> > a bit messy.. Here is my thinking. Please let me know if I got it right
> > 
> > What is the starting point of the guest memory walks:
> >   Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
> >   AMD: GCR3 table (a table of PASIDs) indexed by RID
> 
> GCR3 table is indexed by PASID.
> Device Table (DTE) is indexted by DeviceID (RID)

Yes, this is what I was trying to say


> > Will ATC be forwarded or synthesized:
> >   Intel: The (vDomain-ID,PASID) is a unique nesting domain so
> >          the hypervisor knows exactly which RIDs this nesting domain is
> > 	linked to and can generate an ATC invalidation. Plan is to
> > 	supress/discard the ATC invalidations from the VM and generate
> > 	them in the hypervisor.
> >   AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
> >        tables. We know which maximal set of RIDs it represents, but not
> >        the actual set. I expect AMD will forward the ATC invalidation
> >        to avoid over invalidation.
> 
> Not sure I understand your description here.
> 
> For the AMD IOMMU INVALIDE_IOMMU_PAGES (i.e. invalidate the IOMMU TLB), the
> hypervisor needs to map gDomainId->hDomainId and issue the command on behalf
> of the VM along with the PASID and GVA (or GVA range) provided by the guest.

Yes, that is the "forwarding" approach. Contrast this to the Intel
approach where the ATC is synthesized by the kernel emulating the
INVALIDE_IOMMU_PAGES

> > To make this work the iommu needs to be programmed with:
> >   AMD: A vDomain-ID -> pDomain-ID table
> >        A vRID -> pRID table
> >        This is all bound to some "virtual function"
> 
> By "virtual function", I assume you are referring to the AMD vIOMMU instance
> in the guest?

Yes, but it is not in the guest, it has to be some concrete iommufd
object.

> Something like IOMMUFD_OBJ_VIOMMU? Then operation would include something
> like:
>   * Init
>   * Destroy
>   * ...

Yes, something like that. It needs to be able to work for ARM vCMDQ
stuff too. I don't know what the name should be. Maybe viommu is OK
for now.

- Alloc viommu (against a single iommu instance?)
- Assign a virtual ID to an iommufd device within the same instance
- Setup a submission and completion queue in userspace memory
- mmap the doorbell page (both need this?)
- Route back completion interrupts via eventfd

When you get here you and Nicolin should work out something along
those lines that works for both

But I'd like to keep things in steps, so if we can get info, nesting
parent, nesting domain and SW IOTLB and ATC invalidation as the first
(two?) steps that would be great

> > Thus next steps:
> >   - Respin this and lets focus on Intel only (this will be tough for
> >     the holidays, but if it is available I will try)
> >   - Get an ARM patch that just does IOTLB invalidation and add it to my
> >     part 3
> >   - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
> >     implementation of it
> >   - Reorganize the AMD RFC broadly along these lines and lets see it
> >     freshened up in the next months as well. I would like to see the
> >     AMD support structured to implement the SW paths in first steps and
> >     later add in the "virtual function" acceleration stuff. The latter
> >     is going to be complex.
> 
> Working on refining the part 1 to add HW info reporting and nested
> translation (minus the invalidation stuff). Should be sending out soon.

Nice!

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-09  1:47 ` [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2023-12-11 17:35   ` Suthikulpanit, Suravee
@ 2023-12-11 21:27   ` Nicolin Chen
  2023-12-11 21:57     ` Jason Gunthorpe
  3 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-11 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Fri, Dec 08, 2023 at 09:47:26PM -0400, Jason Gunthorpe wrote:
> What is in a Nested domain:
>  ARM: A CD table pointer
>       Nesting domains are created for every unique CD table top pointer.

I think we basically implemented in a way of syncing STE, i,e,
vSTE.Config must be "S1 Translate" besides a CD table pointer,
and a nested domain is freed when vSTE.Config=BYPASS even if a
CD table pointer is present, right?

> To make this work the iommu needs to be programmed with:
>  AMD: A vDomain-ID -> pDomain-ID table
>       A vRID -> pRID table
>       This is all bound to some "virtual function"
>  ARM: A vRID -> pRID table
>       The vCMDQ is bound to a VM_ID, so to the Nesting Parent

VCMDQ also has something called "virtual interface" that holds
a VMID and a list of CMDQ queues, which might be a bit similar
to AMD's "virtual function".

> For AMD, as above, I suggest the vDomain-ID be passed when creating
> the nesting domain.
> 
> The AMD "virtual function".. It is probably best to create a new iommufd
> object for this and it can be passed in to a few places
> 
> The vRID->pRID table should be some mostly common
> IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> function ID and ARM will need to pass in the Nesting Parent ID.

It sounds like our previous IOMMUFD_SET/UNSET_IDEV_DATA. I'm
wondering if we need to make it exclusive to the ID assigning?
Maybe set_idev_data could be reused for other potential cases?

If we do implement an IOMMUFD_DEV_ASSIGN_VIRTUAL_ID, do we need
an IOMMUFD_DEV_RESIGN_VIRTUAL_ID? (or better word than resign).

Could the structure just look like this?
struct iommu_dev_assign_virtual_id {
       __u32 size;
       __u32 dev_id;
       __u32 id_type;
       __u32 id;
};

> In many ways the nesting parent/virtual function are very similar
> things. Perhaps ARM should also create a virtual function object which
> is just welded to the nesting parent for API consistency.

A virtual function that holds an S2 domain/iopt + a VMID? If
this is for VCMDQ, the VMCDQ extension driver has that kinda
object holding an S2 domain: I implemented as the extension
function at the end of arm_smmu_finalise_s2() previously.

> So.. In short.. Invalidation is a PITA. The idea is the same but
> annoying little details interfere with actually having a compltely
> common API here. IMHO the uAPI in this series is fine. It will support
> Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> setup to allow that the target domain object can be any HWPT.
> 
> ARM will be able to do IOTLB invalidation using this API.
> 
> IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> (and just force the stream ID, userspace must direct the vRID to the
> correct dev_id).

SMMU's CD invalidations could fall into this category too.

> Then in yet another series we can tackle the entire "virtual function"
> vRID/pRID translation stuff when the mmapable queue thing is
> introduced.

VCMDQ is also a mmapable queue. I feel that there could be
more common stuff between "virtual function" and "virtual
interface", I'll need to take a look at AMD's stuff though.

I previously drafted something to test it out with iommufd.
Basically it needs the pairing of vRID/pRID in attach_dev()
and another ioctl to mmap/config user queue(s):
+struct iommu_hwpt_cache_config_tegra241_vcmdq {
+       __u32 vcmdq_id;			// queue id
+       __u32 vcmdq_log2size;		// queue size
+       __aligned_u64 vcmdq_base;	// queue guest PA
+};

> Thus next steps:
>  - Get an ARM patch that just does IOTLB invalidation and add it to my
>    part 3
>  - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
>    implementation of it.

I will work on these two, presumably including the new
IOMMUFD_DEV_ASSIGN_VIRTUAL_ID or so.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 21:27   ` Nicolin Chen
@ 2023-12-11 21:57     ` Jason Gunthorpe
  2023-12-12  7:30       ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-11 21:57 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Mon, Dec 11, 2023 at 01:27:49PM -0800, Nicolin Chen wrote:
> On Fri, Dec 08, 2023 at 09:47:26PM -0400, Jason Gunthorpe wrote:
> > What is in a Nested domain:
> >  ARM: A CD table pointer
> >       Nesting domains are created for every unique CD table top pointer.
> 
> I think we basically implemented in a way of syncing STE, i,e,
> vSTE.Config must be "S1 Translate" besides a CD table pointer,
> and a nested domain is freed when vSTE.Config=BYPASS even if a
> CD table pointer is present, right?

Yes, but you can also de-duplicate the nested domains based on the CD
table pointer. It is not as critical for ARM as others, but may
still be worth doing.

> > To make this work the iommu needs to be programmed with:
> >  AMD: A vDomain-ID -> pDomain-ID table
> >       A vRID -> pRID table
> >       This is all bound to some "virtual function"
> >  ARM: A vRID -> pRID table
> >       The vCMDQ is bound to a VM_ID, so to the Nesting Parent
> 
> VCMDQ also has something called "virtual interface" that holds
> a VMID and a list of CMDQ queues, which might be a bit similar
> to AMD's "virtual function".

Yeah, there must be some kind of logical grouping of HW objects to
build that kind of stuff.

> > The vRID->pRID table should be some mostly common
> > IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> > function ID and ARM will need to pass in the Nesting Parent ID.
> 
> It sounds like our previous IOMMUFD_SET/UNSET_IDEV_DATA. I'm
> wondering if we need to make it exclusive to the ID assigning?
> Maybe set_idev_data could be reused for other potential cases?

No, it should be an API only for the ID
 
> If we do implement an IOMMUFD_DEV_ASSIGN_VIRTUAL_ID, do we need
> an IOMMUFD_DEV_RESIGN_VIRTUAL_ID? (or better word than resign).

I don't think so.. The vRID is basically fixed, if it needs to be
changed then the device can be destroyed (or assign can just change it)

> Could the structure just look like this?
> struct iommu_dev_assign_virtual_id {
>        __u32 size;
>        __u32 dev_id;
>        __u32 id_type;
>        __u32 id;
> };

It needs to take in the viommu_id also, and I'd make the id 64 bits
just for good luck.

> > In many ways the nesting parent/virtual function are very similar
> > things. Perhaps ARM should also create a virtual function object which
> > is just welded to the nesting parent for API consistency.
> 
> A virtual function that holds an S2 domain/iopt + a VMID? If
> this is for VCMDQ, the VMCDQ extension driver has that kinda
> object holding an S2 domain: I implemented as the extension
> function at the end of arm_smmu_finalise_s2() previously.

Not so much hold a S2, but that the VMID would be forced to be shared
amung them somehow.

> > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > (and just force the stream ID, userspace must direct the vRID to the
> > correct dev_id).
> 
> SMMU's CD invalidations could fall into this category too.

Yes, I forgot to look closely at the CD/GCR3 table invalidations :(
I actually can't tell how AMD invalidates any GCR3 cache, maybe
INVALIDATE_DEVTAB_ENTRY?

> > Then in yet another series we can tackle the entire "virtual function"
> > vRID/pRID translation stuff when the mmapable queue thing is
> > introduced.
> 
> VCMDQ is also a mmapable queue. I feel that there could be
> more common stuff between "virtual function" and "virtual
> interface", I'll need to take a look at AMD's stuff though.

I'm not thinking of two things right now at least..

> I previously drafted something to test it out with iommufd.
> Basically it needs the pairing of vRID/pRID in attach_dev()
> and another ioctl to mmap/config user queue(s):
> +struct iommu_hwpt_cache_config_tegra241_vcmdq {
> +       __u32 vcmdq_id;			// queue id
> +       __u32 vcmdq_log2size;		// queue size
> +       __aligned_u64 vcmdq_base;	// queue guest PA
> +};

vRID/pRID pairing should come from IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. When
a HWPT is allocated it would be connected to the viommu_id and then it
would all be bundled together in the HW somehow

From there you can ask the viommu_id to setup a queue.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-11 21:57     ` Jason Gunthorpe
@ 2023-12-12  7:30       ` Nicolin Chen
  2023-12-12 14:44         ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-12  7:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Mon, Dec 11, 2023 at 05:57:38PM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 01:27:49PM -0800, Nicolin Chen wrote:
> > On Fri, Dec 08, 2023 at 09:47:26PM -0400, Jason Gunthorpe wrote:
> > > What is in a Nested domain:
> > >  ARM: A CD table pointer
> > >       Nesting domains are created for every unique CD table top pointer.
> > 
> > I think we basically implemented in a way of syncing STE, i,e,
> > vSTE.Config must be "S1 Translate" besides a CD table pointer,
> > and a nested domain is freed when vSTE.Config=BYPASS even if a
> > CD table pointer is present, right?
> 
> Yes, but you can also de-duplicate the nested domains based on the CD
> table pointer. It is not as critical for ARM as others, but may
> still be worth doing.

Ah right! Should do that.

> > If we do implement an IOMMUFD_DEV_ASSIGN_VIRTUAL_ID, do we need
> > an IOMMUFD_DEV_RESIGN_VIRTUAL_ID? (or better word than resign).
> 
> I don't think so.. The vRID is basically fixed, if it needs to be
> changed then the device can be destroyed (or assign can just change it)

OK. Will insert the cleanup piece to the unbind()/destroy().

> > Could the structure just look like this?
> > struct iommu_dev_assign_virtual_id {
> >        __u32 size;
> >        __u32 dev_id;
> >        __u32 id_type;
> >        __u32 id;
> > };
> 
> It needs to take in the viommu_id also, and I'd make the id 64 bits
> just for good luck.

What is viommu_id required for in this context? I thought we
already know which SMMU instance to issue commands via dev_id?

> > > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > > (and just force the stream ID, userspace must direct the vRID to the
> > > correct dev_id).
> > 
> > SMMU's CD invalidations could fall into this category too.

Do we need a different iommu API for this ioctl? We currently
have the cache_invalidate_user as a domain op. The new device
op will be an iommu op?

And should we rename the "cache_invalidate_user"? Would VT-d
still uses it for device cache?

> > I previously drafted something to test it out with iommufd.
> > Basically it needs the pairing of vRID/pRID in attach_dev()
> > and another ioctl to mmap/config user queue(s):
> > +struct iommu_hwpt_cache_config_tegra241_vcmdq {
> > +       __u32 vcmdq_id;			// queue id
> > +       __u32 vcmdq_log2size;		// queue size
> > +       __aligned_u64 vcmdq_base;	// queue guest PA
> > +};
> 
> vRID/pRID pairing should come from IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. When
> a HWPT is allocated it would be connected to the viommu_id and then it
> would all be bundled together in the HW somehow

Since we were talking about sharing stage-2 domain, the HWPT
to the stage-2 domain will be shared among the vIOMMU/pIOMMU
instances too? I think I am not quite getting the viommu_id
part yet. From the other side of this thread, viommu object
is created per vIOMMU instance and each one actually tied to
a pIOMMU by nature?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-12  7:30       ` Nicolin Chen
@ 2023-12-12 14:44         ` Jason Gunthorpe
  2023-12-12 19:13           ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-12 14:44 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:

> > > Could the structure just look like this?
> > > struct iommu_dev_assign_virtual_id {
> > >        __u32 size;
> > >        __u32 dev_id;
> > >        __u32 id_type;
> > >        __u32 id;
> > > };
> > 
> > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > just for good luck.
> 
> What is viommu_id required for in this context? I thought we
> already know which SMMU instance to issue commands via dev_id?

The viommu_id would be the container that holds the xarray that maps
the vRID to pRID

Logically we could have multiple mappings per iommufd as we could have
multiple iommu instances working here.

> > > > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > > > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > > > (and just force the stream ID, userspace must direct the vRID to the
> > > > correct dev_id).
> > > 
> > > SMMU's CD invalidations could fall into this category too.
> 
> Do we need a different iommu API for this ioctl? We currently
> have the cache_invalidate_user as a domain op. The new device
> op will be an iommu op?

Yes

> And should we rename the "cache_invalidate_user"? Would VT-d
> still uses it for device cache?

I think vt-d will not implement it
 
> > > I previously drafted something to test it out with iommufd.
> > > Basically it needs the pairing of vRID/pRID in attach_dev()
> > > and another ioctl to mmap/config user queue(s):
> > > +struct iommu_hwpt_cache_config_tegra241_vcmdq {
> > > +       __u32 vcmdq_id;			// queue id
> > > +       __u32 vcmdq_log2size;		// queue size
> > > +       __aligned_u64 vcmdq_base;	// queue guest PA
> > > +};
> > 
> > vRID/pRID pairing should come from IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. When
> > a HWPT is allocated it would be connected to the viommu_id and then it
> > would all be bundled together in the HW somehow
> 
> Since we were talking about sharing stage-2 domain, the HWPT
> to the stage-2 domain will be shared among the vIOMMU/pIOMMU
> instances too?

The HWPT for the stage 2 should be shared

> I think I am not quite getting the viommu_id
> part yet. From the other side of this thread, viommu object
> is created per vIOMMU instance and each one actually tied to
> a pIOMMU by nature?

Yes

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-12 14:44         ` Jason Gunthorpe
@ 2023-12-12 19:13           ` Nicolin Chen
  2023-12-12 19:21             ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-12 19:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Tue, Dec 12, 2023 at 10:44:21AM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> 
> > > > Could the structure just look like this?
> > > > struct iommu_dev_assign_virtual_id {
> > > >        __u32 size;
> > > >        __u32 dev_id;
> > > >        __u32 id_type;
> > > >        __u32 id;
> > > > };
> > > 
> > > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > > just for good luck.
> > 
> > What is viommu_id required for in this context? I thought we
> > already know which SMMU instance to issue commands via dev_id?
> 
> The viommu_id would be the container that holds the xarray that maps
> the vRID to pRID
> 
> Logically we could have multiple mappings per iommufd as we could have
> multiple iommu instances working here.

I see. This is the object to hold a shared stage-2 HWPT/domain then.

// iommufd_private.h

enum iommufd_object_type {
	...
+	IOMMUFD_OBJ_VIOMMU,
	...
};

+struct iommufd_viommu {
+	struct iommufd_object obj;
+	struct iommufd_hwpt_paging *hwpt;
+	struct xarray devices;
+};

struct iommufd_hwpt_paging hwpt {
	...
+	struct list_head viommu_list;
	...
};

struct iommufd_group {
	...
+	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
	...
};

Question to finalize how we maps vRID-pRID in the xarray:
how should IOMMUFD_DEV_INVALIDATE work? The ioctl structure has
a dev_id and a list of commands that belongs to the device. So,
it forwards the struct device pointer to the driver along with
the commands. Then, doesn't the driver already know the pRID 
from the dev pointer without looking up a vRID-pRID table?

// uapi/linux/iommufd.h

struct iommu_hwpt_alloc {
	...
+	__u32 viommu_id;
};

+enum iommu_dev_virtual_id_type {
+	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_DID, // not sure how this fits the xarray in viommu obj.
+	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_RID,
+	IOMMU_DEV_VIRTUAL_ID_TYPE_ARM_SMMU_SID,
+};
+
+struct iommu_dev_assign_virtual_id {
+	__u32 size;
+	__u32 dev_id;
+	__u32 viommu_id;
+	__u32 id_type;
+	__aligned_u64 id;
+};

Then, I think that we also need an iommu_viommu_alloc structure
and ioctl to allocate an object, and that VMM should know if it
needs to allocate multiple viommu objects -- this probably needs
the hw_info ioctl to return a piommu_id so VMM gets the list of
piommus from the attached devices?

Another question, introducing the viommu obj complicates things
a lot. Do we want it to come with iommu_dev_assign_virtual_id,
or maybe put in a later series? We could stage the xarray in the
iommu_hwpt_paging struct for now, so a single-IOMMU system could
still work with that.

> > > > > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > > > > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > > > > (and just force the stream ID, userspace must direct the vRID to the
> > > > > correct dev_id).
> > > > 
> > > > SMMU's CD invalidations could fall into this category too.
> > 
> > Do we need a different iommu API for this ioctl? We currently
> > have the cache_invalidate_user as a domain op. The new device
> > op will be an iommu op?
> 
> Yes

OK. FWIW, I think either VMM or iommufd core knows which nested
HWPT the device is attached to. If we ever wanted to keep it in
the domain op list, we could still do that by passing in extra
domain pointer to the driver.

> > And should we rename the "cache_invalidate_user"? Would VT-d
> > still uses it for device cache?
> 
> I think vt-d will not implement it

Then should we "s/cache_invalidate_user/iotlb_sync_user"?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-12 19:13           ` Nicolin Chen
@ 2023-12-12 19:21             ` Jason Gunthorpe
  2023-12-12 20:05               ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-12 19:21 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Tue, Dec 12, 2023 at 11:13:37AM -0800, Nicolin Chen wrote:
> On Tue, Dec 12, 2023 at 10:44:21AM -0400, Jason Gunthorpe wrote:
> > On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> > 
> > > > > Could the structure just look like this?
> > > > > struct iommu_dev_assign_virtual_id {
> > > > >        __u32 size;
> > > > >        __u32 dev_id;
> > > > >        __u32 id_type;
> > > > >        __u32 id;
> > > > > };
> > > > 
> > > > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > > > just for good luck.
> > > 
> > > What is viommu_id required for in this context? I thought we
> > > already know which SMMU instance to issue commands via dev_id?
> > 
> > The viommu_id would be the container that holds the xarray that maps
> > the vRID to pRID
> > 
> > Logically we could have multiple mappings per iommufd as we could have
> > multiple iommu instances working here.
> 
> I see. This is the object to hold a shared stage-2 HWPT/domain then.

It could be done like that, yes. I wasn't thinking about linking the
stage two so tightly but perhaps? If we can avoid putting the hwpt
here that might be more general.

> // iommufd_private.h
> 
> enum iommufd_object_type {
> 	...
> +	IOMMUFD_OBJ_VIOMMU,
> 	...
> };
> 
> +struct iommufd_viommu {
> +	struct iommufd_object obj;
> +	struct iommufd_hwpt_paging *hwpt;
> +	struct xarray devices;
> +};
> 
> struct iommufd_hwpt_paging hwpt {
> 	...
> +	struct list_head viommu_list;
> 	...
> };

I'd probably first try to go backwards and link the hwpt to the
viommu.

> struct iommufd_group {
> 	...
> +	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
> 	...
> };

No. Attach is a statement of translation so you still attach to the HWPT.

 
> Question to finalize how we maps vRID-pRID in the xarray:
> how should IOMMUFD_DEV_INVALIDATE work? The ioctl structure has
> a dev_id and a list of commands that belongs to the device. So,
> it forwards the struct device pointer to the driver along with
> the commands. Then, doesn't the driver already know the pRID 
> from the dev pointer without looking up a vRID-pRID table?

The first version of DEV_INVALIDATE should have no xarray. The
invalidate commands are stripped of the SID and executed on the given
dev_id period. VMM splits up the invalidate command list.

The second version maybe we have the xarray, or maybe we just push the
xarray to the eventual viommu series.

> struct iommu_hwpt_alloc {
> 	...
> +	__u32 viommu_id;
> };
> 
> +enum iommu_dev_virtual_id_type {
> +	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_DID, // not sure how this fits the xarray in viommu obj.
> +	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_RID,

It is just DID. In both cases the ID is the index to the "STE" radix
tree, whatever the driver happens to call it.

> Then, I think that we also need an iommu_viommu_alloc structure
> and ioctl to allocate an object, and that VMM should know if it
> needs to allocate multiple viommu objects -- this probably needs
> the hw_info ioctl to return a piommu_id so VMM gets the list of
> piommus from the attached devices?

Yes and yes

> Another question, introducing the viommu obj complicates things
> a lot. Do we want it to come with iommu_dev_assign_virtual_id,
> or maybe put in a later series? We could stage the xarray in the
> iommu_hwpt_paging struct for now, so a single-IOMMU system could
> still work with that.

All this would be in its own series to enable HW accelerated viommu
support on ARM & AMD as we've been doing so far.

I imagine it after we get the basic invalidation done

> > > And should we rename the "cache_invalidate_user"? Would VT-d
> > > still uses it for device cache?
> > 
> > I think vt-d will not implement it
> 
> Then should we "s/cache_invalidate_user/iotlb_sync_user"?

I think cache_invalidate is still a fine name.. vt-d will generate ATC
invalidations under that function too.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-12 19:21             ` Jason Gunthorpe
@ 2023-12-12 20:05               ` Nicolin Chen
  2023-12-13 12:40                 ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Nicolin Chen @ 2023-12-12 20:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Tue, Dec 12, 2023 at 03:21:00PM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 11:13:37AM -0800, Nicolin Chen wrote:
> > On Tue, Dec 12, 2023 at 10:44:21AM -0400, Jason Gunthorpe wrote:
> > > On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> > > 
> > > > > > Could the structure just look like this?
> > > > > > struct iommu_dev_assign_virtual_id {
> > > > > >        __u32 size;
> > > > > >        __u32 dev_id;
> > > > > >        __u32 id_type;
> > > > > >        __u32 id;
> > > > > > };
> > > > > 
> > > > > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > > > > just for good luck.
> > > > 
> > > > What is viommu_id required for in this context? I thought we
> > > > already know which SMMU instance to issue commands via dev_id?
> > > 
> > > The viommu_id would be the container that holds the xarray that maps
> > > the vRID to pRID
> > > 
> > > Logically we could have multiple mappings per iommufd as we could have
> > > multiple iommu instances working here.
> > 
> > I see. This is the object to hold a shared stage-2 HWPT/domain then.
> 
> It could be done like that, yes. I wasn't thinking about linking the
> stage two so tightly but perhaps? If we can avoid putting the hwpt
> here that might be more general.
> 
> > // iommufd_private.h
> > 
> > enum iommufd_object_type {
> > 	...
> > +	IOMMUFD_OBJ_VIOMMU,
> > 	...
> > };
> > 
> > +struct iommufd_viommu {
> > +	struct iommufd_object obj;
> > +	struct iommufd_hwpt_paging *hwpt;
> > +	struct xarray devices;
> > +};
> > 
> > struct iommufd_hwpt_paging hwpt {
> > 	...
> > +	struct list_head viommu_list;
> > 	...
> > };
> 
> I'd probably first try to go backwards and link the hwpt to the
> viommu.

I think a VM should have only one hwpt_paging object while one
or more viommu objects, so we could do only viommu->hwpt_paging
and hwpt_paging->viommu_list. How to go backwards?

> > struct iommufd_group {
> > 	...
> > +	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
> > 	...
> > };
> 
> No. Attach is a statement of translation so you still attach to the HWPT.

OK. It's probably not necessary since we know which piommu the
device is behind. And we only need to link viommu and piommu,
right?

> > Question to finalize how we maps vRID-pRID in the xarray:
> > how should IOMMUFD_DEV_INVALIDATE work? The ioctl structure has
> > a dev_id and a list of commands that belongs to the device. So,
> > it forwards the struct device pointer to the driver along with
> > the commands. Then, doesn't the driver already know the pRID 
> > from the dev pointer without looking up a vRID-pRID table?
> 
> The first version of DEV_INVALIDATE should have no xarray. The
> invalidate commands are stripped of the SID and executed on the given
> dev_id period. VMM splits up the invalidate command list.

Yes. This makes sense to me. VMM knows which device to issue
an IOMMUFD_DEV_INVALIDATE from the vSID/vRID in the commands.

> The second version maybe we have the xarray, or maybe we just push the
> xarray to the eventual viommu series.

I think that I still don't get the purpose of the xarray here.
It was needed previously because a cache invalidate per hwpt
doesn't know which device. Now IOMMUFD_DEV_INVALIDATE knows.

Maybe it's related to that narrative "logically we could have
multiple mappings per iommufd" that you mentioned above. Mind
elaborating a bit?

In my mind, viommu is allocated by VMM per piommu, by detecting
the piommu_id via hw_info. In that case, viommu can only have
one unique device list. If IOMMUFD_DEV_INVALIDATE passes in the
dev_id, we don't really need a mapping of vRID-pRID in a multi-
viommu case either? In another word, VMM already has a mapping
from vRID to dev_id, so it could call the DEV_INVALIDATE ioctl
in the first place?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-12 20:05               ` Nicolin Chen
@ 2023-12-13 12:40                 ` Jason Gunthorpe
  2023-12-13 19:54                   ` Nicolin Chen
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-12-13 12:40 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Tue, Dec 12, 2023 at 12:05:41PM -0800, Nicolin Chen wrote:
> > > // iommufd_private.h
> > > 
> > > enum iommufd_object_type {
> > > 	...
> > > +	IOMMUFD_OBJ_VIOMMU,
> > > 	...
> > > };
> > > 
> > > +struct iommufd_viommu {
> > > +	struct iommufd_object obj;
> > > +	struct iommufd_hwpt_paging *hwpt;
> > > +	struct xarray devices;
> > > +};
> > > 
> > > struct iommufd_hwpt_paging hwpt {
> > > 	...
> > > +	struct list_head viommu_list;
> > > 	...
> > > };
> > 
> > I'd probably first try to go backwards and link the hwpt to the
> > viommu.
> 
> I think a VM should have only one hwpt_paging object while one
> or more viommu objects, so we could do only viommu->hwpt_paging
> and hwpt_paging->viommu_list. How to go backwards?

That is probably how things would work but I don't know if it makes
sense to enforce it in the kernel logic..

Point the S2 to a list of viommu objects it is linked to

> > > struct iommufd_group {
> > > 	...
> > > +	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
> > > 	...
> > > };
> > 
> > No. Attach is a statement of translation so you still attach to the HWPT.
> 
> OK. It's probably not necessary since we know which piommu the
> device is behind. And we only need to link viommu and piommu,
> right?

Yes

> > The second version maybe we have the xarray, or maybe we just push the
> > xarray to the eventual viommu series.
> 
> I think that I still don't get the purpose of the xarray here.
> It was needed previously because a cache invalidate per hwpt
> doesn't know which device. Now IOMMUFD_DEV_INVALIDATE knows.
> 
> Maybe it's related to that narrative "logically we could have
> multiple mappings per iommufd" that you mentioned above. Mind
> elaborating a bit?
> 
> In my mind, viommu is allocated by VMM per piommu, by detecting
> the piommu_id via hw_info. In that case, viommu can only have
> one unique device list. If IOMMUFD_DEV_INVALIDATE passes in the
> dev_id, we don't really need a mapping of vRID-pRID in a multi-
> viommu case either? In another word, VMM already has a mapping
> from vRID to dev_id, so it could call the DEV_INVALIDATE ioctl
> in the first place?

The xarray exists to optimize the invalidation flow.

For SW you can imagine issuing an invalidation against the viommu
itself and *all* commands, be they ASID or ATC invalidations can be
processed in one shot. The xarray allows converting the vSID to pSID
to process ATC invalidations, and the viommu object forces a single
VMID to handle the ATC invalidations. If we want to do this, I don't
know.

For HW, the xarray holds the vSID to pSID mapping that must be
programmed into the HW operating the dedicated invalidation queue.

Jason

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-13 12:40                 ` Jason Gunthorpe
@ 2023-12-13 19:54                   ` Nicolin Chen
  0 siblings, 0 replies; 93+ messages in thread
From: Nicolin Chen @ 2023-12-13 19:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Giani, Dhaval, Vasant Hegde, Suravee Suthikulpanit, joro,
	alex.williamson, kevin.tian, robin.murphy, baolu.lu, cohuck,
	eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan, joao.m.martins, xin.zeng,
	yan.y.zhao

On Wed, Dec 13, 2023 at 08:40:55AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 12:05:41PM -0800, Nicolin Chen wrote:
> > > > // iommufd_private.h
> > > > 
> > > > enum iommufd_object_type {
> > > > 	...
> > > > +	IOMMUFD_OBJ_VIOMMU,
> > > > 	...
> > > > };
> > > > 
> > > > +struct iommufd_viommu {
> > > > +	struct iommufd_object obj;
> > > > +	struct iommufd_hwpt_paging *hwpt;
> > > > +	struct xarray devices;
> > > > +};
> > > > 
> > > > struct iommufd_hwpt_paging hwpt {
> > > > 	...
> > > > +	struct list_head viommu_list;
> > > > 	...
> > > > };
> > > 
> > > I'd probably first try to go backwards and link the hwpt to the
> > > viommu.
> > 
> > I think a VM should have only one hwpt_paging object while one
> > or more viommu objects, so we could do only viommu->hwpt_paging
> > and hwpt_paging->viommu_list. How to go backwards?
> 
> That is probably how things would work but I don't know if it makes
> sense to enforce it in the kernel logic..
> 
> Point the S2 to a list of viommu objects it is linked to

Hmm, isn't that hwpt_paging->viommu_list already?

> > > The second version maybe we have the xarray, or maybe we just push the
> > > xarray to the eventual viommu series.
> > 
> > I think that I still don't get the purpose of the xarray here.
> > It was needed previously because a cache invalidate per hwpt
> > doesn't know which device. Now IOMMUFD_DEV_INVALIDATE knows.
> > 
> > Maybe it's related to that narrative "logically we could have
> > multiple mappings per iommufd" that you mentioned above. Mind
> > elaborating a bit?
> > 
> > In my mind, viommu is allocated by VMM per piommu, by detecting
> > the piommu_id via hw_info. In that case, viommu can only have
> > one unique device list. If IOMMUFD_DEV_INVALIDATE passes in the
> > dev_id, we don't really need a mapping of vRID-pRID in a multi-
> > viommu case either? In another word, VMM already has a mapping
> > from vRID to dev_id, so it could call the DEV_INVALIDATE ioctl
> > in the first place?
> 
> The xarray exists to optimize the invalidation flow.
> 
> For SW you can imagine issuing an invalidation against the viommu
> itself and *all* commands, be they ASID or ATC invalidations can be
> processed in one shot. The xarray allows converting the vSID to pSID
> to process ATC invalidations, and the viommu object forces a single
> VMID to handle the ATC invalidations. If we want to do this, I don't
> know.

I drafted some patches with IOMMUFD_DEV_INVALIDATE yesterday,
and realized the same problem that you pointed out here: how
VMM should handle a group of commands interlaced with ASID and
ATC commands. If VMM dispatches commands into two groups, the
executions of the commands will be in a different order than
what the guest kernel issued in. This might be bitter if there
is an error occurring in the middle of command executions, in
which case some later invalidations are done successfully but
the CONS index would have to stop at a command prior.

And even if there are only ATC invalidations in a guest queue,
there's no guarantee that all commands are for the same dev_id,
i.e. ATC invalidations themselves would be dispatched into more
groups and separate IOMMUFD_DEV_INVALIDATE calls.

With the xarray, perhaps we could provide a viommu_id in data
structure of the current iommu_hwpt_invalidate, i.e. reshaping
the existing invalidate uAPI per viommu, so it can be reused by
ATC invalidations too instead of adding IOMMUFD_DEV_INVALIDATE?
Then we wouldn't have the out-of-order execution problem above.

> For HW, the xarray holds the vSID to pSID mapping that must be
> programmed into the HW operating the dedicated invalidation queue.

Ah, right! VCMDQ at least needs that.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 93+ messages in thread

[parent not found: <CGME20231217215720eucas1p2a590aca62ce8eb5ba81df6bc8b1a785d@eucas1p2.samsung.com>]

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
       [not found] ` <CGME20231217215720eucas1p2a590aca62ce8eb5ba81df6bc8b1a785d@eucas1p2.samsung.com>
@ 2023-12-17 11:21   ` Joel Granados
  2023-12-19  9:26     ` Yi Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Joel Granados @ 2023-12-17 11:21 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

[-- Attachment #1: Type: text/plain, Size: 7303 bytes --]

Hey Yi

I have been working with https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
and have some questions regarding one of the commits in that series.
I however cannot find it in lore.kernel.org. Can you please direct me to
where the rfc was posted? If it has not been posted yet, do you have an
alternate place for discussion?

Best

On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> Nested translation is a hardware feature that is supported by many modern
> IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
> to get access to the physical address. stage-1 translation table is owned
> by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
> to stage-1 translation table should be followed by an IOTLB invalidation.
> 
> Take Intel VT-d as an example, the stage-1 translation table is I/O page
> table. As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform the stage-1
> address translation. Along with it, modifications to present mappings in the
> guest I/O page table should be followed with an IOTLB invalidation.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest I/O page table      |
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush --+
>     '-------------'                        |
>     |             |                        V
>     |             |           I/O page table pointer in GPA
>     '-------------'
> Guest
> ------| Shadow |---------------------------|--------
>       v        v                           v
> Host
>     .-------------.  .------------------------.
>     |   pIOMMU    |  |  FS for GIOVA->GPA     |
>     |             |  '------------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.----------------------------------.
>     |             |   | SS for GPA->HPA, unmanaged domain|
>     |             |   '----------------------------------'
>     '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
> 
> This series adds the cache invalidation path for the userspace to invalidate
> cache after modifying the stage-1 page table. This is based on the first part
> of nesting [1]
> 
> Complete code can be found in [2], QEMU could can be found in [3].
> 
> At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
> them for the help. ^_^. Look forward to your feedbacks.
> 
> [1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
> [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
> 
> Change log:
> 
> v6:
>  - No much change, just rebase on top of 6.7-rc1 as part 1/2 is merged
> 
> v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
>  - Split the iommufd nesting series into two parts of alloc_user and
>    invalidation (Jason)
>  - Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
>    do the same with the structures/alloc()/abort()/destroy(). Reworked the
>    selftest accordingly too. (Jason)
>  - Move hwpt/data_type into struct iommu_user_data from standalone op
>    arguments. (Jason)
>  - Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
>    _TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
>  - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
>  - Add macro to the iommu_copy_struct_from_user() to calculate min_size
>    (Jason)
>  - Fix two bugs spotted by ZhaoYan
> 
> v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
>  - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
>    and kernel-managed HWPTs
>  - Rework invalidate uAPI to be a multi-request array-based design
>  - Add a struct iommu_user_data_array and a helper for driver to sanitize
>    and copy the entry data from user space invalidation array
>  - Add a patch fixing TEST_LENGTH() in selftest program
>  - Drop IOMMU_RESV_IOVA_RANGES patches
>  - Update kdoc and inline comments
>  - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
>    this does not change the rule that resv regions should only be added to the
>    kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
>    as it is needed only by SMMU so far.
> 
> v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
>  - Add new uAPI things in alphabetical order
>  - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
>    sanity, replacing the previous op->domain_alloc_user_data_len solution
>  - Return ERR_PTR from domain_alloc_user instead of NULL
>  - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
>  - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
>    userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
>    page table). (Kevin)
>  - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
>  - Minor changes per Kevin's inputs
> 
> v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
>  - Add union iommu_domain_user_data to include all user data structures to avoid
>    passing void * in kernel APIs.
>  - Add iommu op to return user data length for user domain allocation
>  - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
>  - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
>  - Convert cache_invalidate_user op to be int instead of void
>  - Remove @data_type in struct iommu_hwpt_invalidate
>  - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
> 
> v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/
> 
> Thanks,
> 	Yi Liu
> 
> Lu Baolu (1):
>   iommu: Add cache_invalidate_user op
> 
> Nicolin Chen (4):
>   iommu: Add iommu_copy_struct_from_user_array helper
>   iommufd/selftest: Add mock_domain_cache_invalidate_user support
>   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
>   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> 
> Yi Liu (1):
>   iommufd: Add IOMMU_HWPT_INVALIDATE
> 
>  drivers/iommu/iommufd/hw_pagetable.c          | 35 ++++++++
>  drivers/iommu/iommufd/iommufd_private.h       |  9 ++
>  drivers/iommu/iommufd/iommufd_test.h          | 22 +++++
>  drivers/iommu/iommufd/main.c                  |  3 +
>  drivers/iommu/iommufd/selftest.c              | 69 +++++++++++++++
>  include/linux/iommu.h                         | 84 +++++++++++++++++++
>  include/uapi/linux/iommufd.h                  | 35 ++++++++
>  tools/testing/selftests/iommu/iommufd.c       | 75 +++++++++++++++++
>  tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
>  9 files changed, 395 insertions(+)
> 
> -- 
> 2.34.1
> 

-- 

Joel Granados

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-17 11:21   ` Joel Granados
@ 2023-12-19  9:26     ` Yi Liu
  2023-12-20 11:23       ` Joel Granados
  0 siblings, 1 reply; 93+ messages in thread
From: Yi Liu @ 2023-12-19  9:26 UTC (permalink / raw)
  To: Joel Granados
  Cc: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

On 2023/12/17 19:21, Joel Granados wrote:
> Hey Yi
> 
> I have been working with https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

good to know about it.

> and have some questions regarding one of the commits in that series.
> I however cannot find it in lore.kernel.org. Can you please direct me to
> where the rfc was posted? If it has not been posted yet, do you have an
> alternate place for discussion?

the qemu series has not been posted yet as kernel side is still changing.
It still needs some time to be ready for public review. Zhenzhong Duan
is going to post it when it's ready. If you have questions to discuss,
you can post your questions to Zhenzhong and me first. I guess it may be
fine to cc Alex Williamson, Eric Auger, Nicolin Chen, Cédric Le Goater,
Kevin Tian, Jason Gunthorpe and qemu mail list as this is discussion 
something that is going to be posted in public.

> 
> Best
> 
> On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
>> Nested translation is a hardware feature that is supported by many modern
>> IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
>> to get access to the physical address. stage-1 translation table is owned
>> by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
>> to stage-1 translation table should be followed by an IOTLB invalidation.
>>
>> Take Intel VT-d as an example, the stage-1 translation table is I/O page
>> table. As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform the stage-1
>> address translation. Along with it, modifications to present mappings in the
>> guest I/O page table should be followed with an IOTLB invalidation.
>>
>>      .-------------.  .---------------------------.
>>      |   vIOMMU    |  | Guest I/O page table      |
>>      |             |  '---------------------------'
>>      .----------------/
>>      | PASID Entry |--- PASID cache flush --+
>>      '-------------'                        |
>>      |             |                        V
>>      |             |           I/O page table pointer in GPA
>>      '-------------'
>> Guest
>> ------| Shadow |---------------------------|--------
>>        v        v                           v
>> Host
>>      .-------------.  .------------------------.
>>      |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>      |             |  '------------------------'
>>      .----------------/  |
>>      | PASID Entry |     V (Nested xlate)
>>      '----------------\.----------------------------------.
>>      |             |   | SS for GPA->HPA, unmanaged domain|
>>      |             |   '----------------------------------'
>>      '-------------'
>> Where:
>>   - FS = First stage page tables
>>   - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>> This series adds the cache invalidation path for the userspace to invalidate
>> cache after modifying the stage-1 page table. This is based on the first part
>> of nesting [1]
>>
>> Complete code can be found in [2], QEMU could can be found in [3].
>>
>> At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
>> them for the help. ^_^. Look forward to your feedbacks.
>>
>> [1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
>> [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
>> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
>>
>> Change log:
>>
>> v6:
>>   - No much change, just rebase on top of 6.7-rc1 as part 1/2 is merged
>>
>> v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
>>   - Split the iommufd nesting series into two parts of alloc_user and
>>     invalidation (Jason)
>>   - Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
>>     do the same with the structures/alloc()/abort()/destroy(). Reworked the
>>     selftest accordingly too. (Jason)
>>   - Move hwpt/data_type into struct iommu_user_data from standalone op
>>     arguments. (Jason)
>>   - Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
>>     _TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
>>   - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
>>   - Add macro to the iommu_copy_struct_from_user() to calculate min_size
>>     (Jason)
>>   - Fix two bugs spotted by ZhaoYan
>>
>> v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
>>   - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
>>     and kernel-managed HWPTs
>>   - Rework invalidate uAPI to be a multi-request array-based design
>>   - Add a struct iommu_user_data_array and a helper for driver to sanitize
>>     and copy the entry data from user space invalidation array
>>   - Add a patch fixing TEST_LENGTH() in selftest program
>>   - Drop IOMMU_RESV_IOVA_RANGES patches
>>   - Update kdoc and inline comments
>>   - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
>>     this does not change the rule that resv regions should only be added to the
>>     kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
>>     as it is needed only by SMMU so far.
>>
>> v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
>>   - Add new uAPI things in alphabetical order
>>   - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
>>     sanity, replacing the previous op->domain_alloc_user_data_len solution
>>   - Return ERR_PTR from domain_alloc_user instead of NULL
>>   - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
>>   - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
>>     userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
>>     page table). (Kevin)
>>   - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
>>   - Minor changes per Kevin's inputs
>>
>> v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
>>   - Add union iommu_domain_user_data to include all user data structures to avoid
>>     passing void * in kernel APIs.
>>   - Add iommu op to return user data length for user domain allocation
>>   - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
>>   - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
>>   - Convert cache_invalidate_user op to be int instead of void
>>   - Remove @data_type in struct iommu_hwpt_invalidate
>>   - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
>>
>> v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/
>>
>> Thanks,
>> 	Yi Liu
>>
>> Lu Baolu (1):
>>    iommu: Add cache_invalidate_user op
>>
>> Nicolin Chen (4):
>>    iommu: Add iommu_copy_struct_from_user_array helper
>>    iommufd/selftest: Add mock_domain_cache_invalidate_user support
>>    iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
>>    iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
>>
>> Yi Liu (1):
>>    iommufd: Add IOMMU_HWPT_INVALIDATE
>>
>>   drivers/iommu/iommufd/hw_pagetable.c          | 35 ++++++++
>>   drivers/iommu/iommufd/iommufd_private.h       |  9 ++
>>   drivers/iommu/iommufd/iommufd_test.h          | 22 +++++
>>   drivers/iommu/iommufd/main.c                  |  3 +
>>   drivers/iommu/iommufd/selftest.c              | 69 +++++++++++++++
>>   include/linux/iommu.h                         | 84 +++++++++++++++++++
>>   include/uapi/linux/iommufd.h                  | 35 ++++++++
>>   tools/testing/selftests/iommu/iommufd.c       | 75 +++++++++++++++++
>>   tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
>>   9 files changed, 395 insertions(+)
>>
>> -- 
>> 2.34.1
>>
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
  2023-12-19  9:26     ` Yi Liu
@ 2023-12-20 11:23       ` Joel Granados
  0 siblings, 0 replies; 93+ messages in thread
From: Joel Granados @ 2023-12-20 11:23 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan, joao.m.martins, xin.zeng, yan.y.zhao

[-- Attachment #1: Type: text/plain, Size: 9222 bytes --]

On Tue, Dec 19, 2023 at 05:26:21PM +0800, Yi Liu wrote:
> On 2023/12/17 19:21, Joel Granados wrote:
> > Hey Yi
> > 
> > I have been working with https://protect2.fireeye.com/v1/url?k=b58750ce-ea1c9eaa-b586db81-000babda0201-365207d33731a099&q=1&e=ee73b69d-5c35-49ef-9e62-2355fb797f21&u=https%3A%2F%2Fgithub.com%2Fyiliu1765%2Fqemu%2Ftree%2Fzhenzhong%2Fwip%2Fiommufd_nesting_rfcv1
> 
> good to know about it.
> 
> > and have some questions regarding one of the commits in that series.
> > I however cannot find it in lore.kernel.org. Can you please direct me to
> > where the rfc was posted? If it has not been posted yet, do you have an
> > alternate place for discussion?
> 
> the qemu series has not been posted yet as kernel side is still changing.
> It still needs some time to be ready for public review. Zhenzhong Duan
> is going to post it when it's ready. If you have questions to discuss,
> you can post your questions to Zhenzhong and me first. I guess it may be
> fine to cc Alex Williamson, Eric Auger, Nicolin Chen, Cédric Le Goater,
> Kevin Tian, Jason Gunthorpe and qemu mail list as this is discussion
> something that is going to be posted in public.
Thx for getting back to me. I'll direct my questions to these
recipients.

Best

> 
> > 
> > Best
> > 
> > On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> > > Nested translation is a hardware feature that is supported by many modern
> > > IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
> > > to get access to the physical address. stage-1 translation table is owned
> > > by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
> > > to stage-1 translation table should be followed by an IOTLB invalidation.
> > > 
> > > Take Intel VT-d as an example, the stage-1 translation table is I/O page
> > > table. As the below diagram shows, guest I/O page table pointer in GPA
> > > (guest physical address) is passed to host and be used to perform the stage-1
> > > address translation. Along with it, modifications to present mappings in the
> > > guest I/O page table should be followed with an IOTLB invalidation.
> > > 
> > >      .-------------.  .---------------------------.
> > >      |   vIOMMU    |  | Guest I/O page table      |
> > >      |             |  '---------------------------'
> > >      .----------------/
> > >      | PASID Entry |--- PASID cache flush --+
> > >      '-------------'                        |
> > >      |             |                        V
> > >      |             |           I/O page table pointer in GPA
> > >      '-------------'
> > > Guest
> > > ------| Shadow |---------------------------|--------
> > >        v        v                           v
> > > Host
> > >      .-------------.  .------------------------.
> > >      |   pIOMMU    |  |  FS for GIOVA->GPA     |
> > >      |             |  '------------------------'
> > >      .----------------/  |
> > >      | PASID Entry |     V (Nested xlate)
> > >      '----------------\.----------------------------------.
> > >      |             |   | SS for GPA->HPA, unmanaged domain|
> > >      |             |   '----------------------------------'
> > >      '-------------'
> > > Where:
> > >   - FS = First stage page tables
> > >   - SS = Second stage page tables
> > > <Intel VT-d Nested translation>
> > > 
> > > This series adds the cache invalidation path for the userspace to invalidate
> > > cache after modifying the stage-1 page table. This is based on the first part
> > > of nesting [1]
> > > 
> > > Complete code can be found in [2], QEMU could can be found in [3].
> > > 
> > > At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
> > > them for the help. ^_^. Look forward to your feedbacks.
> > > 
> > > [1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
> > > [2] https://protect2.fireeye.com/v1/url?k=38b56f01-672ea165-38b4e44e-000babda0201-469ae350f21411ca&q=1&e=ee73b69d-5c35-49ef-9e62-2355fb797f21&u=https%3A%2F%2Fgithub.com%2Fyiliu1765%2Fiommufd%2Ftree%2Fiommufd_nesting
> > > [3] https://protect2.fireeye.com/v1/url?k=d6e01ed1-897bd0b5-d6e1959e-000babda0201-bcf2b26a8dc8b34d&q=1&e=ee73b69d-5c35-49ef-9e62-2355fb797f21&u=https%3A%2F%2Fgithub.com%2Fyiliu1765%2Fqemu%2Ftree%2Fzhenzhong%2Fwip%2Fiommufd_nesting_rfcv1
> > > 
> > > Change log:
> > > 
> > > v6:
> > >   - No much change, just rebase on top of 6.7-rc1 as part 1/2 is merged
> > > 
> > > v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
> > >   - Split the iommufd nesting series into two parts of alloc_user and
> > >     invalidation (Jason)
> > >   - Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
> > >     do the same with the structures/alloc()/abort()/destroy(). Reworked the
> > >     selftest accordingly too. (Jason)
> > >   - Move hwpt/data_type into struct iommu_user_data from standalone op
> > >     arguments. (Jason)
> > >   - Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
> > >     _TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
> > >   - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
> > >   - Add macro to the iommu_copy_struct_from_user() to calculate min_size
> > >     (Jason)
> > >   - Fix two bugs spotted by ZhaoYan
> > > 
> > > v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
> > >   - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
> > >     and kernel-managed HWPTs
> > >   - Rework invalidate uAPI to be a multi-request array-based design
> > >   - Add a struct iommu_user_data_array and a helper for driver to sanitize
> > >     and copy the entry data from user space invalidation array
> > >   - Add a patch fixing TEST_LENGTH() in selftest program
> > >   - Drop IOMMU_RESV_IOVA_RANGES patches
> > >   - Update kdoc and inline comments
> > >   - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
> > >     this does not change the rule that resv regions should only be added to the
> > >     kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
> > >     as it is needed only by SMMU so far.
> > > 
> > > v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
> > >   - Add new uAPI things in alphabetical order
> > >   - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
> > >     sanity, replacing the previous op->domain_alloc_user_data_len solution
> > >   - Return ERR_PTR from domain_alloc_user instead of NULL
> > >   - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
> > >   - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
> > >     userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
> > >     page table). (Kevin)
> > >   - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
> > >   - Minor changes per Kevin's inputs
> > > 
> > > v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
> > >   - Add union iommu_domain_user_data to include all user data structures to avoid
> > >     passing void * in kernel APIs.
> > >   - Add iommu op to return user data length for user domain allocation
> > >   - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
> > >   - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
> > >   - Convert cache_invalidate_user op to be int instead of void
> > >   - Remove @data_type in struct iommu_hwpt_invalidate
> > >   - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
> > > 
> > > v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/
> > > 
> > > Thanks,
> > > 	Yi Liu
> > > 
> > > Lu Baolu (1):
> > >    iommu: Add cache_invalidate_user op
> > > 
> > > Nicolin Chen (4):
> > >    iommu: Add iommu_copy_struct_from_user_array helper
> > >    iommufd/selftest: Add mock_domain_cache_invalidate_user support
> > >    iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
> > >    iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> > > 
> > > Yi Liu (1):
> > >    iommufd: Add IOMMU_HWPT_INVALIDATE
> > > 
> > >   drivers/iommu/iommufd/hw_pagetable.c          | 35 ++++++++
> > >   drivers/iommu/iommufd/iommufd_private.h       |  9 ++
> > >   drivers/iommu/iommufd/iommufd_test.h          | 22 +++++
> > >   drivers/iommu/iommufd/main.c                  |  3 +
> > >   drivers/iommu/iommufd/selftest.c              | 69 +++++++++++++++
> > >   include/linux/iommu.h                         | 84 +++++++++++++++++++
> > >   include/uapi/linux/iommufd.h                  | 35 ++++++++
> > >   tools/testing/selftests/iommu/iommufd.c       | 75 +++++++++++++++++
> > >   tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
> > >   9 files changed, 395 insertions(+)
> > > 
> > > -- 
> > > 2.34.1
> > > 
> > 
> 
> -- 
> Regards,
> Yi Liu

-- 

Joel Granados

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2024-01-08  8:37 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-17 13:07 [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Yi Liu
2023-11-17 13:07 ` [PATCH v6 1/6] iommu: Add cache_invalidate_user op Yi Liu
2023-11-20  7:53   ` Tian, Kevin
2023-12-06 18:32   ` Jason Gunthorpe
2023-12-06 18:43     ` Nicolin Chen
2023-12-06 18:50       ` Jason Gunthorpe
2023-12-07  6:53         ` Yi Liu
2024-01-08  7:32   ` Binbin Wu
2023-11-17 13:07 ` [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
2023-11-20  8:09   ` Tian, Kevin
2023-11-20  8:29     ` Yi Liu
2023-11-20  8:34       ` Tian, Kevin
2023-11-20 17:36         ` Nicolin Chen
2023-11-21  2:50           ` Tian, Kevin
2023-11-21  5:24             ` Nicolin Chen
2023-11-24  2:36               ` Tian, Kevin
2023-11-27 19:53                 ` Nicolin Chen
2023-11-28  6:01                   ` Yi Liu
2023-11-29  0:54                     ` Nicolin Chen
2023-11-28  8:03                   ` Tian, Kevin
2023-11-29  0:51                     ` Nicolin Chen
2023-11-29  0:57                       ` Jason Gunthorpe
2023-11-29  1:09                         ` Nicolin Chen
2023-11-29 19:58                           ` Jason Gunthorpe
2023-11-29 22:07                             ` Nicolin Chen
2023-11-30  0:08                               ` Jason Gunthorpe
2023-11-30 20:41                                 ` Nicolin Chen
2023-12-01  0:45                                   ` Jason Gunthorpe
2023-12-01  4:29                                     ` Nicolin Chen
2023-12-01 12:55                                       ` Jason Gunthorpe
2023-12-01 19:58                                         ` Nicolin Chen
2023-12-01 20:43                                           ` Jason Gunthorpe
2023-12-01 22:12                                             ` Nicolin Chen
2023-12-04 14:48                                               ` Jason Gunthorpe
2023-12-05 17:33                                                 ` Nicolin Chen
2023-12-06 12:48                                                   ` Jason Gunthorpe
2023-12-01  3:51                         ` Yi Liu
2023-12-01  4:50                           ` Nicolin Chen
2023-12-01  5:19                             ` Tian, Kevin
2023-12-01  7:05                               ` Yi Liu
2023-12-01  7:10                                 ` Tian, Kevin
2023-12-01  9:08                                   ` Yi Liu
2023-11-21  5:02   ` Baolu Lu
2023-11-21  5:19     ` Nicolin Chen
2023-11-28  5:54       ` Yi Liu
2023-12-06 18:33   ` Jason Gunthorpe
2023-12-07  6:59   ` Yi Liu
2023-12-07  9:04     ` Tian, Kevin
2023-12-07 14:42       ` Jason Gunthorpe
2023-12-11  7:53         ` Yi Liu
2023-12-11 13:21           ` Jason Gunthorpe
2023-12-12 13:45             ` Liu, Yi L
2023-12-12 14:40               ` Jason Gunthorpe
2023-12-13 13:47                 ` Liu, Yi L
2023-12-13 14:11                   ` Jason Gunthorpe
2023-12-11  7:49       ` Yi Liu
2023-11-17 13:07 ` [PATCH v6 3/6] iommu: Add iommu_copy_struct_from_user_array helper Yi Liu
2023-11-20  8:17   ` Tian, Kevin
2023-11-20 17:25     ` Nicolin Chen
2023-11-21  2:48       ` Tian, Kevin
2024-01-08  8:37   ` Binbin Wu
2023-11-17 13:07 ` [PATCH v6 4/6] iommufd/selftest: Add mock_domain_cache_invalidate_user support Yi Liu
2023-12-06 18:16   ` Jason Gunthorpe
2023-12-11 11:21     ` Yi Liu
2023-11-17 13:07 ` [PATCH v6 5/6] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op Yi Liu
2023-11-17 13:07 ` [PATCH v6 6/6] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
2023-12-06 18:19   ` Jason Gunthorpe
2023-12-11 11:28     ` Yi Liu
2023-12-11 13:06       ` Jason Gunthorpe
2023-12-09  1:47 ` [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2) Jason Gunthorpe
2023-12-11  2:29   ` Tian, Kevin
2023-12-11 12:36     ` Yi Liu
2023-12-11 13:05       ` Jason Gunthorpe
2023-12-11 15:34         ` Suthikulpanit, Suravee
2023-12-11 16:06           ` Jason Gunthorpe
2023-12-11 12:35   ` Yi Liu
2023-12-11 13:20     ` Jason Gunthorpe
2023-12-11 20:11       ` Nicolin Chen
2023-12-11 21:48         ` Jason Gunthorpe
2023-12-11 17:35   ` Suthikulpanit, Suravee
2023-12-11 17:45     ` Jason Gunthorpe
2023-12-11 21:27   ` Nicolin Chen
2023-12-11 21:57     ` Jason Gunthorpe
2023-12-12  7:30       ` Nicolin Chen
2023-12-12 14:44         ` Jason Gunthorpe
2023-12-12 19:13           ` Nicolin Chen
2023-12-12 19:21             ` Jason Gunthorpe
2023-12-12 20:05               ` Nicolin Chen
2023-12-13 12:40                 ` Jason Gunthorpe
2023-12-13 19:54                   ` Nicolin Chen
     [not found] ` <CGME20231217215720eucas1p2a590aca62ce8eb5ba81df6bc8b1a785d@eucas1p2.samsung.com>
2023-12-17 11:21   ` Joel Granados
2023-12-19  9:26     ` Yi Liu
2023-12-20 11:23       ` Joel Granados

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).