[PATCH v2 0/4] vfio-pci support pasid attach/detach

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/4] vfio-pci support pasid attach/detach
@ 2024-04-12  8:21 Yi Liu
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound
with vfio-pci can report PASID capability to userspace and VM to enable
PASID usages like Shared Virtual Addressing (SVA).

This series first adds the helpers for pasid attach in vfio core and then
adds the device cdev ioctls for pasid attach/detach, finally exposing the
device PASID capability to the user. It depends on the iommufd pasid
attach/detach series [1].

A userspace VMM is supposed to get the details of the device's PASID capability
and assemble a virtual PASID capability in a proper offset in the virtual PCI
configuration space. While it is still an open on how to get the available
offsets. Devices may have hidden bits that are not in the PCI cap chain. For
now, there are two options to get the available offsets.[2]

- Report the available offsets via ioctl. This requires device-specific logic
  to provide available offsets. e.g., vfio-pci variant driver. Or may the device
  provide the available offset by DVSEC.
- Store the available offsets in a static table in userspace VMM. VMM gets the
  empty offsets from this table.

Since there was no clear answer to it, I have not made much change on this in
this version. Wish we could have more discussions on this.

The completed code can be found at [3], tested with a hacky Qemu branch [4].

[1] https://lore.kernel.org/linux-iommu/20240412081516.31168-1-yi.l.liu@intel.com/
[2] https://lore.kernel.org/kvm/b3e07591-8ebc-4924-85fe-29a46fc73d78@intel.com/
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[4] https://github.com/yiliu1765/qemu/tree/wip/zhenzhong/iommufd_nesting_rfcv2-test-pasid

Change log:

v2:
 - Use IDA to track if PASID is attached or not in VFIO. (Jason)
 - Fix the issue of calling pasid_at[de]tach_ioas callback unconditionally (Alex)
 - Fix the wrong data copy in vfio_df_ioctl_pasid_detach_pt() (Zhenzhong)
 - Minor tweaks in comments (Kevin)

v1: https://lore.kernel.org/kvm/20231127063909.129153-1-yi.l.liu@intel.com/
 - Report PASID capability via VFIO_DEVICE_FEATURE (Alex)

rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.com/

Regards,
        Yi Liu

Yi Liu (4):
  ida: Add ida_get_lowest()
  vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

 drivers/vfio/device_cdev.c       | 51 +++++++++++++++++++++++
 drivers/vfio/iommufd.c           | 60 +++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c      |  2 +
 drivers/vfio/pci/vfio_pci_core.c | 50 +++++++++++++++++++++++
 drivers/vfio/vfio.h              |  4 ++
 drivers/vfio/vfio_main.c         |  8 ++++
 include/linux/idr.h              |  1 +
 include/linux/vfio.h             | 11 +++++
 include/uapi/linux/vfio.h        | 69 ++++++++++++++++++++++++++++++++
 lib/idr.c                        | 67 +++++++++++++++++++++++++++++++
 10 files changed, 323 insertions(+)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16 16:03   ` Alex Williamson
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

There is no helpers for user to check if a given ID is allocated or not,
neither a helper to loop all the allocated IDs in an IDA and do something
for cleanup. With the two needs, a helper to get the lowest allocated ID
of a range can help to achieve it.

Caller can check if a given ID is allocated or not by:
	int id = 200, rc;

	rc = ida_get_lowest(&ida, id, id);
	if (rc == id)
		//id 200 is used
	else
		//id 200 is not used

Caller can iterate all allocated IDs by:
	int id = 0;

	while (!ida_is_empty(&pasid_ida)) {
		id = ida_get_lowest(pasid_ida, id, INT_MAX);
		if (id < 0)
			break;
		//anything to do with the allocated ID
		ida_free(pasid_ida, pasid);
	}

Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/idr.h |  1 +
 lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index da5f5fa4a3a6..1dae71d4a75d 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -257,6 +257,7 @@ struct ida {
 int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
 void ida_free(struct ida *, unsigned int id);
 void ida_destroy(struct ida *ida);
+int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
 
 /**
  * ida_alloc() - Allocate an unused ID.
diff --git a/lib/idr.c b/lib/idr.c
index da36054c3ca0..03e461242fe2 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
 }
 EXPORT_SYMBOL(ida_alloc_range);
 
+/**
+ * ida_get_lowest - Get the lowest used ID.
+ * @ida: IDA handle.
+ * @min: Lowest ID to get.
+ * @max: Highest ID to get.
+ *
+ * Get the lowest used ID between @min and @max, inclusive.  The returned
+ * ID will not exceed %INT_MAX, even if @max is larger.
+ *
+ * Context: Any context. Takes and releases the xa_lock.
+ * Return: The lowest used ID, or errno if no used ID is found.
+ */
+int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
+{
+	unsigned long index = min / IDA_BITMAP_BITS;
+	unsigned int offset = min % IDA_BITMAP_BITS;
+	unsigned long *addr, size, bit;
+	unsigned long flags;
+	void *entry;
+	int ret;
+
+	if (min >= INT_MAX)
+		return -EINVAL;
+	if (max >= INT_MAX)
+		max = INT_MAX;
+
+	xa_lock_irqsave(&ida->xa, flags);
+
+	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
+	if (!entry) {
+		ret = -ENOTTY;
+		goto err_unlock;
+	}
+
+	if (index > min / IDA_BITMAP_BITS)
+		offset = 0;
+	if (index * IDA_BITMAP_BITS + offset > max) {
+		ret = -ENOTTY;
+		goto err_unlock;
+	}
+
+	if (xa_is_value(entry)) {
+		unsigned long tmp = xa_to_value(entry);
+
+		addr = &tmp;
+		size = BITS_PER_XA_VALUE;
+	} else {
+		addr = ((struct ida_bitmap *)entry)->bitmap;
+		size = IDA_BITMAP_BITS;
+	}
+
+	bit = find_next_bit(addr, size, offset);
+
+	xa_unlock_irqrestore(&ida->xa, flags);
+
+	if (bit == size ||
+	    index * IDA_BITMAP_BITS + bit > max)
+		return -ENOTTY;
+
+	return index * IDA_BITMAP_BITS + bit;
+
+err_unlock:
+	xa_unlock_irqrestore(&ida->xa, flags);
+	return ret;
+}
+EXPORT_SYMBOL(ida_get_lowest);
+
 /**
  * ida_free() - Release an allocated ID.
  * @ida: IDA handle.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16  9:01   ` Tian, Kevin
  2024-04-23 12:43   ` Jason Gunthorpe
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a
device and the helpers for it. For now, only vfio-pci supports pasid
attach/detach.

Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/iommufd.c      | 60 +++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c |  2 ++
 include/linux/vfio.h        | 11 +++++++
 3 files changed, 73 insertions(+)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 82eba6966fa5..fc533416c75d 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -119,14 +119,26 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 	if (IS_ERR(idev))
 		return PTR_ERR(idev);
 	vdev->iommufd_device = idev;
+	ida_init(&vdev->pasids);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_bind);
 
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
 {
+	int pasid = 0;
+
 	lockdep_assert_held(&vdev->dev_set->lock);
 
+	while (!ida_is_empty(&vdev->pasids)) {
+		pasid = ida_get_lowest(&vdev->pasids, pasid, INT_MAX);
+		if (pasid < 0)
+			break;
+
+		iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+		ida_free(&vdev->pasids, pasid);
+	}
+
 	if (vdev->iommufd_attached) {
 		iommufd_device_detach(vdev->iommufd_device);
 		vdev->iommufd_attached = false;
@@ -168,6 +180,54 @@ void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_detach_ioas);
 
+int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
+					    u32 pasid, u32 *pt_id)
+{
+	int rc;
+
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_device))
+		return -EINVAL;
+
+	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
+	if (rc == pasid)
+		return iommufd_device_pasid_replace(vdev->iommufd_device,
+						    pasid, pt_id);
+
+	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid, pt_id);
+	if (rc)
+		return rc;
+
+	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
+	if (rc < 0) {
+		iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+		return rc;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_attach_ioas);
+
+void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev,
+					     u32 pasid)
+{
+	int rc;
+
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_device))
+		return;
+
+	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
+	if (rc < 0)
+		return;
+
+	iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+	ida_free(&vdev->pasids, pasid);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_detach_ioas);
+
 /*
  * The emulated standard ops mean that vfio_device is going to use the
  * "mdev path" and will call vfio_pin_pages()/vfio_dma_rw(). Drivers using this
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index cb5b7f865d58..e0198851ffd2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.unbind_iommufd	= vfio_iommufd_physical_unbind,
 	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
 	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
+	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
+	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 8b1a29820409..8fd1db173e84 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -66,6 +66,7 @@ struct vfio_device {
 	void (*put_kvm)(struct kvm *kvm);
 #if IS_ENABLED(CONFIG_IOMMUFD)
 	struct iommufd_device *iommufd_device;
+	struct ida pasids;
 	u8 iommufd_attached:1;
 #endif
 	u8 cdev_opened:1;
@@ -90,6 +91,8 @@ struct vfio_device {
  *		 bound iommufd. Undo in unbind_iommufd if @detach_ioas is not
  *		 called.
  * @detach_ioas: Opposite of attach_ioas
+ * @pasid_attach_ioas: The pasid variation of attach_ioas
+ * @pasid_detach_ioas: Opposite of pasid_attach_ioas
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
@@ -114,6 +117,8 @@ struct vfio_device_ops {
 	void	(*unbind_iommufd)(struct vfio_device *vdev);
 	int	(*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
 	void	(*detach_ioas)(struct vfio_device *vdev);
+	int	(*pasid_attach_ioas)(struct vfio_device *vdev, u32 pasid, u32 *pt_id);
+	void	(*pasid_detach_ioas)(struct vfio_device *vdev, u32 pasid);
 	int	(*open_device)(struct vfio_device *vdev);
 	void	(*close_device)(struct vfio_device *vdev);
 	ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
@@ -138,6 +143,8 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
 int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev);
+int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev, u32 pasid, u32 *pt_id);
+void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev, u32 pasid);
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
@@ -165,6 +172,10 @@ vfio_iommufd_get_dev_id(struct vfio_device *vdev, struct iommufd_ctx *ictx)
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
 #define vfio_iommufd_physical_detach_ioas \
 	((void (*)(struct vfio_device *vdev)) NULL)
+#define vfio_iommufd_physical_pasid_attach_ioas \
+	((int (*)(struct vfio_device *vdev, u32 pasid, u32 *pt_id)) NULL)
+#define vfio_iommufd_physical_pasid_detach_ioas \
+	((void (*)(struct vfio_device *vdev, u32 pasid)) NULL)
 #define vfio_iommufd_emulated_bind                                      \
 	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
 		  u32 *out_device_id)) NULL)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16  9:13   ` Tian, Kevin
  2024-04-23 12:45   ` Jason Gunthorpe
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
  2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
  4 siblings, 2 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

This adds ioctls for the userspace to attach/detach a given pasid of a
vfio device to/from an IOAS/HWPT.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h        |  4 +++
 drivers/vfio/vfio_main.c   |  8 ++++++
 include/uapi/linux/vfio.h  | 55 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 118 insertions(+)

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index e75da0a70d1f..5326f1608ace 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -210,6 +210,57 @@ int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
 	return 0;
 }
 
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_attach_iommufd_pt __user *arg)
+{
+	struct vfio_device_pasid_attach_iommufd_pt attach;
+	struct vfio_device *device = df->device;
+	unsigned long minsz;
+	int ret;
+
+	minsz = offsetofend(struct vfio_device_pasid_attach_iommufd_pt, pt_id);
+
+	if (copy_from_user(&attach, arg, minsz))
+		return -EFAULT;
+
+	if (attach.argsz < minsz || attach.flags)
+		return -EINVAL;
+
+	if (!device->ops->pasid_attach_ioas)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&device->dev_set->lock);
+	ret = device->ops->pasid_attach_ioas(device, attach.pasid, &attach.pt_id);
+	mutex_unlock(&device->dev_set->lock);
+
+	return ret;
+}
+
+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_detach_iommufd_pt __user *arg)
+{
+	struct vfio_device_pasid_detach_iommufd_pt detach;
+	struct vfio_device *device = df->device;
+	unsigned long minsz;
+
+	minsz = offsetofend(struct vfio_device_pasid_detach_iommufd_pt, pasid);
+
+	if (copy_from_user(&detach, arg, minsz))
+		return -EFAULT;
+
+	if (detach.argsz < minsz || detach.flags)
+		return -EINVAL;
+
+	if (!device->ops->pasid_detach_ioas)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&device->dev_set->lock);
+	device->ops->pasid_detach_ioas(device, detach.pasid);
+	mutex_unlock(&device->dev_set->lock);
+
+	return 0;
+}
+
 static char *vfio_device_devnode(const struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 50128da18bca..20d3cb283ba0 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -353,6 +353,10 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
 			    struct vfio_device_attach_iommufd_pt __user *arg);
 int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
 			    struct vfio_device_detach_iommufd_pt __user *arg);
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_attach_iommufd_pt __user *arg);
+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_detach_iommufd_pt __user *arg);
 
 #if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
 void vfio_init_device_cdev(struct vfio_device *device);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index e97d796a54fb..a5cece9fff5e 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1242,6 +1242,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 		case VFIO_DEVICE_DETACH_IOMMUFD_PT:
 			ret = vfio_df_ioctl_detach_pt(df, uptr);
 			goto out;
+
+		case VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT:
+			ret = vfio_df_ioctl_pasid_attach_pt(df, uptr);
+			goto out;
+
+		case VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT:
+			ret = vfio_df_ioctl_pasid_detach_pt(df, uptr);
+			goto out;
 		}
 	}
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..9591dc24b75c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -977,6 +977,61 @@ struct vfio_device_detach_iommufd_pt {
 
 #define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)
 
+/*
+ * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *					      struct vfio_device_pasid_attach_iommufd_pt)
+ * @argsz:	User filled size of this data.
+ * @flags:	Must be 0.
+ * @pasid:	The pasid to be attached.
+ * @pt_id:	Input the target id which can represent an ioas or a hwpt
+ *		allocated via iommufd subsystem.
+ *		Output the input ioas id or the attached hwpt id which could
+ *		be the specified hwpt itself or a hwpt automatically created
+ *		for the specified ioas by kernel during the attachment.
+ *
+ * Associate a pasid (of a cdev device) with an address space within the
+ * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT or device fd
+ * close. This is only allowed on cdev fds.
+ *
+ * If a pasid is currently attached to a valid hw_pagetable (hwpt), without
+ * doing a VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT, a second
+ * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is
+ * allowed. This action, also known as a hwpt replacement, will replace the
+ * pasid's currently attached hwpt with a new hwpt corresponding to the given
+ * @pt_id.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_pasid_attach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+	__u32	pasid;
+	__u32	pt_id;
+};
+
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 22,
+ *					      struct vfio_device_pasid_detach_iommufd_pt)
+ * @argsz:	User filled size of this data.
+ * @flags:	Must be 0.
+ * @pasid:	The pasid to be detached.
+ *
+ * Remove the association of a pasid (of a cdev device) and its current
+ * associated address space.  After it, the pasid of the device should be
+ * in a blocking DMA state.  This is only allowed on cdev fds.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_pasid_detach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+	__u32	pasid;
+};
+
+#define VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /*
  * Provide support for setting a PCI VF Token, which is used as a shared
  * secret between PF and VF drivers.  This feature may only be set on a
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
                   ` (2 preceding siblings ...)
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16  9:40   ` Tian, Kevin
                     ` (2 more replies)
  2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
  4 siblings, 3 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
other PCI capabilities, PASID capability is going to be reported to user by
VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by it.
This is a bit different from the other capabilities which are reported to
userspace when the user reads the device's PCI configuration space. There
are two reasons for this.

 - First, userspace like Qemu by default exposes all the available PCI
   capabilities in vfio-pci config space to the guest as read-only, so
   adding PASID capability in the vfio-pci config space will make it
   exposed to the guest automatically while an old Qemu doesn't really
   support it.

 - Second, the PASID capability does not exist on VFs (instead shares the
   cap of the PF). Creating a virtual PASID capability in vfio-pci config
   space needs to find a hole to place it, but doing so may require device
   specific knowledge to avoid potential conflict with device specific
   registers like hidden bits in VF's config space. It's simpler to move
   this burden to the VMM instead of maintaining a quirk system in the kernel.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 14 +++++++++
 2 files changed, 64 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d94d61b92c1a..ca64e461d435 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1495,6 +1495,54 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 	return 0;
 }
 
+static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
+				       struct vfio_device_feature_pasid __user *arg,
+				       size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct vfio_device_feature_pasid pasid = { 0 };
+	struct pci_dev *pdev = vdev->pdev;
+	u32 capabilities = 0;
+	u16 ctrl = 0;
+	int ret;
+
+	/*
+	 * Due to no PASID capability per VF, to be consistent, we do not
+	 * support SET of the PASID capability for both PF and VF.
+	 */
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(pasid));
+	if (ret != 1)
+		return ret;
+
+	/* VF shares the PASID capability of its PF */
+	if (pdev->is_virtfn)
+		pdev = pci_physfn(pdev);
+
+	if (!pdev->pasid_enabled)
+		goto out;
+
+#ifdef CONFIG_PCI_PASID
+	pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
+			      &capabilities);
+	pci_read_config_word(pdev, pdev->pasid_cap + PCI_PASID_CTRL,
+			     &ctrl);
+#endif
+
+	pasid.width = (capabilities >> 8) & 0x1f;
+
+	if (ctrl & PCI_PASID_CTRL_EXEC)
+		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_EXEC;
+	if (ctrl & PCI_PASID_CTRL_PRIV)
+		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_PRIV;
+
+out:
+	if (copy_to_user(arg, &pasid, sizeof(pasid)))
+		return -EFAULT;
+	return 0;
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1508,6 +1556,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_PASID:
+		return vfio_pci_core_feature_pasid(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9591dc24b75c..e50e55c67ab4 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1513,6 +1513,20 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET, return the PASID capability for the device.
+ * Zero width means no support for PASID.
+ */
+struct vfio_device_feature_pasid {
+	__u16 capabilities;
+#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
+#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
+	__u8 width;
+	__u8 __reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_PASID 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
                   ` (3 preceding siblings ...)
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
@ 2024-04-16  8:38 ` Tian, Kevin
  2024-04-16 17:50   ` Jason Gunthorpe
  4 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-16  8:38 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> A userspace VMM is supposed to get the details of the device's PASID
> capability
> and assemble a virtual PASID capability in a proper offset in the virtual PCI
> configuration space. While it is still an open on how to get the available
> offsets. Devices may have hidden bits that are not in the PCI cap chain. For
> now, there are two options to get the available offsets.[2]
> 
> - Report the available offsets via ioctl. This requires device-specific logic
>   to provide available offsets. e.g., vfio-pci variant driver. Or may the device
>   provide the available offset by DVSEC.
> - Store the available offsets in a static table in userspace VMM. VMM gets the
>   empty offsets from this table.
> 

I'm not a fan of requesting a variant driver for every PASID-capable
VF just for the purpose of reporting a free range in the PCI config space.

It's easier to do that quirk in userspace.

But I like Alex's original comment that at least for PF there is no reason
to hide the offset. there could be a flag+field to communicate it. or 
if there will be a new variant VF driver for other purposes e.g. migration
it can certainly fill the field too.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
@ 2024-04-16  9:01   ` Tian, Kevin
  2024-04-16  9:24     ` Yi Liu
  2024-04-23 12:43   ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:01 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Matthew Wilcox

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
>  void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
>  {
> +	int pasid = 0;
> +
>  	lockdep_assert_held(&vdev->dev_set->lock);
> 
> +	while (!ida_is_empty(&vdev->pasids)) {
> +		pasid = ida_get_lowest(&vdev->pasids, pasid, INT_MAX);
> +		if (pasid < 0)
> +			break;

WARN_ON as this shouldn't happen when ida is not empty.

> 
> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
> +					    u32 pasid, u32 *pt_id)

the name is too long. What about removing 'physical' as there is no
plan (unlikely) to support pasid on mdev?

> +{
> +	int rc;
> +
> +	lockdep_assert_held(&vdev->dev_set->lock);
> +
> +	if (WARN_ON(!vdev->iommufd_device))
> +		return -EINVAL;
> +
> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> +	if (rc == pasid)
> +		return iommufd_device_pasid_replace(vdev-
> >iommufd_device,
> +						    pasid, pt_id);
> +
> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
> pt_id);
> +	if (rc)
> +		return rc;
> +
> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
> +	if (rc < 0) {
> +		iommufd_device_pasid_detach(vdev->iommufd_device,
> pasid);
> +		return rc;
> +	}

I'd do simple operation (ida_alloc_range()) first before doing attach.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
@ 2024-04-16  9:13   ` Tian, Kevin
  2024-04-16  9:36     ` Yi Liu
  2024-04-23 12:45   ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:13 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> +/*
> + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE,
> VFIO_BASE + 21,
> + *					      struct
> vfio_device_pasid_attach_iommufd_pt)
> + * @argsz:	User filled size of this data.
> + * @flags:	Must be 0.
> + * @pasid:	The pasid to be attached.
> + * @pt_id:	Input the target id which can represent an ioas or a hwpt
> + *		allocated via iommufd subsystem.
> + *		Output the input ioas id or the attached hwpt id which could
> + *		be the specified hwpt itself or a hwpt automatically created
> + *		for the specified ioas by kernel during the attachment.
> + *
> + * Associate a pasid (of a cdev device) with an address space within the

remove '(of a cdev device)' as end of the paragraph has "This is only
allowed on cdev fds". Also a pasid certainly belongs to device hence
just using pasid alone is clear.

> + * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT
> or device fd
> + * close. This is only allowed on cdev fds.
> + *
> + * If a pasid is currently attached to a valid hw_pagetable (hwpt), 

remove 'hw_pagetable'. the abbreviation "hwpt" has been used
throughout this file (e.g. even when explaining @pt_id in earlier place).


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-16  9:01   ` Tian, Kevin
@ 2024-04-16  9:24     ` Yi Liu
  2024-04-16  9:47       ` Tian, Kevin
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-16  9:24 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Matthew Wilcox

On 2024/4/16 17:01, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, April 12, 2024 4:21 PM
>>
>>   void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
>>   {
>> +	int pasid = 0;
>> +
>>   	lockdep_assert_held(&vdev->dev_set->lock);
>>
>> +	while (!ida_is_empty(&vdev->pasids)) {
>> +		pasid = ida_get_lowest(&vdev->pasids, pasid, INT_MAX);
>> +		if (pasid < 0)
>> +			break;
> 
> WARN_ON as this shouldn't happen when ida is not empty.

ok.

>>
>> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
>> +					    u32 pasid, u32 *pt_id)
> 
> the name is too long. What about removing 'physical' as there is no
> plan (unlikely) to support pasid on mdev?

I'm ok to do it.

>> +{
>> +	int rc;
>> +
>> +	lockdep_assert_held(&vdev->dev_set->lock);
>> +
>> +	if (WARN_ON(!vdev->iommufd_device))
>> +		return -EINVAL;
>> +
>> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
>> +	if (rc == pasid)
>> +		return iommufd_device_pasid_replace(vdev-
>>> iommufd_device,
>> +						    pasid, pt_id);
>> +
>> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
>> pt_id);
>> +	if (rc)
>> +		return rc;
>> +
>> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
>> +	if (rc < 0) {
>> +		iommufd_device_pasid_detach(vdev->iommufd_device,
>> pasid);
>> +		return rc;
>> +	}
> 
> I'd do simple operation (ida_alloc_range()) first before doing attach.
> 

But that means we rely on the ida_alloc_range() to return -ENOSPC to
indicate the pasid is allocated, hence this attach is actually a
replacement. This is easy to be broken if ida_alloc_range() returns
-ENOSPC for other reasons in future.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-16  9:13   ` Tian, Kevin
@ 2024-04-16  9:36     ` Yi Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-16  9:36 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/16 17:13, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, April 12, 2024 4:21 PM
>>
>> +/*
>> + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE,
>> VFIO_BASE + 21,
>> + *					      struct
>> vfio_device_pasid_attach_iommufd_pt)
>> + * @argsz:	User filled size of this data.
>> + * @flags:	Must be 0.
>> + * @pasid:	The pasid to be attached.
>> + * @pt_id:	Input the target id which can represent an ioas or a hwpt
>> + *		allocated via iommufd subsystem.
>> + *		Output the input ioas id or the attached hwpt id which could
>> + *		be the specified hwpt itself or a hwpt automatically created
>> + *		for the specified ioas by kernel during the attachment.
>> + *
>> + * Associate a pasid (of a cdev device) with an address space within the
> 
> remove '(of a cdev device)' as end of the paragraph has "This is only
> allowed on cdev fds". Also a pasid certainly belongs to device hence
> just using pasid alone is clear.

ok

>> + * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT
>> or device fd
>> + * close. This is only allowed on cdev fds.
>> + *
>> + * If a pasid is currently attached to a valid hw_pagetable (hwpt),
> 
> remove 'hw_pagetable'. the abbreviation "hwpt" has been used
> throughout this file (e.g. even when explaining @pt_id in earlier place).
> 

ok

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
@ 2024-04-16  9:40   ` Tian, Kevin
  2024-04-16 17:57   ` Alex Williamson
  2024-04-23 12:39   ` Jason Gunthorpe
  2 siblings, 0 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:40 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> +static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
> +				       struct vfio_device_feature_pasid __user
> *arg,
> +				       size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct vfio_device_feature_pasid pasid = { 0 };
> +	struct pci_dev *pdev = vdev->pdev;
> +	u32 capabilities = 0;
> +	u16 ctrl = 0;
> +	int ret;
> +
> +	/*
> +	 * Due to no PASID capability per VF, to be consistent, we do not
> +	 * support SET of the PASID capability for both PF and VF.
> +	 */

/* Disallow SET of the PASID capability given it is shared by all VF's 
 * and configured implicitly by the IOMMU driver.
 */

> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(pasid));
> +	if (ret != 1)
> +		return ret;
> +
> +	/* VF shares the PASID capability of its PF */
> +	if (pdev->is_virtfn)
> +		pdev = pci_physfn(pdev);
> +
> +	if (!pdev->pasid_enabled)
> +		goto out;
> +
> +#ifdef CONFIG_PCI_PASID
> +	pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
> +			      &capabilities);
> +	pci_read_config_word(pdev, pdev->pasid_cap + PCI_PASID_CTRL,
> +			     &ctrl);
> +#endif
> +
> +	pasid.width = (capabilities >> 8) & 0x1f;

it's cleaner to have helpers instead of directly checking CONFIG_PCI_XXX here.

there is an existing helper for the width: pci_max_pasids()

pci_pasid_features() can report supported features but not the actual
enabled set.

for enabled features it's already stored in pdev->pasid_features. so what's
required here is probably a new pci_pasid_enabled_features() to return
that field.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-16  9:24     ` Yi Liu
@ 2024-04-16  9:47       ` Tian, Kevin
  2024-04-18  7:04         ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:47 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Matthew Wilcox

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, April 16, 2024 5:25 PM
> 
> On 2024/4/16 17:01, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Friday, April 12, 2024 4:21 PM
> >>
> >> +
> >> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> >> +	if (rc == pasid)
> >> +		return iommufd_device_pasid_replace(vdev-
> >>> iommufd_device,
> >> +						    pasid, pt_id);
> >> +
> >> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
> >> pt_id);
> >> +	if (rc)
> >> +		return rc;
> >> +
> >> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
> >> +	if (rc < 0) {
> >> +		iommufd_device_pasid_detach(vdev->iommufd_device,
> >> pasid);
> >> +		return rc;
> >> +	}
> >
> > I'd do simple operation (ida_alloc_range()) first before doing attach.
> >
> 
> But that means we rely on the ida_alloc_range() to return -ENOSPC to
> indicate the pasid is allocated, hence this attach is actually a
> replacement. This is easy to be broken if ida_alloc_range() returns
> -ENOSPC for other reasons in future.
> 

ida_alloc_range() could fail for other reasons e.g. -ENOMEM.

in case I didn't make it clear I just meant to swap the order
between iommufd_device_pasid_attach() and ida_alloc_range().

replacement is still checked against ida_get_lowest().

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
@ 2024-04-16 16:03   ` Alex Williamson
  2024-04-18  7:02     ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-16 16:03 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Fri, 12 Apr 2024 01:21:18 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> There is no helpers for user to check if a given ID is allocated or not,
> neither a helper to loop all the allocated IDs in an IDA and do something
> for cleanup. With the two needs, a helper to get the lowest allocated ID
> of a range can help to achieve it.
> 
> Caller can check if a given ID is allocated or not by:
> 	int id = 200, rc;
> 
> 	rc = ida_get_lowest(&ida, id, id);
> 	if (rc == id)
> 		//id 200 is used
> 	else
> 		//id 200 is not used
> 
> Caller can iterate all allocated IDs by:
> 	int id = 0;
> 
> 	while (!ida_is_empty(&pasid_ida)) {
> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
> 		if (id < 0)
> 			break;
> 		//anything to do with the allocated ID
> 		ida_free(pasid_ida, pasid);
> 	}
> 
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/linux/idr.h |  1 +
>  lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/include/linux/idr.h b/include/linux/idr.h
> index da5f5fa4a3a6..1dae71d4a75d 100644
> --- a/include/linux/idr.h
> +++ b/include/linux/idr.h
> @@ -257,6 +257,7 @@ struct ida {
>  int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
>  void ida_free(struct ida *, unsigned int id);
>  void ida_destroy(struct ida *ida);
> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
>  
>  /**
>   * ida_alloc() - Allocate an unused ID.
> diff --git a/lib/idr.c b/lib/idr.c
> index da36054c3ca0..03e461242fe2 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
>  }
>  EXPORT_SYMBOL(ida_alloc_range);
>  
> +/**
> + * ida_get_lowest - Get the lowest used ID.
> + * @ida: IDA handle.
> + * @min: Lowest ID to get.
> + * @max: Highest ID to get.
> + *
> + * Get the lowest used ID between @min and @max, inclusive.  The returned
> + * ID will not exceed %INT_MAX, even if @max is larger.
> + *
> + * Context: Any context. Takes and releases the xa_lock.
> + * Return: The lowest used ID, or errno if no used ID is found.
> + */
> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
> +{
> +	unsigned long index = min / IDA_BITMAP_BITS;
> +	unsigned int offset = min % IDA_BITMAP_BITS;
> +	unsigned long *addr, size, bit;
> +	unsigned long flags;
> +	void *entry;
> +	int ret;
> +
> +	if (min >= INT_MAX)
> +		return -EINVAL;
> +	if (max >= INT_MAX)
> +		max = INT_MAX;
> +

Could these be made consistent with the test in ida_alloc_range(), ie:

	if ((int)min < 0)
		return -EINVAL;
	if ((int)max < 0)
		max = INT_MAX;


> +	xa_lock_irqsave(&ida->xa, flags);
> +
> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
> +	if (!entry) {
> +		ret = -ENOTTY;

-ENOENT?  Same for all below too.

> +		goto err_unlock;
> +	}
> +
> +	if (index > min / IDA_BITMAP_BITS)
> +		offset = 0;
> +	if (index * IDA_BITMAP_BITS + offset > max) {
> +		ret = -ENOTTY;
> +		goto err_unlock;
> +	}
> +
> +	if (xa_is_value(entry)) {
> +		unsigned long tmp = xa_to_value(entry);
> +
> +		addr = &tmp;
> +		size = BITS_PER_XA_VALUE;
> +	} else {
> +		addr = ((struct ida_bitmap *)entry)->bitmap;
> +		size = IDA_BITMAP_BITS;
> +	}
> +
> +	bit = find_next_bit(addr, size, offset);
> +
> +	xa_unlock_irqrestore(&ida->xa, flags);
> +
> +	if (bit == size ||
> +	    index * IDA_BITMAP_BITS + bit > max)
> +		return -ENOTTY;
> +
> +	return index * IDA_BITMAP_BITS + bit;
> +
> +err_unlock:
> +	xa_unlock_irqrestore(&ida->xa, flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ida_get_lowest);

The API is a bit awkward to me, I wonder if it might be helped with
some renaming and wrappers...

int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);

bool ida_exists(struct ida *ida, unsigned int id)
{
	return ida_find_first_range(ida, id, id) == id;
}

int ida_find_first(struct ida *ida)
{
	return ida_find_first_range(ida, 0, ~0);
}

_min and _max variations of the latter would align with existing
ida_alloc variants, but maybe no need to add them preemptively.

Possibly an ida_for_each() could be useful in the use case of
disassociating each id, but not required for the brute force iterative
method.  Thanks,

Alex

> +
>  /**
>   * ida_free() - Release an allocated ID.
>   * @ida: IDA handle.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
@ 2024-04-16 17:50   ` Jason Gunthorpe
  2024-04-17  7:16     ` Tian, Kevin
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-16 17:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson@redhat.com, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, April 12, 2024 4:21 PM
> > 
> > A userspace VMM is supposed to get the details of the device's PASID
> > capability
> > and assemble a virtual PASID capability in a proper offset in the virtual PCI
> > configuration space. While it is still an open on how to get the available
> > offsets. Devices may have hidden bits that are not in the PCI cap chain. For
> > now, there are two options to get the available offsets.[2]
> > 
> > - Report the available offsets via ioctl. This requires device-specific logic
> >   to provide available offsets. e.g., vfio-pci variant driver. Or may the device
> >   provide the available offset by DVSEC.
> > - Store the available offsets in a static table in userspace VMM. VMM gets the
> >   empty offsets from this table.
> > 
> 
> I'm not a fan of requesting a variant driver for every PASID-capable
> VF just for the purpose of reporting a free range in the PCI config space.
> 
> It's easier to do that quirk in userspace.
> 
> But I like Alex's original comment that at least for PF there is no reason
> to hide the offset. there could be a flag+field to communicate it. or 
> if there will be a new variant VF driver for other purposes e.g. migration
> it can certainly fill the field too.

Yes, since this has been such a sticking point can we get a clean
series that just enables it for PF and then come with a solution for
VF?

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
  2024-04-16  9:40   ` Tian, Kevin
@ 2024-04-16 17:57   ` Alex Williamson
  2024-04-17  7:09     ` Tian, Kevin
  2024-04-23 12:39   ` Jason Gunthorpe
  2 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-16 17:57 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

On Fri, 12 Apr 2024 01:21:21 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> other PCI capabilities, PASID capability is going to be reported to user by
> VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by it.
> This is a bit different from the other capabilities which are reported to
> userspace when the user reads the device's PCI configuration space. There
> are two reasons for this.
> 
>  - First, userspace like Qemu by default exposes all the available PCI
>    capabilities in vfio-pci config space to the guest as read-only, so
>    adding PASID capability in the vfio-pci config space will make it
>    exposed to the guest automatically while an old Qemu doesn't really
>    support it.
> 
>  - Second, the PASID capability does not exist on VFs (instead shares the
>    cap of the PF). Creating a virtual PASID capability in vfio-pci config
>    space needs to find a hole to place it, but doing so may require device
>    specific knowledge to avoid potential conflict with device specific
>    registers like hidden bits in VF's config space. It's simpler to move
>    this burden to the VMM instead of maintaining a quirk system in the kernel.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h        | 14 +++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d94d61b92c1a..ca64e461d435 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1495,6 +1495,54 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  	return 0;
>  }
>  
> +static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
> +				       struct vfio_device_feature_pasid __user *arg,
> +				       size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct vfio_device_feature_pasid pasid = { 0 };
> +	struct pci_dev *pdev = vdev->pdev;
> +	u32 capabilities = 0;
> +	u16 ctrl = 0;
> +	int ret;
> +
> +	/*
> +	 * Due to no PASID capability per VF, to be consistent, we do not
> +	 * support SET of the PASID capability for both PF and VF.
> +	 */
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(pasid));
> +	if (ret != 1)
> +		return ret;
> +
> +	/* VF shares the PASID capability of its PF */
> +	if (pdev->is_virtfn)
> +		pdev = pci_physfn(pdev);
> +
> +	if (!pdev->pasid_enabled)
> +		goto out;
> +
> +#ifdef CONFIG_PCI_PASID
> +	pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
> +			      &capabilities);
> +	pci_read_config_word(pdev, pdev->pasid_cap + PCI_PASID_CTRL,
> +			     &ctrl);
> +#endif
> +
> +	pasid.width = (capabilities >> 8) & 0x1f;
> +
> +	if (ctrl & PCI_PASID_CTRL_EXEC)
> +		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_EXEC;
> +	if (ctrl & PCI_PASID_CTRL_PRIV)
> +		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_PRIV;

I agree with Kevin here, let's make use of and add helpers to avoid
#ifdef blocks of code.

> +
> +out:
> +	if (copy_to_user(arg, &pasid, sizeof(pasid)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
>  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  				void __user *arg, size_t argsz)
>  {
> @@ -1508,6 +1556,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_PASID:
> +		return vfio_pci_core_feature_pasid(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9591dc24b75c..e50e55c67ab4 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1513,6 +1513,20 @@ struct vfio_device_feature_bus_master {
>  };
>  #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET, return the PASID capability for the device.
> + * Zero width means no support for PASID.

Why would we do that rather than reporting the feature as unsupported?
Just return -ENOTTY if PASID is not supported or enabled.

> + */
> +struct vfio_device_feature_pasid {
> +	__u16 capabilities;
> +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> +	__u8 width;
> +	__u8 __reserved;
> +};

Building on Kevin's comment on the cover letter, if we could describe
an offset for emulating a PASID capability, this seems like the place
we'd do it.  I think we're not doing that because we'd like an in-band
mechanism for a device to report unused config space, such as a DVSEC
capability, so that it can be implemented on a physical device.  As
noted in the commit log here, we'd also prefer not to bloat the kernel
with more device quirks.

In an ideal world we might be able to jump start support of that DVSEC
option by emulating the DVSEC capability on top of the PASID capability
for PFs, but unfortunately the PASID capability is 8 bytes while the
DVSEC capability is at least 12 bytes, so we can't implement that
generically either.

I don't know there's any good solution here or whether there's actually
any value to the PASID capability on a PF, but do we need to consider
leaving a field+flag here to describe the offset for that scenario?
Would we then allow variant drivers to take advantage of it?  Does this
then turn into the quirk that we're trying to avoid in the kernel
rather than userspace and is that a problem?  Thanks,

Alex

> +
> +#define VFIO_DEVICE_FEATURE_PASID 11
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-16 17:57   ` Alex Williamson
@ 2024-04-17  7:09     ` Tian, Kevin
  2024-04-17 20:25       ` Alex Williamson
  0 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-17  7:09 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 17, 2024 1:57 AM
> 
> On Fri, 12 Apr 2024 01:21:21 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > + */
> > +struct vfio_device_feature_pasid {
> > +	__u16 capabilities;
> > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > +	__u8 width;
> > +	__u8 __reserved;
> > +};
> 
> Building on Kevin's comment on the cover letter, if we could describe
> an offset for emulating a PASID capability, this seems like the place
> we'd do it.  I think we're not doing that because we'd like an in-band
> mechanism for a device to report unused config space, such as a DVSEC
> capability, so that it can be implemented on a physical device.  As
> noted in the commit log here, we'd also prefer not to bloat the kernel
> with more device quirks.
> 
> In an ideal world we might be able to jump start support of that DVSEC
> option by emulating the DVSEC capability on top of the PASID capability
> for PFs, but unfortunately the PASID capability is 8 bytes while the
> DVSEC capability is at least 12 bytes, so we can't implement that
> generically either.

Yeah, that's a problem.

> 
> I don't know there's any good solution here or whether there's actually
> any value to the PASID capability on a PF, but do we need to consider
> leaving a field+flag here to describe the offset for that scenario?

Yes, I prefer to this way.

> Would we then allow variant drivers to take advantage of it?  Does this
> then turn into the quirk that we're trying to avoid in the kernel
> rather than userspace and is that a problem?  Thanks,
> 

We don't want to proactively pursue quirks in the kernel.

But if a variant driver exists for other reasons, I don't see why it 
should be prohibited from deciding an offset to ease the
userspace. 😊

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-16 17:50   ` Jason Gunthorpe
@ 2024-04-17  7:16     ` Tian, Kevin
  2024-04-17 12:20       ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-17  7:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson@redhat.com, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 17, 2024 1:50 AM
> 
> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, April 12, 2024 4:21 PM
> > >
> > > A userspace VMM is supposed to get the details of the device's PASID
> > > capability
> > > and assemble a virtual PASID capability in a proper offset in the virtual
> PCI
> > > configuration space. While it is still an open on how to get the available
> > > offsets. Devices may have hidden bits that are not in the PCI cap chain.
> For
> > > now, there are two options to get the available offsets.[2]
> > >
> > > - Report the available offsets via ioctl. This requires device-specific logic
> > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> device
> > >   provide the available offset by DVSEC.
> > > - Store the available offsets in a static table in userspace VMM. VMM gets
> the
> > >   empty offsets from this table.
> > >
> >
> > I'm not a fan of requesting a variant driver for every PASID-capable
> > VF just for the purpose of reporting a free range in the PCI config space.
> >
> > It's easier to do that quirk in userspace.
> >
> > But I like Alex's original comment that at least for PF there is no reason
> > to hide the offset. there could be a flag+field to communicate it. or
> > if there will be a new variant VF driver for other purposes e.g. migration
> > it can certainly fill the field too.
> 
> Yes, since this has been such a sticking point can we get a clean
> series that just enables it for PF and then come with a solution for
> VF?
> 

sure but we at least need to reach consensus on a minimal required
uapi covering both PF/VF to move forward so the user doesn't need
to touch different contracts for PF vs. VF.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17  7:16     ` Tian, Kevin
@ 2024-04-17 12:20       ` Jason Gunthorpe
  2024-04-17 23:02         ` Alex Williamson
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-17 12:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson@redhat.com, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 17, 2024 1:50 AM
> > 
> > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Friday, April 12, 2024 4:21 PM
> > > >
> > > > A userspace VMM is supposed to get the details of the device's PASID
> > > > capability
> > > > and assemble a virtual PASID capability in a proper offset in the virtual
> > PCI
> > > > configuration space. While it is still an open on how to get the available
> > > > offsets. Devices may have hidden bits that are not in the PCI cap chain.
> > For
> > > > now, there are two options to get the available offsets.[2]
> > > >
> > > > - Report the available offsets via ioctl. This requires device-specific logic
> > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> > device
> > > >   provide the available offset by DVSEC.
> > > > - Store the available offsets in a static table in userspace VMM. VMM gets
> > the
> > > >   empty offsets from this table.
> > > >
> > >
> > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > VF just for the purpose of reporting a free range in the PCI config space.
> > >
> > > It's easier to do that quirk in userspace.
> > >
> > > But I like Alex's original comment that at least for PF there is no reason
> > > to hide the offset. there could be a flag+field to communicate it. or
> > > if there will be a new variant VF driver for other purposes e.g. migration
> > > it can certainly fill the field too.
> > 
> > Yes, since this has been such a sticking point can we get a clean
> > series that just enables it for PF and then come with a solution for
> > VF?
> > 
> 
> sure but we at least need to reach consensus on a minimal required
> uapi covering both PF/VF to move forward so the user doesn't need
> to touch different contracts for PF vs. VF.

Do we? The situation where the VMM needs to wholly make a up a PASID
capability seems completely new and seperate from just using an
existing PASID capability as in the PF case.

If it needs to make another system call or otherwise to do that then
that seems fine to do incrementally??

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-17  7:09     ` Tian, Kevin
@ 2024-04-17 20:25       ` Alex Williamson
  2024-04-18  0:21         ` Tian, Kevin
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-17 20:25 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, 17 Apr 2024 07:09:52 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 17, 2024 1:57 AM
> > 
> > On Fri, 12 Apr 2024 01:21:21 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > + */
> > > +struct vfio_device_feature_pasid {
> > > +	__u16 capabilities;
> > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > +	__u8 width;
> > > +	__u8 __reserved;
> > > +};  
> > 
> > Building on Kevin's comment on the cover letter, if we could describe
> > an offset for emulating a PASID capability, this seems like the place
> > we'd do it.  I think we're not doing that because we'd like an in-band
> > mechanism for a device to report unused config space, such as a DVSEC
> > capability, so that it can be implemented on a physical device.  As
> > noted in the commit log here, we'd also prefer not to bloat the kernel
> > with more device quirks.
> > 
> > In an ideal world we might be able to jump start support of that DVSEC
> > option by emulating the DVSEC capability on top of the PASID capability
> > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > DVSEC capability is at least 12 bytes, so we can't implement that
> > generically either.  
> 
> Yeah, that's a problem.
> 
> > 
> > I don't know there's any good solution here or whether there's actually
> > any value to the PASID capability on a PF, but do we need to consider
> > leaving a field+flag here to describe the offset for that scenario?  
> 
> Yes, I prefer to this way.
> 
> > Would we then allow variant drivers to take advantage of it?  Does this
> > then turn into the quirk that we're trying to avoid in the kernel
> > rather than userspace and is that a problem?  Thanks,
> >   
> 
> We don't want to proactively pursue quirks in the kernel.
> 
> But if a variant driver exists for other reasons, I don't see why it 
> should be prohibited from deciding an offset to ease the
> userspace. 😊

At that point we've turned the corner into an arbitrary policy decision
that I can't defend.  A "worthy" variant driver can implement something
through a side channel vfio API, but implementing that side channel
itself is not enough to justify a variant driver?  It doesn't make
sense.

Further, if we have a variant driver, why do we need a side channel for
the purpose of describing available config space when we expect devices
themselves to eventually describe the same through a DVSEC capability?
The purpose of enabling variant drivers is to enhance the functionality
of the device.  Adding an emulated DVSEC capability seems like a valid
enhancement to justify a variant driver to me.

So the more I think about it, it would be easy to add something here
that hints a location for an emulated PASID capability in the VMM, but
it would also be counterproductive to an end goal of having a DVSEC
capability that describes unused config space.  The very narrow scope
where that side-band channel would be useful is an unknown PF device
which doesn't implement a DVSEC capability and without intervention
simply behaves as it always has, without PASID support.

A vendor desiring such support can a) implement DVSEC in the hardware,
b) implement a variant driver emulating a DVSEC capability, or c)
directly modify the VMM to tell it where to place the PASID capability.
I also don't think we should exclude the possibility that b) could turn
into a shared variant driver that knows about multiple devices and has
a table of free config space for each.  Option c) is only the last
resort if there's not already 12 bytes of contiguous, aligned free
space to place a DVSEC capability.  That seems unlikely.

At some point we need to define the format and use of this DVSEC.  Do
we allow (not require) one at every gap in config space that's at least
12-bytes long and adjust the DVSEC Length to describe longer gaps, or do
we use a single DVSEC to describe a table of ranges throughout extended
(maybe even conventional) config space?  The former seems easier,
especially if we expect a device has a large block of free space,
enough for multiple emulated capabilities and described by a single
DVSEC.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17 12:20       ` Jason Gunthorpe
@ 2024-04-17 23:02         ` Alex Williamson
  2024-04-18  0:06           ` Tian, Kevin
  2024-04-19 13:34           ` Jason Gunthorpe
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-04-17 23:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, 17 Apr 2024 09:20:51 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, April 17, 2024 1:50 AM
> > > 
> > > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Friday, April 12, 2024 4:21 PM
> > > > >
> > > > > A userspace VMM is supposed to get the details of the device's PASID
> > > > > capability
> > > > > and assemble a virtual PASID capability in a proper offset in the virtual  
> > > PCI  
> > > > > configuration space. While it is still an open on how to get the available
> > > > > offsets. Devices may have hidden bits that are not in the PCI cap chain.  
> > > For  
> > > > > now, there are two options to get the available offsets.[2]
> > > > >
> > > > > - Report the available offsets via ioctl. This requires device-specific logic
> > > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> > > device  
> > > > >   provide the available offset by DVSEC.
> > > > > - Store the available offsets in a static table in userspace VMM. VMM gets  
> > > the  
> > > > >   empty offsets from this table.
> > > > >  
> > > >
> > > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > > VF just for the purpose of reporting a free range in the PCI config space.
> > > >
> > > > It's easier to do that quirk in userspace.
> > > >
> > > > But I like Alex's original comment that at least for PF there is no reason
> > > > to hide the offset. there could be a flag+field to communicate it. or
> > > > if there will be a new variant VF driver for other purposes e.g. migration
> > > > it can certainly fill the field too.  
> > > 
> > > Yes, since this has been such a sticking point can we get a clean
> > > series that just enables it for PF and then come with a solution for
> > > VF?
> > >   
> > 
> > sure but we at least need to reach consensus on a minimal required
> > uapi covering both PF/VF to move forward so the user doesn't need
> > to touch different contracts for PF vs. VF.  
> 
> Do we? The situation where the VMM needs to wholly make a up a PASID
> capability seems completely new and seperate from just using an
> existing PASID capability as in the PF case.

But we don't actually expose the PASID capability on the PF and as
argued in path 4/ we can't because it would break existing userspace.

> If it needs to make another system call or otherwise to do that then
> that seems fine to do incrementally??

With PASID currently hidden, VF and PF support really seem too similar
not to handle them both at the same time.  What's missing is a
mechanism to describe unused config space where userspace can implement
an emulated PASID capability.  Defining a DVSEC capability to handle
this seemed to be the preferred solution from the LPC discussion.  PF
and VF devices that implement a TBD DVSEC capability could avoid
needing a variant driver.  Until then we need to be careful not to
undermine a future world with that preferred solution by introducing
side-band mechanisms which only work for variant drivers and PF
devices.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17 23:02         ` Alex Williamson
@ 2024-04-18  0:06           ` Tian, Kevin
  2024-04-18  9:03             ` Yi Liu
  2024-04-19 13:34           ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-18  0:06 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 18, 2024 7:02 AM
> 
> On Wed, 17 Apr 2024 09:20:51 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, April 17, 2024 1:50 AM
> > > >
> > > > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Friday, April 12, 2024 4:21 PM
> > > > > >
> > > > > > A userspace VMM is supposed to get the details of the device's
> PASID
> > > > > > capability
> > > > > > and assemble a virtual PASID capability in a proper offset in the
> virtual
> > > > PCI
> > > > > > configuration space. While it is still an open on how to get the
> available
> > > > > > offsets. Devices may have hidden bits that are not in the PCI cap
> chain.
> > > > For
> > > > > > now, there are two options to get the available offsets.[2]
> > > > > >
> > > > > > - Report the available offsets via ioctl. This requires device-specific
> logic
> > > > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> > > > device
> > > > > >   provide the available offset by DVSEC.
> > > > > > - Store the available offsets in a static table in userspace VMM.
> VMM gets
> > > > the
> > > > > >   empty offsets from this table.
> > > > > >
> > > > >
> > > > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > > > VF just for the purpose of reporting a free range in the PCI config
> space.
> > > > >
> > > > > It's easier to do that quirk in userspace.
> > > > >
> > > > > But I like Alex's original comment that at least for PF there is no
> reason
> > > > > to hide the offset. there could be a flag+field to communicate it. or
> > > > > if there will be a new variant VF driver for other purposes e.g.
> migration
> > > > > it can certainly fill the field too.
> > > >
> > > > Yes, since this has been such a sticking point can we get a clean
> > > > series that just enables it for PF and then come with a solution for
> > > > VF?
> > > >
> > >
> > > sure but we at least need to reach consensus on a minimal required
> > > uapi covering both PF/VF to move forward so the user doesn't need
> > > to touch different contracts for PF vs. VF.
> >
> > Do we? The situation where the VMM needs to wholly make a up a PASID
> > capability seems completely new and seperate from just using an
> > existing PASID capability as in the PF case.
> 
> But we don't actually expose the PASID capability on the PF and as
> argued in path 4/ we can't because it would break existing userspace.

Come back to this statement.

Does 'break' means that legacy Qemu will crash due to a guest write
to the read-only PASID capability, or just a conceptually functional
break i.e. non-faithful emulation due to writes being dropped?

If the latter it's probably not a bad idea to allow exposing the PASID
capability on the PF as a sane guest shouldn't enable the PASID
capability w/o seeing vIOMMU supporting PASID. And there is no
status bit defined in the PASID capability to check back so even
if an insane guest wants to blindly enable PASID it will naturally
write and done. The only niche case is that the enable bits are
defined as RW so ideally reading back those bits should get the
latest written value. But probably this can be tolerated?

With that then should we consider exposing the PASID capability
in PCI config space as the first option? For PF it's simple as how
other caps are exposed. For VF a variant driver can also fake the
PASID capability or emulate a DVSEC capability for unused space
(to motivate the physical implementation so no variant driver is
required in the future)

This allows a staging approach as Jason envisioned. 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-17 20:25       ` Alex Williamson
@ 2024-04-18  0:21         ` Tian, Kevin
  2024-04-18  8:23           ` Yi Liu
  2024-04-18 16:34           ` Alex Williamson
  0 siblings, 2 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-04-18  0:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 18, 2024 4:26 AM
> 
> On Wed, 17 Apr 2024 07:09:52 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, April 17, 2024 1:57 AM
> > >
> > > On Fri, 12 Apr 2024 01:21:21 -0700
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > + */
> > > > +struct vfio_device_feature_pasid {
> > > > +	__u16 capabilities;
> > > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > > +	__u8 width;
> > > > +	__u8 __reserved;
> > > > +};
> > >
> > > Building on Kevin's comment on the cover letter, if we could describe
> > > an offset for emulating a PASID capability, this seems like the place
> > > we'd do it.  I think we're not doing that because we'd like an in-band
> > > mechanism for a device to report unused config space, such as a DVSEC
> > > capability, so that it can be implemented on a physical device.  As
> > > noted in the commit log here, we'd also prefer not to bloat the kernel
> > > with more device quirks.
> > >
> > > In an ideal world we might be able to jump start support of that DVSEC
> > > option by emulating the DVSEC capability on top of the PASID capability
> > > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > > DVSEC capability is at least 12 bytes, so we can't implement that
> > > generically either.
> >
> > Yeah, that's a problem.
> >
> > >
> > > I don't know there's any good solution here or whether there's actually
> > > any value to the PASID capability on a PF, but do we need to consider
> > > leaving a field+flag here to describe the offset for that scenario?
> >
> > Yes, I prefer to this way.
> >
> > > Would we then allow variant drivers to take advantage of it?  Does this
> > > then turn into the quirk that we're trying to avoid in the kernel
> > > rather than userspace and is that a problem?  Thanks,
> > >
> >
> > We don't want to proactively pursue quirks in the kernel.
> >
> > But if a variant driver exists for other reasons, I don't see why it
> > should be prohibited from deciding an offset to ease the
> > userspace. 😊
> 
> At that point we've turned the corner into an arbitrary policy decision
> that I can't defend.  A "worthy" variant driver can implement something
> through a side channel vfio API, but implementing that side channel
> itself is not enough to justify a variant driver?  It doesn't make
> sense.
> 
> Further, if we have a variant driver, why do we need a side channel for
> the purpose of describing available config space when we expect devices
> themselves to eventually describe the same through a DVSEC capability?
> The purpose of enabling variant drivers is to enhance the functionality
> of the device.  Adding an emulated DVSEC capability seems like a valid
> enhancement to justify a variant driver to me.
> 
> So the more I think about it, it would be easy to add something here
> that hints a location for an emulated PASID capability in the VMM, but
> it would also be counterproductive to an end goal of having a DVSEC
> capability that describes unused config space.  The very narrow scope
> where that side-band channel would be useful is an unknown PF device
> which doesn't implement a DVSEC capability and without intervention
> simply behaves as it always has, without PASID support.
> 
> A vendor desiring such support can a) implement DVSEC in the hardware,
> b) implement a variant driver emulating a DVSEC capability, or c)
> directly modify the VMM to tell it where to place the PASID capability.
> I also don't think we should exclude the possibility that b) could turn
> into a shared variant driver that knows about multiple devices and has
> a table of free config space for each.  Option c) is only the last
> resort if there's not already 12 bytes of contiguous, aligned free
> space to place a DVSEC capability.  That seems unlikely.

or b) could be a table in vfio_pci_config.c i.e. kind of making vfio-pci
as the shared variant driver.

> 
> At some point we need to define the format and use of this DVSEC.  Do
> we allow (not require) one at every gap in config space that's at least
> 12-bytes long and adjust the DVSEC Length to describe longer gaps, or do

Does PCI spec allows multiple same-type capabilities co-existing?

> we use a single DVSEC to describe a table of ranges throughout extended
> (maybe even conventional) config space?  The former seems easier,

this might be challenging as the table itself requires a contiguous
large free block.

> especially if we expect a device has a large block of free space,
> enough for multiple emulated capabilities and described by a single
> DVSEC.  Thanks,
> 

yes that sounds simpler.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-16 16:03   ` Alex Williamson
@ 2024-04-18  7:02     ` Yi Liu
  2024-04-18 16:23       ` Alex Williamson
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-18  7:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/17 00:03, Alex Williamson wrote:
> On Fri, 12 Apr 2024 01:21:18 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> There is no helpers for user to check if a given ID is allocated or not,
>> neither a helper to loop all the allocated IDs in an IDA and do something
>> for cleanup. With the two needs, a helper to get the lowest allocated ID
>> of a range can help to achieve it.
>>
>> Caller can check if a given ID is allocated or not by:
>> 	int id = 200, rc;
>>
>> 	rc = ida_get_lowest(&ida, id, id);
>> 	if (rc == id)
>> 		//id 200 is used
>> 	else
>> 		//id 200 is not used
>>
>> Caller can iterate all allocated IDs by:
>> 	int id = 0;
>>
>> 	while (!ida_is_empty(&pasid_ida)) {
>> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
>> 		if (id < 0)
>> 			break;
>> 		//anything to do with the allocated ID
>> 		ida_free(pasid_ida, pasid);
>> 	}
>>
>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   include/linux/idr.h |  1 +
>>   lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 68 insertions(+)
>>
>> diff --git a/include/linux/idr.h b/include/linux/idr.h
>> index da5f5fa4a3a6..1dae71d4a75d 100644
>> --- a/include/linux/idr.h
>> +++ b/include/linux/idr.h
>> @@ -257,6 +257,7 @@ struct ida {
>>   int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
>>   void ida_free(struct ida *, unsigned int id);
>>   void ida_destroy(struct ida *ida);
>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
>>   
>>   /**
>>    * ida_alloc() - Allocate an unused ID.
>> diff --git a/lib/idr.c b/lib/idr.c
>> index da36054c3ca0..03e461242fe2 100644
>> --- a/lib/idr.c
>> +++ b/lib/idr.c
>> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
>>   }
>>   EXPORT_SYMBOL(ida_alloc_range);
>>   
>> +/**
>> + * ida_get_lowest - Get the lowest used ID.
>> + * @ida: IDA handle.
>> + * @min: Lowest ID to get.
>> + * @max: Highest ID to get.
>> + *
>> + * Get the lowest used ID between @min and @max, inclusive.  The returned
>> + * ID will not exceed %INT_MAX, even if @max is larger.
>> + *
>> + * Context: Any context. Takes and releases the xa_lock.
>> + * Return: The lowest used ID, or errno if no used ID is found.
>> + */
>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
>> +{
>> +	unsigned long index = min / IDA_BITMAP_BITS;
>> +	unsigned int offset = min % IDA_BITMAP_BITS;
>> +	unsigned long *addr, size, bit;
>> +	unsigned long flags;
>> +	void *entry;
>> +	int ret;
>> +
>> +	if (min >= INT_MAX)
>> +		return -EINVAL;
>> +	if (max >= INT_MAX)
>> +		max = INT_MAX;
>> +
> 
> Could these be made consistent with the test in ida_alloc_range(), ie:
> 
> 	if ((int)min < 0)
> 		return -EINVAL;
> 	if ((int)max < 0)
> 		max = INT_MAX;
> 

sure.

>> +	xa_lock_irqsave(&ida->xa, flags);
>> +
>> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
>> +	if (!entry) {
>> +		ret = -ENOTTY;
> 
> -ENOENT?  Same for all below too.

I see.

>> +		goto err_unlock;
>> +	}
>> +
>> +	if (index > min / IDA_BITMAP_BITS)
>> +		offset = 0;
>> +	if (index * IDA_BITMAP_BITS + offset > max) {
>> +		ret = -ENOTTY;
>> +		goto err_unlock;
>> +	}
>> +
>> +	if (xa_is_value(entry)) {
>> +		unsigned long tmp = xa_to_value(entry);
>> +
>> +		addr = &tmp;
>> +		size = BITS_PER_XA_VALUE;
>> +	} else {
>> +		addr = ((struct ida_bitmap *)entry)->bitmap;
>> +		size = IDA_BITMAP_BITS;
>> +	}
>> +
>> +	bit = find_next_bit(addr, size, offset);
>> +
>> +	xa_unlock_irqrestore(&ida->xa, flags);
>> +
>> +	if (bit == size ||
>> +	    index * IDA_BITMAP_BITS + bit > max)
>> +		return -ENOTTY;
>> +
>> +	return index * IDA_BITMAP_BITS + bit;
>> +
>> +err_unlock:
>> +	xa_unlock_irqrestore(&ida->xa, flags);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(ida_get_lowest);
> 
> The API is a bit awkward to me, I wonder if it might be helped with
> some renaming and wrappers...
> 
> int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);

ok.

> bool ida_exists(struct ida *ida, unsigned int id)
> {
> 	return ida_find_first_range(ida, id, id) == id;
> }

this does helps in next patch.

> 
> int ida_find_first(struct ida *ida)
> {
> 	return ida_find_first_range(ida, 0, ~0);
> }
>

perhaps it can be added in future. This series has two usages. One is to
check if a given ID is allocated. This can be covered by your ida_exists().
Another usage is to loop each IDs, do detach and free. This can still use
the ida_find_first_range() like the example in the commit message. The
first loop starts from 0, and next would start from the last found ID.
This may be more efficient than starts from 0 in every loop.


> _min and _max variations of the latter would align with existing
> ida_alloc variants, but maybe no need to add them preemptively.

yes.

> Possibly an ida_for_each() could be useful in the use case of
> disassociating each id, but not required for the brute force iterative
> method.  Thanks,

yep. maybe we can start with the below code, no need for ida_for_each()
today.


  	int id = 0;

  	while (!ida_is_empty(&pasid_ida)) {
  		id = ida_find_first_range(pasid_ida, id, INT_MAX);
  		if (unlikely(WARN_ON(id < 0))
			break;
  		iommufd_device_pasid_detach();
  		ida_free(pasid_ida, pasid);
  	}

> 
>> +
>>   /**
>>    * ida_free() - Release an allocated ID.
>>    * @ida: IDA handle.
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-16  9:47       ` Tian, Kevin
@ 2024-04-18  7:04         ` Yi Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-18  7:04 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Matthew Wilcox



On 2024/4/16 17:47, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Tuesday, April 16, 2024 5:25 PM
>>
>> On 2024/4/16 17:01, Tian, Kevin wrote:
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Friday, April 12, 2024 4:21 PM
>>>>
>>>> +
>>>> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
>>>> +	if (rc == pasid)
>>>> +		return iommufd_device_pasid_replace(vdev-
>>>>> iommufd_device,
>>>> +						    pasid, pt_id);
>>>> +
>>>> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
>>>> pt_id);
>>>> +	if (rc)
>>>> +		return rc;
>>>> +
>>>> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
>>>> +	if (rc < 0) {
>>>> +		iommufd_device_pasid_detach(vdev->iommufd_device,
>>>> pasid);
>>>> +		return rc;
>>>> +	}
>>>
>>> I'd do simple operation (ida_alloc_range()) first before doing attach.
>>>
>>
>> But that means we rely on the ida_alloc_range() to return -ENOSPC to
>> indicate the pasid is allocated, hence this attach is actually a
>> replacement. This is easy to be broken if ida_alloc_range() returns
>> -ENOSPC for other reasons in future.
>>
> 
> ida_alloc_range() could fail for other reasons e.g. -ENOMEM.
> 
> in case I didn't make it clear I just meant to swap the order
> between iommufd_device_pasid_attach() and ida_alloc_range().
> 
> replacement is still checked against ida_get_lowest().

aha, I see.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-18  0:21         ` Tian, Kevin
@ 2024-04-18  8:23           ` Yi Liu
  2024-04-18 16:34           ` Alex Williamson
  1 sibling, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-18  8:23 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/18 08:21, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Thursday, April 18, 2024 4:26 AM
>>
>> On Wed, 17 Apr 2024 07:09:52 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>> Sent: Wednesday, April 17, 2024 1:57 AM
>>>>
>>>> On Fri, 12 Apr 2024 01:21:21 -0700
>>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>>
>>>>> + */
>>>>> +struct vfio_device_feature_pasid {
>>>>> +	__u16 capabilities;
>>>>> +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
>>>>> +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
>>>>> +	__u8 width;
>>>>> +	__u8 __reserved;
>>>>> +};
>>>>
>>>> Building on Kevin's comment on the cover letter, if we could describe
>>>> an offset for emulating a PASID capability, this seems like the place
>>>> we'd do it.  I think we're not doing that because we'd like an in-band
>>>> mechanism for a device to report unused config space, such as a DVSEC
>>>> capability, so that it can be implemented on a physical device.  As
>>>> noted in the commit log here, we'd also prefer not to bloat the kernel
>>>> with more device quirks.
>>>>
>>>> In an ideal world we might be able to jump start support of that DVSEC
>>>> option by emulating the DVSEC capability on top of the PASID capability
>>>> for PFs, but unfortunately the PASID capability is 8 bytes while the
>>>> DVSEC capability is at least 12 bytes, so we can't implement that
>>>> generically either.
>>>
>>> Yeah, that's a problem.
>>>
>>>>
>>>> I don't know there's any good solution here or whether there's actually
>>>> any value to the PASID capability on a PF, but do we need to consider
>>>> leaving a field+flag here to describe the offset for that scenario?
>>>
>>> Yes, I prefer to this way.
>>>
>>>> Would we then allow variant drivers to take advantage of it?  Does this
>>>> then turn into the quirk that we're trying to avoid in the kernel
>>>> rather than userspace and is that a problem?  Thanks,
>>>>
>>>
>>> We don't want to proactively pursue quirks in the kernel.
>>>
>>> But if a variant driver exists for other reasons, I don't see why it
>>> should be prohibited from deciding an offset to ease the
>>> userspace. 😊
>>
>> At that point we've turned the corner into an arbitrary policy decision
>> that I can't defend.  A "worthy" variant driver can implement something
>> through a side channel vfio API, but implementing that side channel
>> itself is not enough to justify a variant driver?  It doesn't make
>> sense.
>>
>> Further, if we have a variant driver, why do we need a side channel for
>> the purpose of describing available config space when we expect devices
>> themselves to eventually describe the same through a DVSEC capability?
>> The purpose of enabling variant drivers is to enhance the functionality
>> of the device.  Adding an emulated DVSEC capability seems like a valid
>> enhancement to justify a variant driver to me.
>>
>> So the more I think about it, it would be easy to add something here
>> that hints a location for an emulated PASID capability in the VMM, but
>> it would also be counterproductive to an end goal of having a DVSEC
>> capability that describes unused config space.  The very narrow scope
>> where that side-band channel would be useful is an unknown PF device
>> which doesn't implement a DVSEC capability and without intervention
>> simply behaves as it always has, without PASID support.
>>
>> A vendor desiring such support can a) implement DVSEC in the hardware,
>> b) implement a variant driver emulating a DVSEC capability, or c)
>> directly modify the VMM to tell it where to place the PASID capability.
>> I also don't think we should exclude the possibility that b) could turn
>> into a shared variant driver that knows about multiple devices and has
>> a table of free config space for each.  Option c) is only the last
>> resort if there's not already 12 bytes of contiguous, aligned free
>> space to place a DVSEC capability.  That seems unlikely.
> 
> or b) could be a table in vfio_pci_config.c i.e. kind of making vfio-pci
> as the shared variant driver.
> 
>>
>> At some point we need to define the format and use of this DVSEC.  Do
>> we allow (not require) one at every gap in config space that's at least
>> 12-bytes long and adjust the DVSEC Length to describe longer gaps, or do
> 
> Does PCI spec allows multiple same-type capabilities co-existing?

For DVSEC, it allows. Below is the sentence from PCIe spec.

A single PCI Express Function or RCRB is permitted to contain multiple 
DVSEC Capability structures


>> we use a single DVSEC to describe a table of ranges throughout extended
>> (maybe even conventional) config space?  The former seems easier,
> 
> this might be challenging as the table itself requires a contiguous
> large free block.
> 
>> especially if we expect a device has a large block of free space,
>> enough for multiple emulated capabilities and described by a single
>> DVSEC.  Thanks,
>>

this is a good point. The ATS and PRI capability do not exist in VF as
well. They need to be emulated.

> 
> yes that sounds simpler.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18  0:06           ` Tian, Kevin
@ 2024-04-18  9:03             ` Yi Liu
  2024-04-18 20:37               ` Alex Williamson
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-18  9:03 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Jason Gunthorpe
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/18 08:06, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Thursday, April 18, 2024 7:02 AM
>>
>> On Wed, 17 Apr 2024 09:20:51 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Sent: Wednesday, April 17, 2024 1:50 AM
>>>>>
>>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Sent: Friday, April 12, 2024 4:21 PM
>>>>>>>
>>>>>>> A userspace VMM is supposed to get the details of the device's
>> PASID
>>>>>>> capability
>>>>>>> and assemble a virtual PASID capability in a proper offset in the
>> virtual
>>>>> PCI
>>>>>>> configuration space. While it is still an open on how to get the
>> available
>>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap
>> chain.
>>>>> For
>>>>>>> now, there are two options to get the available offsets.[2]
>>>>>>>
>>>>>>> - Report the available offsets via ioctl. This requires device-specific
>> logic
>>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the
>>>>> device
>>>>>>>    provide the available offset by DVSEC.
>>>>>>> - Store the available offsets in a static table in userspace VMM.
>> VMM gets
>>>>> the
>>>>>>>    empty offsets from this table.
>>>>>>>
>>>>>>
>>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
>>>>>> VF just for the purpose of reporting a free range in the PCI config
>> space.
>>>>>>
>>>>>> It's easier to do that quirk in userspace.
>>>>>>
>>>>>> But I like Alex's original comment that at least for PF there is no
>> reason
>>>>>> to hide the offset. there could be a flag+field to communicate it. or
>>>>>> if there will be a new variant VF driver for other purposes e.g.
>> migration
>>>>>> it can certainly fill the field too.
>>>>>
>>>>> Yes, since this has been such a sticking point can we get a clean
>>>>> series that just enables it for PF and then come with a solution for
>>>>> VF?
>>>>>
>>>>
>>>> sure but we at least need to reach consensus on a minimal required
>>>> uapi covering both PF/VF to move forward so the user doesn't need
>>>> to touch different contracts for PF vs. VF.
>>>
>>> Do we? The situation where the VMM needs to wholly make a up a PASID
>>> capability seems completely new and seperate from just using an
>>> existing PASID capability as in the PF case.
>>
>> But we don't actually expose the PASID capability on the PF and as
>> argued in path 4/ we can't because it would break existing userspace.
> > Come back to this statement.
> 
> Does 'break' means that legacy Qemu will crash due to a guest write
> to the read-only PASID capability, or just a conceptually functional
> break i.e. non-faithful emulation due to writes being dropped?
> 
> If the latter it's probably not a bad idea to allow exposing the PASID
> capability on the PF as a sane guest shouldn't enable the PASID
> capability w/o seeing vIOMMU supporting PASID. And there is no
> status bit defined in the PASID capability to check back so even
> if an insane guest wants to blindly enable PASID it will naturally
> write and done. The only niche case is that the enable bits are
> defined as RW so ideally reading back those bits should get the
> latest written value. But probably this can be tolerated?
> 
> With that then should we consider exposing the PASID capability
> in PCI config space as the first option? For PF it's simple as how
> other caps are exposed. For VF a variant driver can also fake the
> PASID capability or emulate a DVSEC capability for unused space
> (to motivate the physical implementation so no variant driver is
> required in the future)

If kernel exposes pasid cap for PF same as other caps, and in the meantime
the variant driver chooses to emulate a DVSEC cap, then userspace follows
the below steps to expose pasid cap to VM.

1) Check if a pasid cap is already present in the virtual config space
    read from kernel. If no, but user wants pasid, then goto step 2).
2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
    pasid cap. If yes, goto step 3).
3) Userspace gets an available offset via reading the DVSEC cap.
4) Userspace assembles a pasid cap and inserts it to the vconfig space.

For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
This is a bit different from what we planned at the beginning. But sounds
doable if we want to pursue the staging direction.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18  7:02     ` Yi Liu
@ 2024-04-18 16:23       ` Alex Williamson
  2024-04-18 17:12         ` Jason Gunthorpe
  2024-04-19 13:40         ` Yi Liu
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-04-18 16:23 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Thu, 18 Apr 2024 15:02:46 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/17 00:03, Alex Williamson wrote:
> > On Fri, 12 Apr 2024 01:21:18 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> >> There is no helpers for user to check if a given ID is allocated or not,
> >> neither a helper to loop all the allocated IDs in an IDA and do something
> >> for cleanup. With the two needs, a helper to get the lowest allocated ID
> >> of a range can help to achieve it.
> >>
> >> Caller can check if a given ID is allocated or not by:
> >> 	int id = 200, rc;
> >>
> >> 	rc = ida_get_lowest(&ida, id, id);
> >> 	if (rc == id)
> >> 		//id 200 is used
> >> 	else
> >> 		//id 200 is not used
> >>
> >> Caller can iterate all allocated IDs by:
> >> 	int id = 0;
> >>
> >> 	while (!ida_is_empty(&pasid_ida)) {
> >> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
> >> 		if (id < 0)
> >> 			break;
> >> 		//anything to do with the allocated ID
> >> 		ida_free(pasid_ida, pasid);
> >> 	}
> >>
> >> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> >> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> ---
> >>   include/linux/idr.h |  1 +
> >>   lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
> >>   2 files changed, 68 insertions(+)
> >>
> >> diff --git a/include/linux/idr.h b/include/linux/idr.h
> >> index da5f5fa4a3a6..1dae71d4a75d 100644
> >> --- a/include/linux/idr.h
> >> +++ b/include/linux/idr.h
> >> @@ -257,6 +257,7 @@ struct ida {
> >>   int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
> >>   void ida_free(struct ida *, unsigned int id);
> >>   void ida_destroy(struct ida *ida);
> >> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
> >>   
> >>   /**
> >>    * ida_alloc() - Allocate an unused ID.
> >> diff --git a/lib/idr.c b/lib/idr.c
> >> index da36054c3ca0..03e461242fe2 100644
> >> --- a/lib/idr.c
> >> +++ b/lib/idr.c
> >> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
> >>   }
> >>   EXPORT_SYMBOL(ida_alloc_range);
> >>   
> >> +/**
> >> + * ida_get_lowest - Get the lowest used ID.
> >> + * @ida: IDA handle.
> >> + * @min: Lowest ID to get.
> >> + * @max: Highest ID to get.
> >> + *
> >> + * Get the lowest used ID between @min and @max, inclusive.  The returned
> >> + * ID will not exceed %INT_MAX, even if @max is larger.
> >> + *
> >> + * Context: Any context. Takes and releases the xa_lock.
> >> + * Return: The lowest used ID, or errno if no used ID is found.
> >> + */
> >> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
> >> +{
> >> +	unsigned long index = min / IDA_BITMAP_BITS;
> >> +	unsigned int offset = min % IDA_BITMAP_BITS;
> >> +	unsigned long *addr, size, bit;
> >> +	unsigned long flags;
> >> +	void *entry;
> >> +	int ret;
> >> +
> >> +	if (min >= INT_MAX)
> >> +		return -EINVAL;
> >> +	if (max >= INT_MAX)
> >> +		max = INT_MAX;
> >> +  
> > 
> > Could these be made consistent with the test in ida_alloc_range(), ie:
> > 
> > 	if ((int)min < 0)
> > 		return -EINVAL;
> > 	if ((int)max < 0)
> > 		max = INT_MAX;
> >   
> 
> sure.
> 
> >> +	xa_lock_irqsave(&ida->xa, flags);
> >> +
> >> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
> >> +	if (!entry) {
> >> +		ret = -ENOTTY;  
> > 
> > -ENOENT?  Same for all below too.  
> 
> I see.
> 
> >> +		goto err_unlock;
> >> +	}
> >> +
> >> +	if (index > min / IDA_BITMAP_BITS)
> >> +		offset = 0;
> >> +	if (index * IDA_BITMAP_BITS + offset > max) {
> >> +		ret = -ENOTTY;
> >> +		goto err_unlock;
> >> +	}
> >> +
> >> +	if (xa_is_value(entry)) {
> >> +		unsigned long tmp = xa_to_value(entry);
> >> +
> >> +		addr = &tmp;
> >> +		size = BITS_PER_XA_VALUE;
> >> +	} else {
> >> +		addr = ((struct ida_bitmap *)entry)->bitmap;
> >> +		size = IDA_BITMAP_BITS;
> >> +	}
> >> +
> >> +	bit = find_next_bit(addr, size, offset);
> >> +
> >> +	xa_unlock_irqrestore(&ida->xa, flags);
> >> +
> >> +	if (bit == size ||
> >> +	    index * IDA_BITMAP_BITS + bit > max)
> >> +		return -ENOTTY;
> >> +
> >> +	return index * IDA_BITMAP_BITS + bit;
> >> +
> >> +err_unlock:
> >> +	xa_unlock_irqrestore(&ida->xa, flags);
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL(ida_get_lowest);  
> > 
> > The API is a bit awkward to me, I wonder if it might be helped with
> > some renaming and wrappers...
> > 
> > int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);  
> 
> ok.
> 
> > bool ida_exists(struct ida *ida, unsigned int id)
> > {
> > 	return ida_find_first_range(ida, id, id) == id;
> > }  
> 
> this does helps in next patch.
> 
> > 
> > int ida_find_first(struct ida *ida)
> > {
> > 	return ida_find_first_range(ida, 0, ~0);
> > }
> >  
> 
> perhaps it can be added in future. This series has two usages. One is to
> check if a given ID is allocated. This can be covered by your ida_exists().
> Another usage is to loop each IDs, do detach and free. This can still use
> the ida_find_first_range() like the example in the commit message. The
> first loop starts from 0, and next would start from the last found ID.
> This may be more efficient than starts from 0 in every loop.
> 
> 
> > _min and _max variations of the latter would align with existing
> > ida_alloc variants, but maybe no need to add them preemptively.  
> 
> yes.
> 
> > Possibly an ida_for_each() could be useful in the use case of
> > disassociating each id, but not required for the brute force iterative
> > method.  Thanks,  
> 
> yep. maybe we can start with the below code, no need for ida_for_each()
> today.
> 
> 
>   	int id = 0;
> 
>   	while (!ida_is_empty(&pasid_ida)) {
>   		id = ida_find_first_range(pasid_ida, id, INT_MAX);

You've actually already justified the _min function here:

static inline int ida_find_first_min(struct ida *ida, unsigned int min)
{
	return ida_find_first_range(ida, min, ~0);
}

Thanks,
Alex

>   		if (unlikely(WARN_ON(id < 0))
> 			break;
>   		iommufd_device_pasid_detach();
>   		ida_free(pasid_ida, pasid);
>   	}
> 
> >   
> >> +
> >>   /**
> >>    * ida_free() - Release an allocated ID.
> >>    * @ida: IDA handle.  
> >   
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-18  0:21         ` Tian, Kevin
  2024-04-18  8:23           ` Yi Liu
@ 2024-04-18 16:34           ` Alex Williamson
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Williamson @ 2024-04-18 16:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Thu, 18 Apr 2024 00:21:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 18, 2024 4:26 AM
> > 
> > On Wed, 17 Apr 2024 07:09:52 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, April 17, 2024 1:57 AM
> > > >
> > > > On Fri, 12 Apr 2024 01:21:21 -0700
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > + */
> > > > > +struct vfio_device_feature_pasid {
> > > > > +	__u16 capabilities;
> > > > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > > > +	__u8 width;
> > > > > +	__u8 __reserved;
> > > > > +};  
> > > >
> > > > Building on Kevin's comment on the cover letter, if we could describe
> > > > an offset for emulating a PASID capability, this seems like the place
> > > > we'd do it.  I think we're not doing that because we'd like an in-band
> > > > mechanism for a device to report unused config space, such as a DVSEC
> > > > capability, so that it can be implemented on a physical device.  As
> > > > noted in the commit log here, we'd also prefer not to bloat the kernel
> > > > with more device quirks.
> > > >
> > > > In an ideal world we might be able to jump start support of that DVSEC
> > > > option by emulating the DVSEC capability on top of the PASID capability
> > > > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > > > DVSEC capability is at least 12 bytes, so we can't implement that
> > > > generically either.  
> > >
> > > Yeah, that's a problem.
> > >  
> > > >
> > > > I don't know there's any good solution here or whether there's actually
> > > > any value to the PASID capability on a PF, but do we need to consider
> > > > leaving a field+flag here to describe the offset for that scenario?  
> > >
> > > Yes, I prefer to this way.
> > >  
> > > > Would we then allow variant drivers to take advantage of it?  Does this
> > > > then turn into the quirk that we're trying to avoid in the kernel
> > > > rather than userspace and is that a problem?  Thanks,
> > > >  
> > >
> > > We don't want to proactively pursue quirks in the kernel.
> > >
> > > But if a variant driver exists for other reasons, I don't see why it
> > > should be prohibited from deciding an offset to ease the
> > > userspace. 😊  
> > 
> > At that point we've turned the corner into an arbitrary policy decision
> > that I can't defend.  A "worthy" variant driver can implement something
> > through a side channel vfio API, but implementing that side channel
> > itself is not enough to justify a variant driver?  It doesn't make
> > sense.
> > 
> > Further, if we have a variant driver, why do we need a side channel for
> > the purpose of describing available config space when we expect devices
> > themselves to eventually describe the same through a DVSEC capability?
> > The purpose of enabling variant drivers is to enhance the functionality
> > of the device.  Adding an emulated DVSEC capability seems like a valid
> > enhancement to justify a variant driver to me.
> > 
> > So the more I think about it, it would be easy to add something here
> > that hints a location for an emulated PASID capability in the VMM, but
> > it would also be counterproductive to an end goal of having a DVSEC
> > capability that describes unused config space.  The very narrow scope
> > where that side-band channel would be useful is an unknown PF device
> > which doesn't implement a DVSEC capability and without intervention
> > simply behaves as it always has, without PASID support.
> > 
> > A vendor desiring such support can a) implement DVSEC in the hardware,
> > b) implement a variant driver emulating a DVSEC capability, or c)
> > directly modify the VMM to tell it where to place the PASID capability.
> > I also don't think we should exclude the possibility that b) could turn
> > into a shared variant driver that knows about multiple devices and has
> > a table of free config space for each.  Option c) is only the last
> > resort if there's not already 12 bytes of contiguous, aligned free
> > space to place a DVSEC capability.  That seems unlikely.  
> 
> or b) could be a table in vfio_pci_config.c i.e. kind of making vfio-pci
> as the shared variant driver.

We've kind of made the statement that variant drivers should be used
for any device specific quirks rather than further extending vfio-pci.
That's not to say that there couldn't be a shared variant driver that
binds to devices across vendors with device specific knowledge to
insert a DVSEC capability.

> > At some point we need to define the format and use of this DVSEC.  Do
> > we allow (not require) one at every gap in config space that's at least
> > 12-bytes long and adjust the DVSEC Length to describe longer gaps, or do  
> 
> Does PCI spec allows multiple same-type capabilities co-existing?

As Yi notes, yes.

> > we use a single DVSEC to describe a table of ranges throughout extended
> > (maybe even conventional) config space?  The former seems easier,  
> 
> this might be challenging as the table itself requires a contiguous
> large free block.

Yes, but DVSEC is an extended capability, so we have a fair bit of
address space to work with and the table could always collapse to zero
entries to indicate only the DVSEC capability itself is available, so
it's really no different in minimum described size to the other
approach.  Thanks,

Alex

> > especially if we expect a device has a large block of free space,
> > enough for multiple emulated capabilities and described by a single
> > DVSEC.  Thanks,
> >   
> 
> yes that sounds simpler.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18 16:23       ` Alex Williamson
@ 2024-04-18 17:12         ` Jason Gunthorpe
  2024-04-19 13:43           ` Yi Liu
  2024-04-19 13:40         ` Yi Liu
  1 sibling, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-18 17:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:
> > yep. maybe we can start with the below code, no need for ida_for_each()
> > today.
> > 
> > 
> >   	int id = 0;
> > 
> >   	while (!ida_is_empty(&pasid_ida)) {
> >   		id = ida_find_first_range(pasid_ida, id, INT_MAX);
> 
> You've actually already justified the _min function here:
> 
> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> {
> 	return ida_find_first_range(ida, min, ~0);
> }

It should also always start from 0..

Ideally written more like:

while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
  ida_remove(id);
}

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18  9:03             ` Yi Liu
@ 2024-04-18 20:37               ` Alex Williamson
  2024-04-19  5:52                 ` Tian, Kevin
  2024-04-19 13:59                 ` Jason Gunthorpe
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-04-18 20:37 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Jason Gunthorpe, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On Thu, 18 Apr 2024 17:03:15 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/18 08:06, Tian, Kevin wrote:
> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> Sent: Thursday, April 18, 2024 7:02 AM
> >>
> >> On Wed, 17 Apr 2024 09:20:51 -0300
> >> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>  
> >>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:  
> >>>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>>> Sent: Wednesday, April 17, 2024 1:50 AM
> >>>>>
> >>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> >>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>>>>> Sent: Friday, April 12, 2024 4:21 PM
> >>>>>>>
> >>>>>>> A userspace VMM is supposed to get the details of the device's  
> >> PASID  
> >>>>>>> capability
> >>>>>>> and assemble a virtual PASID capability in a proper offset in the  
> >> virtual  
> >>>>> PCI  
> >>>>>>> configuration space. While it is still an open on how to get the  
> >> available  
> >>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap  
> >> chain.  
> >>>>> For  
> >>>>>>> now, there are two options to get the available offsets.[2]
> >>>>>>>
> >>>>>>> - Report the available offsets via ioctl. This requires device-specific  
> >> logic  
> >>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> >>>>> device  
> >>>>>>>    provide the available offset by DVSEC.
> >>>>>>> - Store the available offsets in a static table in userspace VMM.  
> >> VMM gets  
> >>>>> the  
> >>>>>>>    empty offsets from this table.
> >>>>>>>  
> >>>>>>
> >>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
> >>>>>> VF just for the purpose of reporting a free range in the PCI config  
> >> space.  
> >>>>>>
> >>>>>> It's easier to do that quirk in userspace.
> >>>>>>
> >>>>>> But I like Alex's original comment that at least for PF there is no  
> >> reason  
> >>>>>> to hide the offset. there could be a flag+field to communicate it. or
> >>>>>> if there will be a new variant VF driver for other purposes e.g.  
> >> migration  
> >>>>>> it can certainly fill the field too.  
> >>>>>
> >>>>> Yes, since this has been such a sticking point can we get a clean
> >>>>> series that just enables it for PF and then come with a solution for
> >>>>> VF?
> >>>>>  
> >>>>
> >>>> sure but we at least need to reach consensus on a minimal required
> >>>> uapi covering both PF/VF to move forward so the user doesn't need
> >>>> to touch different contracts for PF vs. VF.  
> >>>
> >>> Do we? The situation where the VMM needs to wholly make a up a PASID
> >>> capability seems completely new and seperate from just using an
> >>> existing PASID capability as in the PF case.  
> >>
> >> But we don't actually expose the PASID capability on the PF and as
> >> argued in path 4/ we can't because it would break existing userspace.
> > > Come back to this statement.  
> > 
> > Does 'break' means that legacy Qemu will crash due to a guest write
> > to the read-only PASID capability, or just a conceptually functional
> > break i.e. non-faithful emulation due to writes being dropped?

I expect more the latter.

> > If the latter it's probably not a bad idea to allow exposing the PASID
> > capability on the PF as a sane guest shouldn't enable the PASID
> > capability w/o seeing vIOMMU supporting PASID. And there is no
> > status bit defined in the PASID capability to check back so even
> > if an insane guest wants to blindly enable PASID it will naturally
> > write and done. The only niche case is that the enable bits are
> > defined as RW so ideally reading back those bits should get the
> > latest written value. But probably this can be tolerated?

Some degree of inconsistency is likely tolerated, the guest is unlikely
to check that a RW bit was set or cleared.  How would we virtualize the
control registers for a VF and are they similarly virtualized for a PF
or would we allow the guest to manipulate the physical PASID control
registers?

> > With that then should we consider exposing the PASID capability
> > in PCI config space as the first option? For PF it's simple as how
> > other caps are exposed. For VF a variant driver can also fake the
> > PASID capability or emulate a DVSEC capability for unused space
> > (to motivate the physical implementation so no variant driver is
> > required in the future)  
> 
> If kernel exposes pasid cap for PF same as other caps, and in the meantime
> the variant driver chooses to emulate a DVSEC cap, then userspace follows
> the below steps to expose pasid cap to VM.

If we have a variant driver, why wouldn't it expose an emulated PASID
capability rather than a DVSEC if we're choosing to expose PASID for
PFs?

> 
> 1) Check if a pasid cap is already present in the virtual config space
>     read from kernel. If no, but user wants pasid, then goto step 2).
> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>     pasid cap. If yes, goto step 3).

Why do we need the vfio feature interface if a physical or virtual PASID
capability on the device exposes the same info?

> 3) Userspace gets an available offset via reading the DVSEC cap.

What's the scenario where we'd have a VF wanting to expose PASID
support which doesn't also have a variant driver that could implement a
virtual PASID?

> 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> 
> For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> This is a bit different from what we planned at the beginning. But sounds
> doable if we want to pursue the staging direction.

Seems like if we decide that we can just expose the PASID capability
for a PF then we should just have any VF variant drivers also implement
a virtual PASID capability.  In this case DVSEC would only be used to
provide information for a purely userspace emulation of PASID (in which
case it also wouldn't necessarily need the vfio feature because it
might implicitly know the PASID capabilities of the device).  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18 20:37               ` Alex Williamson
@ 2024-04-19  5:52                 ` Tian, Kevin
  2024-04-19 16:35                   ` Alex Williamson
  2024-04-19 13:59                 ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-19  5:52 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: Jason Gunthorpe, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 19, 2024 4:38 AM
> 
> On Thu, 18 Apr 2024 17:03:15 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > On 2024/4/18 08:06, Tian, Kevin wrote:
> > >> From: Alex Williamson <alex.williamson@redhat.com>
> > >> Sent: Thursday, April 18, 2024 7:02 AM
> > >>
> > >> But we don't actually expose the PASID capability on the PF and as
> > >> argued in path 4/ we can't because it would break existing userspace.
> > > > Come back to this statement.
> > >
> > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > to the read-only PASID capability, or just a conceptually functional
> > > break i.e. non-faithful emulation due to writes being dropped?
> 
> I expect more the latter.
> 
> > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > capability on the PF as a sane guest shouldn't enable the PASID
> > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > status bit defined in the PASID capability to check back so even
> > > if an insane guest wants to blindly enable PASID it will naturally
> > > write and done. The only niche case is that the enable bits are
> > > defined as RW so ideally reading back those bits should get the
> > > latest written value. But probably this can be tolerated?
> 
> Some degree of inconsistency is likely tolerated, the guest is unlikely
> to check that a RW bit was set or cleared.  How would we virtualize the
> control registers for a VF and are they similarly virtualized for a PF
> or would we allow the guest to manipulate the physical PASID control
> registers?

it's shared so the guest shouldn't be allowed to touch the physical
register.

Even for PF this is virtualized as the physical control is toggled by
the iommu driver today. We discussed before whether there is a
value moving the control to device driver but the conclusion is no.

> 
> > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> >
> > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > This is a bit different from what we planned at the beginning. But sounds
> > doable if we want to pursue the staging direction.
> 
> Seems like if we decide that we can just expose the PASID capability
> for a PF then we should just have any VF variant drivers also implement
> a virtual PASID capability.  In this case DVSEC would only be used to

I'm leaning toward this direction now.

> provide information for a purely userspace emulation of PASID (in which
> case it also wouldn't necessarily need the vfio feature because it
> might implicitly know the PASID capabilities of the device).  Thanks,
> 

that's a good point. Then no new contract is required.

and allowing variant driver to implement a virtual PASID capability
seems also make a room for making a shared variant driver to host
a table of virtual capabilities (both offset and content) for VFs, just
as discussed in patch4 having a shared driver to host a table for DVSEC?

Along this route probably most vendors will just extend the table in
the shared driver, leading to decreased value on DVSEC and question
on its necessity...

then it's back to the quirk-in-kernel approach... but if simple enough
probably not a bad idea to pursue? 😊

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17 23:02         ` Alex Williamson
  2024-04-18  0:06           ` Tian, Kevin
@ 2024-04-19 13:34           ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 13:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, Apr 17, 2024 at 05:02:16PM -0600, Alex Williamson wrote:

> > > sure but we at least need to reach consensus on a minimal required
> > > uapi covering both PF/VF to move forward so the user doesn't need
> > > to touch different contracts for PF vs. VF.  
> > 
> > Do we? The situation where the VMM needs to wholly make a up a PASID
> > capability seems completely new and seperate from just using an
> > existing PASID capability as in the PF case.
> 
> But we don't actually expose the PASID capability on the PF and as
> argued in path 4/ we can't because it would break existing
> userspace.

Are we sure about that argument? Exposing the PF's PASID cap to qemu
shouldn't break qemu at all - it will just pass it over into a vPCI
function?

Ultimately the guest will observe a vPCI device with a PASID cap. This
is not really so different from plugging in a new PCI device with the
PASID cap into an old HW system that doesn't support PASID.

So does a guest break? I'd expect no - the viommu's created by qemu
should not advertise PASID support already. So the guest can't use
PASID - just like an old HW system.

Is there a known concrete thing that breaks there?

> > If it needs to make another system call or otherwise to do that then
> > that seems fine to do incrementally??
> 
> With PASID currently hidden, VF and PF support really seem too similar
> not to handle them both at the same time.  What's missing is a
> mechanism to describe unused config space where userspace can implement
> an emulated PASID capability. 

Yes, sure the unused config space is a great idea. I thought we were
talking about doing the fixup in userspace, but a kernel solution is
good too.

But also if we are set on the dvsec, in kernel or user, then we can
move ahead with the core PASID enablement immediately?

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18 16:23       ` Alex Williamson
  2024-04-18 17:12         ` Jason Gunthorpe
@ 2024-04-19 13:40         ` Yi Liu
  1 sibling, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-19 13:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/19 00:23, Alex Williamson wrote:
> On Thu, 18 Apr 2024 15:02:46 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/4/17 00:03, Alex Williamson wrote:
>>> On Fri, 12 Apr 2024 01:21:18 -0700
>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>    
>>>> There is no helpers for user to check if a given ID is allocated or not,
>>>> neither a helper to loop all the allocated IDs in an IDA and do something
>>>> for cleanup. With the two needs, a helper to get the lowest allocated ID
>>>> of a range can help to achieve it.
>>>>
>>>> Caller can check if a given ID is allocated or not by:
>>>> 	int id = 200, rc;
>>>>
>>>> 	rc = ida_get_lowest(&ida, id, id);
>>>> 	if (rc == id)
>>>> 		//id 200 is used
>>>> 	else
>>>> 		//id 200 is not used
>>>>
>>>> Caller can iterate all allocated IDs by:
>>>> 	int id = 0;
>>>>
>>>> 	while (!ida_is_empty(&pasid_ida)) {
>>>> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
>>>> 		if (id < 0)
>>>> 			break;
>>>> 		//anything to do with the allocated ID
>>>> 		ida_free(pasid_ida, pasid);
>>>> 	}
>>>>
>>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> ---
>>>>    include/linux/idr.h |  1 +
>>>>    lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 68 insertions(+)
>>>>
>>>> diff --git a/include/linux/idr.h b/include/linux/idr.h
>>>> index da5f5fa4a3a6..1dae71d4a75d 100644
>>>> --- a/include/linux/idr.h
>>>> +++ b/include/linux/idr.h
>>>> @@ -257,6 +257,7 @@ struct ida {
>>>>    int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
>>>>    void ida_free(struct ida *, unsigned int id);
>>>>    void ida_destroy(struct ida *ida);
>>>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
>>>>    
>>>>    /**
>>>>     * ida_alloc() - Allocate an unused ID.
>>>> diff --git a/lib/idr.c b/lib/idr.c
>>>> index da36054c3ca0..03e461242fe2 100644
>>>> --- a/lib/idr.c
>>>> +++ b/lib/idr.c
>>>> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
>>>>    }
>>>>    EXPORT_SYMBOL(ida_alloc_range);
>>>>    
>>>> +/**
>>>> + * ida_get_lowest - Get the lowest used ID.
>>>> + * @ida: IDA handle.
>>>> + * @min: Lowest ID to get.
>>>> + * @max: Highest ID to get.
>>>> + *
>>>> + * Get the lowest used ID between @min and @max, inclusive.  The returned
>>>> + * ID will not exceed %INT_MAX, even if @max is larger.
>>>> + *
>>>> + * Context: Any context. Takes and releases the xa_lock.
>>>> + * Return: The lowest used ID, or errno if no used ID is found.
>>>> + */
>>>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
>>>> +{
>>>> +	unsigned long index = min / IDA_BITMAP_BITS;
>>>> +	unsigned int offset = min % IDA_BITMAP_BITS;
>>>> +	unsigned long *addr, size, bit;
>>>> +	unsigned long flags;
>>>> +	void *entry;
>>>> +	int ret;
>>>> +
>>>> +	if (min >= INT_MAX)
>>>> +		return -EINVAL;
>>>> +	if (max >= INT_MAX)
>>>> +		max = INT_MAX;
>>>> +
>>>
>>> Could these be made consistent with the test in ida_alloc_range(), ie:
>>>
>>> 	if ((int)min < 0)
>>> 		return -EINVAL;
>>> 	if ((int)max < 0)
>>> 		max = INT_MAX;
>>>    
>>
>> sure.
>>
>>>> +	xa_lock_irqsave(&ida->xa, flags);
>>>> +
>>>> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
>>>> +	if (!entry) {
>>>> +		ret = -ENOTTY;
>>>
>>> -ENOENT?  Same for all below too.
>>
>> I see.
>>
>>>> +		goto err_unlock;
>>>> +	}
>>>> +
>>>> +	if (index > min / IDA_BITMAP_BITS)
>>>> +		offset = 0;
>>>> +	if (index * IDA_BITMAP_BITS + offset > max) {
>>>> +		ret = -ENOTTY;
>>>> +		goto err_unlock;
>>>> +	}
>>>> +
>>>> +	if (xa_is_value(entry)) {
>>>> +		unsigned long tmp = xa_to_value(entry);
>>>> +
>>>> +		addr = &tmp;
>>>> +		size = BITS_PER_XA_VALUE;
>>>> +	} else {
>>>> +		addr = ((struct ida_bitmap *)entry)->bitmap;
>>>> +		size = IDA_BITMAP_BITS;
>>>> +	}
>>>> +
>>>> +	bit = find_next_bit(addr, size, offset);
>>>> +
>>>> +	xa_unlock_irqrestore(&ida->xa, flags);
>>>> +
>>>> +	if (bit == size ||
>>>> +	    index * IDA_BITMAP_BITS + bit > max)
>>>> +		return -ENOTTY;
>>>> +
>>>> +	return index * IDA_BITMAP_BITS + bit;
>>>> +
>>>> +err_unlock:
>>>> +	xa_unlock_irqrestore(&ida->xa, flags);
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(ida_get_lowest);
>>>
>>> The API is a bit awkward to me, I wonder if it might be helped with
>>> some renaming and wrappers...
>>>
>>> int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);
>>
>> ok.
>>
>>> bool ida_exists(struct ida *ida, unsigned int id)
>>> {
>>> 	return ida_find_first_range(ida, id, id) == id;
>>> }
>>
>> this does helps in next patch.
>>
>>>
>>> int ida_find_first(struct ida *ida)
>>> {
>>> 	return ida_find_first_range(ida, 0, ~0);
>>> }
>>>   
>>
>> perhaps it can be added in future. This series has two usages. One is to
>> check if a given ID is allocated. This can be covered by your ida_exists().
>> Another usage is to loop each IDs, do detach and free. This can still use
>> the ida_find_first_range() like the example in the commit message. The
>> first loop starts from 0, and next would start from the last found ID.
>> This may be more efficient than starts from 0 in every loop.
>>
>>
>>> _min and _max variations of the latter would align with existing
>>> ida_alloc variants, but maybe no need to add them preemptively.
>>
>> yes.
>>
>>> Possibly an ida_for_each() could be useful in the use case of
>>> disassociating each id, but not required for the brute force iterative
>>> method.  Thanks,
>>
>> yep. maybe we can start with the below code, no need for ida_for_each()
>> today.
>>
>>
>>    	int id = 0;
>>
>>    	while (!ida_is_empty(&pasid_ida)) {
>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);
> 
> You've actually already justified the _min function here:
> 
> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> {
> 	return ida_find_first_range(ida, min, ~0);
> }

aha, I see. :)

> 
>>    		if (unlikely(WARN_ON(id < 0))
>> 			break;
>>    		iommufd_device_pasid_detach();
>>    		ida_free(pasid_ida, pasid);
>>    	}
>>
>>>    
>>>> +
>>>>    /**
>>>>     * ida_free() - Release an allocated ID.
>>>>     * @ida: IDA handle.
>>>    
>>
> 
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18 17:12         ` Jason Gunthorpe
@ 2024-04-19 13:43           ` Yi Liu
  2024-04-19 13:55             ` Alex Williamson
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-19 13:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/19 01:12, Jason Gunthorpe wrote:
> On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:
>>> yep. maybe we can start with the below code, no need for ida_for_each()
>>> today.
>>>
>>>
>>>    	int id = 0;
>>>
>>>    	while (!ida_is_empty(&pasid_ida)) {
>>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);
>>
>> You've actually already justified the _min function here:
>>
>> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
>> {
>> 	return ida_find_first_range(ida, min, ~0);
>> }
> 
> It should also always start from 0..

any special reason to always start from 0? Here we want to loop all the
IDs, and remove them. In this usage, it should be more efficient if we
start from the last found ID.

> Ideally written more like:
> 
> while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
>    ida_remove(id);
> }

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-19 13:43           ` Yi Liu
@ 2024-04-19 13:55             ` Alex Williamson
  2024-04-19 14:00               ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-19 13:55 UTC (permalink / raw)
  To: Yi Liu
  Cc: Jason Gunthorpe, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan, Matthew Wilcox

On Fri, 19 Apr 2024 21:43:17 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/19 01:12, Jason Gunthorpe wrote:
> > On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:  
> >>> yep. maybe we can start with the below code, no need for ida_for_each()
> >>> today.
> >>>
> >>>
> >>>    	int id = 0;
> >>>
> >>>    	while (!ida_is_empty(&pasid_ida)) {
> >>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);  
> >>
> >> You've actually already justified the _min function here:
> >>
> >> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> >> {
> >> 	return ida_find_first_range(ida, min, ~0);
> >> }  
> > 
> > It should also always start from 0..  
> 
> any special reason to always start from 0? Here we want to loop all the
> IDs, and remove them. In this usage, it should be more efficient if we
> start from the last found ID.

In the above version, there's a possibility of an infinite loop, in the
below there's not.  I don't think the infinite loop is actually
reachable, but given the xarray backend to ida I'm not sure you're
gaining much to restart after the previously found id either.  Thanks,

Alex

> > Ideally written more like:
> > 
> > while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
> >    ida_remove(id);
> > }  
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18 20:37               ` Alex Williamson
  2024-04-19  5:52                 ` Tian, Kevin
@ 2024-04-19 13:59                 ` Jason Gunthorpe
  2024-04-23  7:58                   ` Yi Liu
  1 sibling, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 13:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, Tian, Kevin, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:

> Some degree of inconsistency is likely tolerated, the guest is unlikely
> to check that a RW bit was set or cleared.  How would we virtualize the
> control registers for a VF and are they similarly virtualized for a PF
> or would we allow the guest to manipulate the physical PASID control
> registers?

No, the OS owns the physical PASID control. If the platform IOMMU
knows how to parse PASID then PASID support is turned on and left on
at boot time.

There should be no guest visible difference to not supporting global
PASID disable, and we can't even implement it for VFs anyhow.

Same sort of argument for ATS/etc

> > If kernel exposes pasid cap for PF same as other caps, and in the meantime
> > the variant driver chooses to emulate a DVSEC cap, then userspace follows
> > the below steps to expose pasid cap to VM.
> 
> If we have a variant driver, why wouldn't it expose an emulated PASID
> capability rather than a DVSEC if we're choosing to expose PASID for
> PFs?

Indeed, also an option. Supplying the DVSEC is probably simpler and
addresses other synthesized capability blocks in future. VMM is a
better place to build various synthetic blocks in general, IMHO.

New VMM's could parse the PF PASID cap and add it to its list of "free
space"

We may also be overdoing it here..

Maybe if the VMM wants to enable PASID we should flip the logic and
the VMM should assume that unused config space is safe to use. Only
devices that violate that rule need to join an ID list and provide a
DVSEC/free space list/etc.

I'm guessing that list will be pretty small and hopefully will not
grow. It is easy and better for future devices to wrap their hidden
registers in a private DVSEC.

Then I'd suggest just writing the special list in a text file and
leaving it in the VMM side.. Users can adjust the text file right away
if they have old and troublesome devices and all VMMs can share it.

> > 1) Check if a pasid cap is already present in the virtual config space
> >     read from kernel. If no, but user wants pasid, then goto step 2).
> > 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
> >     pasid cap. If yes, goto step 3).
> 
> Why do we need the vfio feature interface if a physical or virtual PASID
> capability on the device exposes the same info?

Still need to check if the platform, os, iommu, etc are all OK with
enabling PASID support before the viommu advertises it.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-19 13:55             ` Alex Williamson
@ 2024-04-19 14:00               ` Jason Gunthorpe
  2024-04-23  7:19                 ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 14:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Fri, Apr 19, 2024 at 07:55:04AM -0600, Alex Williamson wrote:
> On Fri, 19 Apr 2024 21:43:17 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > On 2024/4/19 01:12, Jason Gunthorpe wrote:
> > > On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:  
> > >>> yep. maybe we can start with the below code, no need for ida_for_each()
> > >>> today.
> > >>>
> > >>>
> > >>>    	int id = 0;
> > >>>
> > >>>    	while (!ida_is_empty(&pasid_ida)) {
> > >>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);  
> > >>
> > >> You've actually already justified the _min function here:
> > >>
> > >> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> > >> {
> > >> 	return ida_find_first_range(ida, min, ~0);
> > >> }  
> > > 
> > > It should also always start from 0..  
> > 
> > any special reason to always start from 0? Here we want to loop all the
> > IDs, and remove them. In this usage, it should be more efficient if we
> > start from the last found ID.
> 
> In the above version, there's a possibility of an infinite loop, in the
> below there's not.  I don't think the infinite loop is actually
> reachable, but given the xarray backend to ida I'm not sure you're
> gaining much to restart after the previously found id either.  Thanks,

Right, there is no performance win on xarray and it only risks an
infinite loop compared to:

> > > while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
> > >    ida_remove(id);
> > > }  

Which does not by construction

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-19  5:52                 ` Tian, Kevin
@ 2024-04-19 16:35                   ` Alex Williamson
  2024-04-23  7:43                     ` Tian, Kevin
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-19 16:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Jason Gunthorpe, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Fri, 19 Apr 2024 05:52:01 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 19, 2024 4:38 AM
> > 
> > On Thu, 18 Apr 2024 17:03:15 +0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > On 2024/4/18 08:06, Tian, Kevin wrote:  
> > > >> From: Alex Williamson <alex.williamson@redhat.com>
> > > >> Sent: Thursday, April 18, 2024 7:02 AM
> > > >>
> > > >> But we don't actually expose the PASID capability on the PF and as
> > > >> argued in path 4/ we can't because it would break existing userspace.
> > > > > Come back to this statement.  
> > > >
> > > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > > to the read-only PASID capability, or just a conceptually functional
> > > > break i.e. non-faithful emulation due to writes being dropped?  
> > 
> > I expect more the latter.
> >   
> > > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > > capability on the PF as a sane guest shouldn't enable the PASID
> > > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > > status bit defined in the PASID capability to check back so even
> > > > if an insane guest wants to blindly enable PASID it will naturally
> > > > write and done. The only niche case is that the enable bits are
> > > > defined as RW so ideally reading back those bits should get the
> > > > latest written value. But probably this can be tolerated?  
> > 
> > Some degree of inconsistency is likely tolerated, the guest is unlikely
> > to check that a RW bit was set or cleared.  How would we virtualize the
> > control registers for a VF and are they similarly virtualized for a PF
> > or would we allow the guest to manipulate the physical PASID control
> > registers?  
> 
> it's shared so the guest shouldn't be allowed to touch the physical
> register.
> 
> Even for PF this is virtualized as the physical control is toggled by
> the iommu driver today. We discussed before whether there is a
> value moving the control to device driver but the conclusion is no.

So in both cases we virtualize the PASID bits in the vfio variant
driver in order to maintain spec compliant behavior of the register
(ie. the control bits are RW with no underlying hardware effect and
capability bits only reflect the features enabled by the host in the
control register)?

> > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> > >
> > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > > This is a bit different from what we planned at the beginning. But sounds
> > > doable if we want to pursue the staging direction.  
> > 
> > Seems like if we decide that we can just expose the PASID capability
> > for a PF then we should just have any VF variant drivers also implement
> > a virtual PASID capability.  In this case DVSEC would only be used to  
> 
> I'm leaning toward this direction now.
> 
> > provide information for a purely userspace emulation of PASID (in which
> > case it also wouldn't necessarily need the vfio feature because it
> > might implicitly know the PASID capabilities of the device).  Thanks,
> >   
> 
> that's a good point. Then no new contract is required.
> 
> and allowing variant driver to implement a virtual PASID capability
> seems also make a room for making a shared variant driver to host
> a table of virtual capabilities (both offset and content) for VFs, just
> as discussed in patch4 having a shared driver to host a table for DVSEC?

Yes, vfio-pci-core would support virtualizing the PF PASID capability
mapped 1:1 at the physical PASID location.  We should architect that
support to be easily reused for a driver provided offset for the VF use
case and then we'd need to decide if a lookup table to associate an
offset to a VF vendor:device warrants a variant driver (which could be
shared by multiple devices) or if we'd accept that into vfio-pci-core.

> Along this route probably most vendors will just extend the table in
> the shared driver, leading to decreased value on DVSEC and question
> on its necessity...
> 
> then it's back to the quirk-in-kernel approach... but if simple enough
> probably not a bad idea to pursue? 😊

A DVSEC to express unused config space could still support a generic
vfio-pci-core or variant driver implementation of PASID virtualization.
The table lookup would provide a device-specific quirk to a base
implementation of carving it from DVSEC reported free space.

The question of whether it should be in kernel or userspace is
difficult.  There are certainly other capabilities where vfio-pci
exposes RW registers as read-only and we rely on the userspace VMM to
emulate them.  We could consider this one of those cases so long as the
change of exposing PASID as a read-only capability is tolerated for old
QEMU, new kernel.

Then come VFs.  AFAIK, it would not be possible for an unprivileged
QEMU to inspect the PASID state for the PF.  Therefore I think vfio
needs to provide that information either in-band (ie. emulated PASID) or
out-of-band (device feature).  At that point, there's some kernel code
regardless, which leans me towards virtualizing in the kernel.  I'd
welcome a complete, coherent proposal that could be done in userspace
though.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-19 14:00               ` Jason Gunthorpe
@ 2024-04-23  7:19                 ` Yi Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-23  7:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/19 22:00, Jason Gunthorpe wrote:
> On Fri, Apr 19, 2024 at 07:55:04AM -0600, Alex Williamson wrote:
>> On Fri, 19 Apr 2024 21:43:17 +0800
>> Yi Liu <yi.l.liu@intel.com> wrote:
>>
>>> On 2024/4/19 01:12, Jason Gunthorpe wrote:
>>>> On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:
>>>>>> yep. maybe we can start with the below code, no need for ida_for_each()
>>>>>> today.
>>>>>>
>>>>>>
>>>>>>     	int id = 0;
>>>>>>
>>>>>>     	while (!ida_is_empty(&pasid_ida)) {
>>>>>>     		id = ida_find_first_range(pasid_ida, id, INT_MAX);
>>>>>
>>>>> You've actually already justified the _min function here:
>>>>>
>>>>> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
>>>>> {
>>>>> 	return ida_find_first_range(ida, min, ~0);
>>>>> }
>>>>
>>>> It should also always start from 0..
>>>
>>> any special reason to always start from 0? Here we want to loop all the
>>> IDs, and remove them. In this usage, it should be more efficient if we
>>> start from the last found ID.
>>
>> In the above version, there's a possibility of an infinite loop, in the
>> below there's not.  I don't think the infinite loop is actually
>> reachable, but given the xarray backend to ida I'm not sure you're
>> gaining much to restart after the previously found id either.  Thanks,
> 
> Right, there is no performance win on xarray and it only risks an
> infinite loop compared to:
> 
>>>> while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
>>>>     ida_remove(id);
>>>> }
> 
> Which does not by construction

thanks, got you two. :) Let's go with below. < 0 should mean
no ID found.

while ((id = ida_find_first(pasid_ida)) < 0) {
     ida_free(id);
}

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-19 16:35                   ` Alex Williamson
@ 2024-04-23  7:43                     ` Tian, Kevin
  2024-04-23 12:01                       ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-23  7:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jason Gunthorpe, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 20, 2024 12:36 AM
> 
> On Fri, 19 Apr 2024 05:52:01 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 19, 2024 4:38 AM
> > >
> > > On Thu, 18 Apr 2024 17:03:15 +0800
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > On 2024/4/18 08:06, Tian, Kevin wrote:
> > > > >> From: Alex Williamson <alex.williamson@redhat.com>
> > > > >> Sent: Thursday, April 18, 2024 7:02 AM
> > > > >>
> > > > >> But we don't actually expose the PASID capability on the PF and as
> > > > >> argued in path 4/ we can't because it would break existing userspace.
> > > > > > Come back to this statement.
> > > > >
> > > > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > > > to the read-only PASID capability, or just a conceptually functional
> > > > > break i.e. non-faithful emulation due to writes being dropped?
> > >
> > > I expect more the latter.
> > >
> > > > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > > > capability on the PF as a sane guest shouldn't enable the PASID
> > > > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > > > status bit defined in the PASID capability to check back so even
> > > > > if an insane guest wants to blindly enable PASID it will naturally
> > > > > write and done. The only niche case is that the enable bits are
> > > > > defined as RW so ideally reading back those bits should get the
> > > > > latest written value. But probably this can be tolerated?
> > >
> > > Some degree of inconsistency is likely tolerated, the guest is unlikely
> > > to check that a RW bit was set or cleared.  How would we virtualize the
> > > control registers for a VF and are they similarly virtualized for a PF
> > > or would we allow the guest to manipulate the physical PASID control
> > > registers?
> >
> > it's shared so the guest shouldn't be allowed to touch the physical
> > register.
> >
> > Even for PF this is virtualized as the physical control is toggled by
> > the iommu driver today. We discussed before whether there is a
> > value moving the control to device driver but the conclusion is no.
> 
> So in both cases we virtualize the PASID bits in the vfio variant
> driver in order to maintain spec compliant behavior of the register
> (ie. the control bits are RW with no underlying hardware effect and
> capability bits only reflect the features enabled by the host in the
> control register)?

yes

> 
> > > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> > > >
> > > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > > > This is a bit different from what we planned at the beginning. But
> sounds
> > > > doable if we want to pursue the staging direction.
> > >
> > > Seems like if we decide that we can just expose the PASID capability
> > > for a PF then we should just have any VF variant drivers also implement
> > > a virtual PASID capability.  In this case DVSEC would only be used to
> >
> > I'm leaning toward this direction now.
> >
> > > provide information for a purely userspace emulation of PASID (in which
> > > case it also wouldn't necessarily need the vfio feature because it
> > > might implicitly know the PASID capabilities of the device).  Thanks,
> > >
> >
> > that's a good point. Then no new contract is required.
> >
> > and allowing variant driver to implement a virtual PASID capability
> > seems also make a room for making a shared variant driver to host
> > a table of virtual capabilities (both offset and content) for VFs, just
> > as discussed in patch4 having a shared driver to host a table for DVSEC?
> 
> Yes, vfio-pci-core would support virtualizing the PF PASID capability
> mapped 1:1 at the physical PASID location.  We should architect that
> support to be easily reused for a driver provided offset for the VF use
> case and then we'd need to decide if a lookup table to associate an
> offset to a VF vendor:device warrants a variant driver (which could be
> shared by multiple devices) or if we'd accept that into vfio-pci-core.

yes

> 
> > Along this route probably most vendors will just extend the table in
> > the shared driver, leading to decreased value on DVSEC and question
> > on its necessity...
> >
> > then it's back to the quirk-in-kernel approach... but if simple enough
> > probably not a bad idea to pursue? 😊
> 
> A DVSEC to express unused config space could still support a generic
> vfio-pci-core or variant driver implementation of PASID virtualization.
> The table lookup would provide a device-specific quirk to a base
> implementation of carving it from DVSEC reported free space.
> 
> The question of whether it should be in kernel or userspace is
> difficult.  There are certainly other capabilities where vfio-pci
> exposes RW registers as read-only and we rely on the userspace VMM to
> emulate them.  We could consider this one of those cases so long as the
> change of exposing PASID as a read-only capability is tolerated for old
> QEMU, new kernel.

old QEMU/new kernel seems OK according to discussions in this thread.

> 
> Then come VFs.  AFAIK, it would not be possible for an unprivileged
> QEMU to inspect the PASID state for the PF.  Therefore I think vfio
> needs to provide that information either in-band (ie. emulated PASID) or
> out-of-band (device feature).  At that point, there's some kernel code
> regardless, which leans me towards virtualizing in the kernel.  I'd
> welcome a complete, coherent proposal that could be done in userspace
> though.  Thanks,
> 

I'm not sure how userspace can fully handle this w/o certain assistance
from the kernel.

So I kind of agree that emulated PASID capability is probably the only
contract which the kernel should provide:
  - mapped 1:1 at the physical location, or
  - constructed at an offset according to DVSEC, or
  - constructed at an offset according to a look-up table

The VMM always scans the vfio pci config space to expose vPASID.

Then the remaining open is what VMM could do when a VF supports
PASID but unfortunately it's not reported by vfio. W/o the capability
of inspecting the PASID state of PF, probably the only feasible option
is to maintain a look-up table in VMM itself and assumes the kernel
always enables the PASID cap on PF.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-19 13:59                 ` Jason Gunthorpe
@ 2024-04-23  7:58                   ` Yi Liu
  2024-04-23 12:05                     ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-23  7:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/19 21:59, Jason Gunthorpe wrote:
> On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:
> 
>> Some degree of inconsistency is likely tolerated, the guest is unlikely
>> to check that a RW bit was set or cleared.  How would we virtualize the
>> control registers for a VF and are they similarly virtualized for a PF
>> or would we allow the guest to manipulate the physical PASID control
>> registers?
> 
> No, the OS owns the physical PASID control. If the platform IOMMU
> knows how to parse PASID then PASID support is turned on and left on
> at boot time.

I think you mean host os. right?

> There should be no guest visible difference to not supporting global
> PASID disable, and we can't even implement it for VFs anyhow.
> 
> Same sort of argument for ATS/etc
> 
>>> If kernel exposes pasid cap for PF same as other caps, and in the meantime
>>> the variant driver chooses to emulate a DVSEC cap, then userspace follows
>>> the below steps to expose pasid cap to VM.
>>
>> If we have a variant driver, why wouldn't it expose an emulated PASID
>> capability rather than a DVSEC if we're choosing to expose PASID for
>> PFs?
> 
> Indeed, also an option. Supplying the DVSEC is probably simpler and
> addresses other synthesized capability blocks in future. VMM is a
> better place to build various synthetic blocks in general, IMHO.
> 
> New VMM's could parse the PF PASID cap and add it to its list of "free
> space"
> 
> We may also be overdoing it here..
> 
> Maybe if the VMM wants to enable PASID we should flip the logic and
> the VMM should assume that unused config space is safe to use. Only
> devices that violate that rule need to join an ID list and provide a
> DVSEC/free space list/etc.

So, if the kernel decides to hide a specific physical capability, the
space of this capability would be considered as free to use as well.
is it?

> I'm guessing that list will be pretty small and hopefully will not
> grow.

any channel to collect this kind of info? :)

> It is easy and better for future devices to wrap their hidden
> registers in a private DVSEC.

hmmm, do you mean include the registers a DVSEC hence userspace can
work out the free space by iterating the cap chain? or still mean
indicating the free spaces by DVSEC? I guess the prior one.

> Then I'd suggest just writing the special list in a text file and
> leaving it in the VMM side.. Users can adjust the text file right away
> if they have old and troublesome devices and all VMMs can share it.

So for the existing devices that have both pasid cap and hidden registers,
userspace should add them in the special list, and work out the free space
by referring the file. While for the devices that only have pasid cap, or
have the hidden register in a DVSEC, userspace finds a free space by
iterating the cap chain. This seems to be general for today and future.

>>> 1) Check if a pasid cap is already present in the virtual config space
>>>      read from kernel. If no, but user wants pasid, then goto step 2).
>>> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>>>      pasid cap. If yes, goto step 3).
>>
>> Why do we need the vfio feature interface if a physical or virtual PASID
>> capability on the device exposes the same info?
> 
> Still need to check if the platform, os, iommu, etc are all OK with
> enabling PASID support before the viommu advertises it.

This means we don't expose physical or virtual PASID cap, is it? Otherwise,
host kernel could check if pasid is enabled before exposing the PASID cap.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23  7:43                     ` Tian, Kevin
@ 2024-04-23 12:01                       ` Jason Gunthorpe
  2024-04-23 23:47                         ` Tian, Kevin
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> I'm not sure how userspace can fully handle this w/o certain assistance
> from the kernel.
> 
> So I kind of agree that emulated PASID capability is probably the only
> contract which the kernel should provide:
>   - mapped 1:1 at the physical location, or
>   - constructed at an offset according to DVSEC, or
>   - constructed at an offset according to a look-up table
> 
> The VMM always scans the vfio pci config space to expose vPASID.
> 
> Then the remaining open is what VMM could do when a VF supports
> PASID but unfortunately it's not reported by vfio. W/o the capability
> of inspecting the PASID state of PF, probably the only feasible option
> is to maintain a look-up table in VMM itself and assumes the kernel
> always enables the PASID cap on PF.

I'm still not sure I like doing this in the kernel - we need to do the
same sort of thing for ATS too, right?

It feels simpler if the indicates if PASID and ATS can be supported
and userspace builds the capability blocks.

There are migration considerations too - the blocks need to be
migrated over and end up in the same place as well..

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23  7:58                   ` Yi Liu
@ 2024-04-23 12:05                     ` Jason Gunthorpe
  0 siblings, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:05 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On Tue, Apr 23, 2024 at 03:58:17PM +0800, Yi Liu wrote:
> > > > 1) Check if a pasid cap is already present in the virtual config space
> > > >      read from kernel. If no, but user wants pasid, then goto step 2).
> > > > 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
> > > >      pasid cap. If yes, goto step 3).
> > > 
> > > Why do we need the vfio feature interface if a physical or virtual PASID
> > > capability on the device exposes the same info?
> > 
> > Still need to check if the platform, os, iommu, etc are all OK with
> > enabling PASID support before the viommu advertises it.
> 
> This means we don't expose physical or virtual PASID cap, is it?

Yeah keep hiding both still works. Some kind of test to see if PASID
and ATS are supportable on the device. Probably via
IOMMUFD_CMD_GET_HW_INFO

If they are supported then the VMM will find empty space and generate
the missing caps. The VMM will have means to fix the location during
migration.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
  2024-04-16  9:40   ` Tian, Kevin
  2024-04-16 17:57   ` Alex Williamson
@ 2024-04-23 12:39   ` Jason Gunthorpe
  2024-04-24  0:24     ` Tian, Kevin
  2 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:39 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan

On Fri, Apr 12, 2024 at 01:21:21AM -0700, Yi Liu wrote:
> Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> other PCI capabilities, PASID capability is going to be reported to user by
> VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by it.
> This is a bit different from the other capabilities which are reported to
> userspace when the user reads the device's PCI configuration space. There
> are two reasons for this.

I'm thinking this probably does not belong in VFIO, iommufd should
report what the device, driver and OS is able to do with this
device. PASID support is at least 50% an iommu property too.

This is a seperate issue to forming the config space.

I didn't notice anything about SIOV in this, are we tackling it later?

IIRC we need the vIOMMU to specify a vPASID during attach and somehow
that gets mapped into a pPASID and synchronized with the KVM ENQCMD
translation?

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
  2024-04-16  9:01   ` Tian, Kevin
@ 2024-04-23 12:43   ` Jason Gunthorpe
  2024-04-24  0:33     ` Tian, Kevin
  2024-04-24  4:48     ` Yi Liu
  1 sibling, 2 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:43 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan, Matthew Wilcox

On Fri, Apr 12, 2024 at 01:21:19AM -0700, Yi Liu wrote:
> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
> +					    u32 pasid, u32 *pt_id)
> +{
> +	int rc;
> +
> +	lockdep_assert_held(&vdev->dev_set->lock);
> +
> +	if (WARN_ON(!vdev->iommufd_device))
> +		return -EINVAL;
> +
> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);

A helper inline

    bool ida_is_allocate(&ida, id)

Would be nicer for that

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index cb5b7f865d58..e0198851ffd2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
>  	.unbind_iommufd	= vfio_iommufd_physical_unbind,
>  	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
>  	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
> +	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
> +	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
>  };

This should be copied into mlx5 and nvgrace-gpu at least as well

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
  2024-04-16  9:13   ` Tian, Kevin
@ 2024-04-23 12:45   ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:45 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan

On Fri, Apr 12, 2024 at 01:21:20AM -0700, Yi Liu wrote:
> This adds ioctls for the userspace to attach/detach a given pasid of a
> vfio device to/from an IOAS/HWPT.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio.h        |  4 +++
>  drivers/vfio/vfio_main.c   |  8 ++++++
>  include/uapi/linux/vfio.h  | 55 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 118 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23 12:01                       ` Jason Gunthorpe
@ 2024-04-23 23:47                         ` Tian, Kevin
  2024-04-24  0:12                           ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-23 23:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:02 PM
> 
> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > I'm not sure how userspace can fully handle this w/o certain assistance
> > from the kernel.
> >
> > So I kind of agree that emulated PASID capability is probably the only
> > contract which the kernel should provide:
> >   - mapped 1:1 at the physical location, or
> >   - constructed at an offset according to DVSEC, or
> >   - constructed at an offset according to a look-up table
> >
> > The VMM always scans the vfio pci config space to expose vPASID.
> >
> > Then the remaining open is what VMM could do when a VF supports
> > PASID but unfortunately it's not reported by vfio. W/o the capability
> > of inspecting the PASID state of PF, probably the only feasible option
> > is to maintain a look-up table in VMM itself and assumes the kernel
> > always enables the PASID cap on PF.
> 
> I'm still not sure I like doing this in the kernel - we need to do the
> same sort of thing for ATS too, right?

VF is allowed to implement ATS.

PRI has the same problem as PASID.

> 
> It feels simpler if the indicates if PASID and ATS can be supported
> and userspace builds the capability blocks.

this routes back to Alex's original question about using different
interfaces (a device feature vs. PCI PASID cap) for VF and PF.

Are we OK with that divergence?

> 
> There are migration considerations too - the blocks need to be
> migrated over and end up in the same place as well..
> 

Can you elaborate what is the problem with the kernel emulating
the PASID cap in this consideration?

Does it talk about a case where the devices between src/dest are
different versions (but backward compatible) with different unused
space layout and the kernel approach may pick up different offsets
while the VMM can guarantee the same offset?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23 23:47                         ` Tian, Kevin
@ 2024-04-24  0:12                           ` Jason Gunthorpe
  2024-04-24  2:57                             ` Tian, Kevin
                                               ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-24  0:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 23, 2024 8:02 PM
> > 
> > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > from the kernel.
> > >
> > > So I kind of agree that emulated PASID capability is probably the only
> > > contract which the kernel should provide:
> > >   - mapped 1:1 at the physical location, or
> > >   - constructed at an offset according to DVSEC, or
> > >   - constructed at an offset according to a look-up table
> > >
> > > The VMM always scans the vfio pci config space to expose vPASID.
> > >
> > > Then the remaining open is what VMM could do when a VF supports
> > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > of inspecting the PASID state of PF, probably the only feasible option
> > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > always enables the PASID cap on PF.
> > 
> > I'm still not sure I like doing this in the kernel - we need to do the
> > same sort of thing for ATS too, right?
> 
> VF is allowed to implement ATS.
> 
> PRI has the same problem as PASID.

I'm surprised by this, I would have guessed ATS would be the device
global one, PRI not being per-VF seems problematic??? How do you
disable PRI generation to get a clean shutdown?

> > It feels simpler if the indicates if PASID and ATS can be supported
> > and userspace builds the capability blocks.
> 
> this routes back to Alex's original question about using different
> interfaces (a device feature vs. PCI PASID cap) for VF and PF.

I'm not sure it is different interfaces..

The only reason to pass the PF's PASID cap is to give free space to
the VMM. If we are saying that gaps are free space (excluding a list
of bad devices) then we don't acutally need to do that anymore.

VMM will always create a synthetic PASID cap and kernel will always
suppress a real one.

An iommufd query will indicate if the vIOMMU can support vPASID on
that device.

Same for all the troublesome non-physical caps.

> > There are migration considerations too - the blocks need to be
> > migrated over and end up in the same place as well..
> 
> Can you elaborate what is the problem with the kernel emulating
> the PASID cap in this consideration?

If the kernel changes the algorithm, say it wants to do PASID, PRI,
something_new then it might change the layout

We can't just have the kernel decide without also providing a way for
userspace to say what the right layout actually is. :\

> Does it talk about a case where the devices between src/dest are
> different versions (but backward compatible) with different unused
> space layout and the kernel approach may pick up different offsets
> while the VMM can guarantee the same offset?

That is also a concern where the PCI cap layout may change a bit but
they are still migration compatible, but my bigger worry is that the
kernel just lays out the fake caps in a different way because the
kernel changes.

At least if the VMM is doing this then the VMM can include the
information in its migration scheme and use it to recreate the PCI
layout withotu having to create a bunch of uAPI to do so.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-23 12:39   ` Jason Gunthorpe
@ 2024-04-24  0:24     ` Tian, Kevin
  2024-04-24 13:59       ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-24  0:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson@redhat.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:40 PM
> 
> On Fri, Apr 12, 2024 at 01:21:21AM -0700, Yi Liu wrote:
> > Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> > other PCI capabilities, PASID capability is going to be reported to user by
> > VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by
> it.
> > This is a bit different from the other capabilities which are reported to
> > userspace when the user reads the device's PCI configuration space. There
> > are two reasons for this.
> 
> I'm thinking this probably does not belong in VFIO, iommufd should
> report what the device, driver and OS is able to do with this
> device. PASID support is at least 50% an iommu property too.

We have PASID capability in both device side and iommu side.

VFIO is for the former and iommufd is for the latter.

both should report the capability only if that cap exists and is
enabled by OS.

> 
> This is a seperate issue to forming the config space.
> 
> I didn't notice anything about SIOV in this, are we tackling it later?

yes.

> 
> IIRC we need the vIOMMU to specify a vPASID during attach and somehow
> that gets mapped into a pPASID and synchronized with the KVM ENQCMD
> translation?
> 

yes, that is the original plan. More accurately the vfio attach uAPI
is always about a pPASID. The mapping will be added separately to
iommufd and synced with KVM.

But internally we are evaluating whether there is enough value
to justify adding this complexity to the kernel. It's the main burden
in SIOVr1. Given the limited usages very likely we'll only do the
basic SIOV support w/o the vPASID cap...

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-23 12:43   ` Jason Gunthorpe
@ 2024-04-24  0:33     ` Tian, Kevin
  2024-04-24  4:48     ` Yi Liu
  1 sibling, 0 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-04-24  0:33 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson@redhat.com, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Matthew Wilcox

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:44 PM
> 
> On Fri, Apr 12, 2024 at 01:21:19AM -0700, Yi Liu wrote:
> > +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
> > +					    u32 pasid, u32 *pt_id)
> > +{
> > +	int rc;
> > +
> > +	lockdep_assert_held(&vdev->dev_set->lock);
> > +
> > +	if (WARN_ON(!vdev->iommufd_device))
> > +		return -EINVAL;
> > +
> > +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> 
> A helper inline
> 
>     bool ida_is_allocate(&ida, id)
> 
> Would be nicer for that
> 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index cb5b7f865d58..e0198851ffd2 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
> >  	.unbind_iommufd	= vfio_iommufd_physical_unbind,
> >  	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
> >  	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
> > +	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
> > +	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
> >  };
> 
> This should be copied into mlx5 and nvgrace-gpu at least as well
> 

I'd prefer to the driver owners to add them separately. They know
their hardware and can do proper test.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  0:12                           ` Jason Gunthorpe
@ 2024-04-24  2:57                             ` Tian, Kevin
  2024-04-24 12:29                               ` Baolu Lu
  2024-04-24 14:04                               ` Jason Gunthorpe
  2024-04-24  5:19                             ` Tian, Kevin
  2024-04-24 18:24                             ` Alex Williamson
  2 siblings, 2 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-04-24  2:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 8:12 AM
> 
> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > >
> > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > > from the kernel.
> > > >
> > > > So I kind of agree that emulated PASID capability is probably the only
> > > > contract which the kernel should provide:
> > > >   - mapped 1:1 at the physical location, or
> > > >   - constructed at an offset according to DVSEC, or
> > > >   - constructed at an offset according to a look-up table
> > > >
> > > > The VMM always scans the vfio pci config space to expose vPASID.
> > > >
> > > > Then the remaining open is what VMM could do when a VF supports
> > > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > > of inspecting the PASID state of PF, probably the only feasible option
> > > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > > always enables the PASID cap on PF.
> > >
> > > I'm still not sure I like doing this in the kernel - we need to do the
> > > same sort of thing for ATS too, right?
> >
> > VF is allowed to implement ATS.
> >
> > PRI has the same problem as PASID.
> 
> I'm surprised by this, I would have guessed ATS would be the device
> global one, PRI not being per-VF seems problematic??? How do you
> disable PRI generation to get a clean shutdown?

Here is what the PCIe spec says:

  For SR-IOV devices, a single Page Request Interface is permitted for
  the PF and is shared between the PF and its associated VFs, in which
  case the PF implements this capability and its VFs must not.

I'll let Baolu chime in for the potential impact to his PRI cleanup
effort, e.g. whether disabling PRI generation is mandatory if the
IOMMU side is already put in a mode auto-responding error to
new PRI request instead of reporting to sw.

But I do see another problem for shared capabilities between PF/VFs.

Now those shared capabilities are enabled/disabled when the PF is
attached to/detached from a domain, w/o counting the shared usage
from VFs.

Looks we have a gap here.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-23 12:43   ` Jason Gunthorpe
  2024-04-24  0:33     ` Tian, Kevin
@ 2024-04-24  4:48     ` Yi Liu
  1 sibling, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-24  4:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan, Matthew Wilcox



On 2024/4/23 20:43, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 01:21:19AM -0700, Yi Liu wrote:
>> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
>> +					    u32 pasid, u32 *pt_id)
>> +{
>> +	int rc;
>> +
>> +	lockdep_assert_held(&vdev->dev_set->lock);
>> +
>> +	if (WARN_ON(!vdev->iommufd_device))
>> +		return -EINVAL;
>> +
>> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> 
> A helper inline
> 
>      bool ida_is_allocate(&ida, id)
> 
> Would be nicer for that

ok.

>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index cb5b7f865d58..e0198851ffd2 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
>>   	.unbind_iommufd	= vfio_iommufd_physical_unbind,
>>   	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
>>   	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
>> +	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
>> +	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
>>   };
> 
> This should be copied into mlx5 and nvgrace-gpu at least as well

looks like Kevin has a different idea on it.


-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  0:12                           ` Jason Gunthorpe
  2024-04-24  2:57                             ` Tian, Kevin
@ 2024-04-24  5:19                             ` Tian, Kevin
  2024-04-24 14:15                               ` Jason Gunthorpe
  2024-04-24 18:24                             ` Alex Williamson
  2 siblings, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-04-24  5:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 8:12 AM
> 
> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > >
> > > It feels simpler if the indicates if PASID and ATS can be supported
> > > and userspace builds the capability blocks.
> >
> > this routes back to Alex's original question about using different
> > interfaces (a device feature vs. PCI PASID cap) for VF and PF.
> 
> I'm not sure it is different interfaces..
> 
> The only reason to pass the PF's PASID cap is to give free space to
> the VMM. If we are saying that gaps are free space (excluding a list
> of bad devices) then we don't acutally need to do that anymore.
> 
> VMM will always create a synthetic PASID cap and kernel will always
> suppress a real one.

oh you suggest that there won't even be a 1:1 map for PF!

kind of continue with the device_feature method as this series does.
and it could include all VMM-emulated capabilities which are not
enumerated properly from vfio pci config space.

this interface only reports the availability/features of a capability
but never includes information about offset.

If a device implements DVSEC it will be exposed to the VMM.

Then suppose the VMM will introduce a new cmd parameter to
turn on the emulation of the PASID capability. It's default off so
legacy usages can still work.

Once the parameter is on then the VMM will emulate the PASID
capability by:

  - Locating a free range according to DVSEC, or,
  - Locating a free range from gaps between PCI caps,

If a device is found not working properly then add a fixed offset for 
this device.
 
> 
> An iommufd query will indicate if the vIOMMU can support vPASID on
> that device.
> 
> Same for all the troublesome non-physical caps.
> 
> > > There are migration considerations too - the blocks need to be
> > > migrated over and end up in the same place as well..
> >
> > Can you elaborate what is the problem with the kernel emulating
> > the PASID cap in this consideration?
> 
> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> something_new then it might change the layout
> 
> We can't just have the kernel decide without also providing a way for
> userspace to say what the right layout actually is. :\
> 

emm that's a good point.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  2:57                             ` Tian, Kevin
@ 2024-04-24 12:29                               ` Baolu Lu
  2024-04-24 14:04                               ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Baolu Lu @ 2024-04-24 12:29 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: baolu.lu, Alex Williamson, Liu, Yi L, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/24 10:57, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, April 24, 2024 8:12 AM
>>
>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>
>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>> from the kernel.
>>>>>
>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>> contract which the kernel should provide:
>>>>>    - mapped 1:1 at the physical location, or
>>>>>    - constructed at an offset according to DVSEC, or
>>>>>    - constructed at an offset according to a look-up table
>>>>>
>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>
>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>> always enables the PASID cap on PF.
>>>>
>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>> same sort of thing for ATS too, right?
>>>
>>> VF is allowed to implement ATS.
>>>
>>> PRI has the same problem as PASID.
>>
>> I'm surprised by this, I would have guessed ATS would be the device
>> global one, PRI not being per-VF seems problematic??? How do you
>> disable PRI generation to get a clean shutdown?
> 
> Here is what the PCIe spec says:
> 
>    For SR-IOV devices, a single Page Request Interface is permitted for
>    the PF and is shared between the PF and its associated VFs, in which
>    case the PF implements this capability and its VFs must not.
> 
> I'll let Baolu chime in for the potential impact to his PRI cleanup
> effort, e.g. whether disabling PRI generation is mandatory if the
> IOMMU side is already put in a mode auto-responding error to
> new PRI request instead of reporting to sw.

The PRI cleanup steps are defined like this:

  * - Disable new PRI reception: Turn off PRI generation in the IOMMU 
hardware
  *   and flush any hardware page request queues. This should be done before
  *   calling into this helper.
  * - Acknowledge all outstanding PRQs to the device: Respond to all 
outstanding
  *   page requests with IOMMU_PAGE_RESP_INVALID, indicating the device 
should
  *   not retry. This helper function handles this.
  * - Disable PRI on the device: After calling this helper, the caller could
  *   then disable PRI on the device.

Disabling PRI on the device is the last step and optional because the
IOMMU is required to support a PRI blocking state and has already been
put in that state at the first step.

For the VF case, it probably is a no-op except for maintaining a
reference count. Once PRI is disabled on all PFs and VFs, it can then be
physically disabled on the PF.

> 
> But I do see another problem for shared capabilities between PF/VFs.
> 
> Now those shared capabilities are enabled/disabled when the PF is
> attached to/detached from a domain, w/o counting the shared usage
> from VFs.
> 
> Looks we have a gap here.

Yes, there's a gap at least for the Intel IOMMU driver. I'll soon fix
this by moving the handling of ATS out of the driver, especially from
the default domain attach/detach paths.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-24  0:24     ` Tian, Kevin
@ 2024-04-24 13:59       ` Jason Gunthorpe
  0 siblings, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 13:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson@redhat.com, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On Wed, Apr 24, 2024 at 12:24:19AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 23, 2024 8:40 PM
> > 
> > On Fri, Apr 12, 2024 at 01:21:21AM -0700, Yi Liu wrote:
> > > Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> > > other PCI capabilities, PASID capability is going to be reported to user by
> > > VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by
> > it.
> > > This is a bit different from the other capabilities which are reported to
> > > userspace when the user reads the device's PCI configuration space. There
> > > are two reasons for this.
> > 
> > I'm thinking this probably does not belong in VFIO, iommufd should
> > report what the device, driver and OS is able to do with this
> > device. PASID support is at least 50% an iommu property too.
> 
> We have PASID capability in both device side and iommu side.
> 
> VFIO is for the former and iommufd is for the latter.

iommu can do the device side too, we have a device info ioctl after
all.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  2:57                             ` Tian, Kevin
  2024-04-24 12:29                               ` Baolu Lu
@ 2024-04-24 14:04                               ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 14:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, Apr 24, 2024 at 02:57:38AM +0000, Tian, Kevin wrote:
> Now those shared capabilities are enabled/disabled when the PF is
> attached to/detached from a domain, w/o counting the shared usage
> from VFs.

Urh..

This looks broken in the kernel right now?

If you attach a SVA PASID to a VF then the iommu driver will call 
pci_enable_pri(vf) which will always fail:

	/*
	 * VFs must not implement the PRI Capability.  If their PF
	 * implements PRI, it is shared by the VFs, so if the PF PRI is
	 * enabled, it is also enabled for the VF.
	 */
	if (pdev->is_virtfn) {
		if (pci_physfn(pdev)->pri_enabled)
			return 0;
		return -EINVAL;
	}

More to fix :(

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  5:19                             ` Tian, Kevin
@ 2024-04-24 14:15                               ` Jason Gunthorpe
  2024-04-24 18:38                                 ` Alex Williamson
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 14:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 24, 2024 8:12 AM
> > 
> > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > >
> > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > and userspace builds the capability blocks.
> > >
> > > this routes back to Alex's original question about using different
> > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.
> > 
> > I'm not sure it is different interfaces..
> > 
> > The only reason to pass the PF's PASID cap is to give free space to
> > the VMM. If we are saying that gaps are free space (excluding a list
> > of bad devices) then we don't acutally need to do that anymore.
> > 
> > VMM will always create a synthetic PASID cap and kernel will always
> > suppress a real one.
> 
> oh you suggest that there won't even be a 1:1 map for PF!

Right. No real need..

> kind of continue with the device_feature method as this series does.
> and it could include all VMM-emulated capabilities which are not
> enumerated properly from vfio pci config space.

1) VFIO creates the iommufd idev
2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
   etc is supported
3) VMM locates empty space in the config space
4) VMM figures out where and what cap blocks to create (considering
   migration needs/etc)
5) VMM synthesizes the blocks and ties emulation to other iommufd things

This works generically for any synthetic vPCI function including a
non-vfio-pci one.

Most likely due to migration needs the exact layout of the PCI config
space should be configured to the VMM, including the location of any
blocks copied from physical and any blocks synthezied. This is the
only way to be sure the config space is actually 100% consistent.

For non migration cases to make it automatic we can check the free
space via gaps. The broken devices that have problems with this can
either be told to use the explicit approach above,the VMM could
consult some text file, or vPASID/etc can be left disabled. IMHO the
support of PASID is so rare today this is probably fine.

Vendors should be *strongly encouraged* to wrap their special used
config space areas in DVSEC and not hide them in free space.

We may also want a DVSEC to indicate free space - but if vendors are
going to change their devices I'd rather them change to mark the used
space with DVSEC then mark the free space :)

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  0:12                           ` Jason Gunthorpe
  2024-04-24  2:57                             ` Tian, Kevin
  2024-04-24  5:19                             ` Tian, Kevin
@ 2024-04-24 18:24                             ` Alex Williamson
  2024-04-24 18:36                               ` Jason Gunthorpe
  2024-04-25  9:26                               ` Yi Liu
  2 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-04-24 18:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Tue, 23 Apr 2024 21:12:21 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > 
> > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:  
> > > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > > from the kernel.
> > > >
> > > > So I kind of agree that emulated PASID capability is probably the only
> > > > contract which the kernel should provide:
> > > >   - mapped 1:1 at the physical location, or
> > > >   - constructed at an offset according to DVSEC, or
> > > >   - constructed at an offset according to a look-up table
> > > >
> > > > The VMM always scans the vfio pci config space to expose vPASID.
> > > >
> > > > Then the remaining open is what VMM could do when a VF supports
> > > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > > of inspecting the PASID state of PF, probably the only feasible option
> > > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > > always enables the PASID cap on PF.  
> > > 
> > > I'm still not sure I like doing this in the kernel - we need to do the
> > > same sort of thing for ATS too, right?  
> > 
> > VF is allowed to implement ATS.
> > 
> > PRI has the same problem as PASID.  
> 
> I'm surprised by this, I would have guessed ATS would be the device
> global one, PRI not being per-VF seems problematic??? How do you
> disable PRI generation to get a clean shutdown?
> 
> > > It feels simpler if the indicates if PASID and ATS can be supported
> > > and userspace builds the capability blocks.  
> > 
> > this routes back to Alex's original question about using different
> > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> 
> I'm not sure it is different interfaces..
> 
> The only reason to pass the PF's PASID cap is to give free space to
> the VMM. If we are saying that gaps are free space (excluding a list
> of bad devices) then we don't acutally need to do that anymore.

Are we saying that now??  That's new.

> VMM will always create a synthetic PASID cap and kernel will always
> suppress a real one.
> 
> An iommufd query will indicate if the vIOMMU can support vPASID on
> that device.
> 
> Same for all the troublesome non-physical caps.
> 
> > > There are migration considerations too - the blocks need to be
> > > migrated over and end up in the same place as well..  
> > 
> > Can you elaborate what is the problem with the kernel emulating
> > the PASID cap in this consideration?  
> 
> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> something_new then it might change the layout
> 
> We can't just have the kernel decide without also providing a way for
> userspace to say what the right layout actually is. :\

The capability layout is only relevant to migration, right?  A variant
driver that supports migration is a prerequisite and would also be
responsible for exposing the PASID capability.  This isn't as disjoint
as it's being portrayed.

> > Does it talk about a case where the devices between src/dest are
> > different versions (but backward compatible) with different unused
> > space layout and the kernel approach may pick up different offsets
> > while the VMM can guarantee the same offset?  
> 
> That is also a concern where the PCI cap layout may change a bit but
> they are still migration compatible, but my bigger worry is that the
> kernel just lays out the fake caps in a different way because the
> kernel changes.

Outside of migration, what does it matter if the cap layout is
different?  A driver should never hard code the address for a
capability.
 
> At least if the VMM is doing this then the VMM can include the
> information in its migration scheme and use it to recreate the PCI
> layout withotu having to create a bunch of uAPI to do so.

We're again back to migration compatibility, where again the capability
layout would be governed by the migration support in the in-kernel
variant driver.  Once migration is involved the location of a PASID
shouldn't be arbitrary, whether it's provided by the kernel or the VMM.

Regardless, the VMM ultimately has the authority what the guest
sees in config space.  The VMM is not bound to expose the PASID at the
offset provided by the kernel, or bound to expose it at all.  The
kernel exposed PASID can simply provide an available location and set
of enabled capabilities.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:24                             ` Alex Williamson
@ 2024-04-24 18:36                               ` Jason Gunthorpe
  2024-04-24 20:13                                 ` Alex Williamson
  2024-04-25  9:26                               ` Yi Liu
  1 sibling, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 18:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > The only reason to pass the PF's PASID cap is to give free space to
> > the VMM. If we are saying that gaps are free space (excluding a list
> > of bad devices) then we don't acutally need to do that anymore.
> 
> Are we saying that now??  That's new.

I suggested it a few times

> 
> > VMM will always create a synthetic PASID cap and kernel will always
> > suppress a real one.
> > 
> > An iommufd query will indicate if the vIOMMU can support vPASID on
> > that device.
> > 
> > Same for all the troublesome non-physical caps.
> > 
> > > > There are migration considerations too - the blocks need to be
> > > > migrated over and end up in the same place as well..  
> > > 
> > > Can you elaborate what is the problem with the kernel emulating
> > > the PASID cap in this consideration?  
> > 
> > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > something_new then it might change the layout
> > 
> > We can't just have the kernel decide without also providing a way for
> > userspace to say what the right layout actually is. :\
> 
> The capability layout is only relevant to migration, right?  

Yes, proabbly

> A variant
> driver that supports migration is a prerequisite and would also be
> responsible for exposing the PASID capability.  This isn't as disjoint
> as it's being portrayed.

I guess..  But also not quite. We still have the problem that kernel
migration driver V1 could legitimately create a different config space
that migration driver V2

And now you are saying that the migration driver has to parse the
migration stream and readjust its own layout

And every driver needs to do this?

We can, it is a quite big bit of infrastructure I think, but sure..

I fear the VMM still has to be involved somehow because it still has
to know if the source VMM has removed any kernel created caps.

> Outside of migration, what does it matter if the cap layout is
> different?  A driver should never hard code the address for a
> capability.

Yes, talking about migration here - migration is the hardest case it
seems.
 
> > At least if the VMM is doing this then the VMM can include the
> > information in its migration scheme and use it to recreate the PCI
> > layout withotu having to create a bunch of uAPI to do so.
> 
> We're again back to migration compatibility, where again the capability
> layout would be governed by the migration support in the in-kernel
> variant driver.  Once migration is involved the location of a PASID
> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.

I wasn't going in this direction. I was thinking to make the VMM
create the config space layout that is approriate and hold it stable
as a migration ABI.

I think in practice many VMMs are going to do this anyhow unless we
put full support for config space synthesis, stable versions, and
version selection in the kernel directly. I was thinking to avoid
doing that.
 
> Regardless, the VMM ultimately has the authority what the guest
> sees in config space.  The VMM is not bound to expose the PASID at the
> offset provided by the kernel, or bound to expose it at all.  The
> kernel exposed PASID can simply provide an available location and set
> of enabled capabilities. 

And if the VMM is going to ignore the kernel layout then why do so
much work in the kernel to create it?

I think we need to decide, either only the VMM or only the kernel
should do this.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 14:15                               ` Jason Gunthorpe
@ 2024-04-24 18:38                                 ` Alex Williamson
  2024-04-24 18:45                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-24 18:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, 24 Apr 2024 11:15:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, April 24, 2024 8:12 AM
> > > 
> > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > > >
> > > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > > and userspace builds the capability blocks.  
> > > >
> > > > this routes back to Alex's original question about using different
> > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> > > 
> > > I'm not sure it is different interfaces..
> > > 
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't acutally need to do that anymore.
> > > 
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.  
> > 
> > oh you suggest that there won't even be a 1:1 map for PF!  
> 
> Right. No real need..
> 
> > kind of continue with the device_feature method as this series does.
> > and it could include all VMM-emulated capabilities which are not
> > enumerated properly from vfio pci config space.  
> 
> 1) VFIO creates the iommufd idev
> 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
>    etc is supported
> 3) VMM locates empty space in the config space
> 4) VMM figures out where and what cap blocks to create (considering
>    migration needs/etc)
> 5) VMM synthesizes the blocks and ties emulation to other iommufd things
> 
> This works generically for any synthetic vPCI function including a
> non-vfio-pci one.

Maybe this is the actual value in implementing this in the VMM, one
implementation can support multiple device interfaces.

> Most likely due to migration needs the exact layout of the PCI config
> space should be configured to the VMM, including the location of any
> blocks copied from physical and any blocks synthezied. This is the
> only way to be sure the config space is actually 100% consistent.

Where is this concern about config space arbitrarily changing coming
from?  It's possible, yes, but vfio-pci-core or a variant driver are
going to have some sort of reasoning for exposing a capability at a
given offset.  A variant driver is necessary for supporting migration,
that variant driver should be aware that capability offsets are part of
the migration contract, and QEMU will enforce identical config space
unless we introduce exceptions.

> For non migration cases to make it automatic we can check the free
> space via gaps. The broken devices that have problems with this can
> either be told to use the explicit approach above,the VMM could
> consult some text file, or vPASID/etc can be left disabled. IMHO the
> support of PASID is so rare today this is probably fine.
> 
> Vendors should be *strongly encouraged* to wrap their special used
> config space areas in DVSEC and not hide them in free space.

As in my previous reply, this is a new approach that I had thought we
weren't comfortable making, and I'm still not very comfortable with.
The fact is random registers exist outside of capabilities today and
creating a general policy that we're going to deal with that as issues
arise from a generic "find a gap" algorithm is concerning.

> We may also want a DVSEC to indicate free space - but if vendors are
> going to change their devices I'd rather them change to mark the used
> space with DVSEC then mark the free space :)

Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be
great.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:38                                 ` Alex Williamson
@ 2024-04-24 18:45                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 18:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On Wed, Apr 24, 2024 at 12:38:51PM -0600, Alex Williamson wrote:
> On Wed, 24 Apr 2024 11:15:25 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, April 24, 2024 8:12 AM
> > > > 
> > > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > > > >
> > > > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > > > and userspace builds the capability blocks.  
> > > > >
> > > > > this routes back to Alex's original question about using different
> > > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> > > > 
> > > > I'm not sure it is different interfaces..
> > > > 
> > > > The only reason to pass the PF's PASID cap is to give free space to
> > > > the VMM. If we are saying that gaps are free space (excluding a list
> > > > of bad devices) then we don't acutally need to do that anymore.
> > > > 
> > > > VMM will always create a synthetic PASID cap and kernel will always
> > > > suppress a real one.  
> > > 
> > > oh you suggest that there won't even be a 1:1 map for PF!  
> > 
> > Right. No real need..
> > 
> > > kind of continue with the device_feature method as this series does.
> > > and it could include all VMM-emulated capabilities which are not
> > > enumerated properly from vfio pci config space.  
> > 
> > 1) VFIO creates the iommufd idev
> > 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
> >    etc is supported
> > 3) VMM locates empty space in the config space
> > 4) VMM figures out where and what cap blocks to create (considering
> >    migration needs/etc)
> > 5) VMM synthesizes the blocks and ties emulation to other iommufd things
> > 
> > This works generically for any synthetic vPCI function including a
> > non-vfio-pci one.
> 
> Maybe this is the actual value in implementing this in the VMM, one
> implementation can support multiple device interfaces.
> 
> > Most likely due to migration needs the exact layout of the PCI config
> > space should be configured to the VMM, including the location of any
> > blocks copied from physical and any blocks synthezied. This is the
> > only way to be sure the config space is actually 100% consistent.
> 
> Where is this concern about config space arbitrarily changing coming
> from?

It is important for migration.

Today with the drivers we have the devices have to take care of their
own config space layout only in HW. We are not expecting migration
drivers to worry about any SW created config space. SW config space is
a new thing.

> > We may also want a DVSEC to indicate free space - but if vendors are
> > going to change their devices I'd rather them change to mark the used
> > space with DVSEC then mark the free space :)
> 
> Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be
> great

We are applying this going forward to PASID, and PRI, which barely
exist on devices at all today. We don't need to worry about those
10+yr old devices that don't support PASID/PRI in the first place.

I think this makes the problem much smaller.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:36                               ` Jason Gunthorpe
@ 2024-04-24 20:13                                 ` Alex Williamson
  2024-04-26 14:11                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-24 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Wed, 24 Apr 2024 15:36:26 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't acutally need to do that anymore.  
> > 
> > Are we saying that now??  That's new.  
> 
> I suggested it a few times
> 
> >   
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.
> > > 
> > > An iommufd query will indicate if the vIOMMU can support vPASID on
> > > that device.
> > > 
> > > Same for all the troublesome non-physical caps.
> > >   
> > > > > There are migration considerations too - the blocks need to be
> > > > > migrated over and end up in the same place as well..    
> > > > 
> > > > Can you elaborate what is the problem with the kernel emulating
> > > > the PASID cap in this consideration?    
> > > 
> > > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > > something_new then it might change the layout
> > > 
> > > We can't just have the kernel decide without also providing a way for
> > > userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?    
> 
> Yes, proabbly
> 
> > A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.  
> 
> I guess..  But also not quite. We still have the problem that kernel
> migration driver V1 could legitimately create a different config space
> that migration driver V2
> 
> And now you are saying that the migration driver has to parse the
> migration stream and readjust its own layout
> 
> And every driver needs to do this?
> 
> We can, it is a quite big bit of infrastructure I think, but sure..
> 
> I fear the VMM still has to be involved somehow because it still has
> to know if the source VMM has removed any kernel created caps.

This is kind of an absurd example to portray as a ubiquitous problem.
Typically the config space layout is a reflection of hardware whether
the device supports migration or not.  If a driver were to insert a
virtual capability, then yes it would want to be consistent about it if
it also cares about migration.  If the driver needs to change the
location of a virtual capability, problems will arise, but that's also
not something that every driver needs to do.

Also, how exactly does emulating the capability in the VMM solve this
problem?  Currently QEMU migration simply applies state to an identical
VM on the target.  QEMU doesn't modify the target VM to conform to the
data stream.  So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.

> > Outside of migration, what does it matter if the cap layout is
> > different?  A driver should never hard code the address for a
> > capability.  
> 
> Yes, talking about migration here - migration is the hardest case it
> seems.
>  
> > > At least if the VMM is doing this then the VMM can include the
> > > information in its migration scheme and use it to recreate the PCI
> > > layout withotu having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.  
> 
> I wasn't going in this direction. I was thinking to make the VMM
> create the config space layout that is approriate and hold it stable
> as a migration ABI.
> 
> I think in practice many VMMs are going to do this anyhow unless we
> put full support for config space synthesis, stable versions, and
> version selection in the kernel directly. I was thinking to avoid
> doing that.

Currently QEMU replies on determinism that a given command line results
in an identical machine configuration and identical devices.  State of
that target VM is then populated, not defined by, the migration stream.

> > Regardless, the VMM ultimately has the authority what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.   
> 
> And if the VMM is going to ignore the kernel layout then why do so
> much work in the kernel to create it?

Ok, let's not ignore it ;)

> I think we need to decide, either only the VMM or only the kernel
> should do this.

What are you actually proposing?  Are you suggesting that if a device
supports the Power Management capability it will be virtualized at
offset 0x60, if the device supports the MSI capability it will be
virtualized at 0x68,... if a device supports PASID it will be
virtualized at offset 0x300, etc...?

That's not only impractical because we can't layout all the capabilities
within the available space, but also because we will run into masking
hidden registers and devices where the driver hard codes a capability
offset.

If the VMM implements the "find a gap" solution then it's just as
subject to config space changes in hardware or provided by the variant
driver.

If not either of those, are we hard coding a device specific config
space map into the VMM or providing one on the command line?  I thought
we were using vfio-pci variant drivers and a defined vfio migration API
in order to prevent modifying the VMM for every device we want to
support migration.  Also I'd wonder if the driver itself shouldn't be
configured to provide a compatible type.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:24                             ` Alex Williamson
  2024-04-24 18:36                               ` Jason Gunthorpe
@ 2024-04-25  9:26                               ` Yi Liu
  2024-04-25 12:58                                 ` Alex Williamson
  1 sibling, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Tian, Kevin, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/25 02:24, Alex Williamson wrote:
> On Tue, 23 Apr 2024 21:12:21 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>
>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>> from the kernel.
>>>>>
>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>> contract which the kernel should provide:
>>>>>    - mapped 1:1 at the physical location, or
>>>>>    - constructed at an offset according to DVSEC, or
>>>>>    - constructed at an offset according to a look-up table
>>>>>
>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>
>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>> always enables the PASID cap on PF.
>>>>
>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>> same sort of thing for ATS too, right?
>>>
>>> VF is allowed to implement ATS.
>>>
>>> PRI has the same problem as PASID.
>>
>> I'm surprised by this, I would have guessed ATS would be the device
>> global one, PRI not being per-VF seems problematic??? How do you
>> disable PRI generation to get a clean shutdown?
>>
>>>> It feels simpler if the indicates if PASID and ATS can be supported
>>>> and userspace builds the capability blocks.
>>>
>>> this routes back to Alex's original question about using different
>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.
>>
>> I'm not sure it is different interfaces..
>>
>> The only reason to pass the PF's PASID cap is to give free space to
>> the VMM. If we are saying that gaps are free space (excluding a list
>> of bad devices) then we don't acutally need to do that anymore.
> 
> Are we saying that now??  That's new.
> 
>> VMM will always create a synthetic PASID cap and kernel will always
>> suppress a real one.
>>
>> An iommufd query will indicate if the vIOMMU can support vPASID on
>> that device.
>>
>> Same for all the troublesome non-physical caps.
>>
>>>> There are migration considerations too - the blocks need to be
>>>> migrated over and end up in the same place as well..
>>>
>>> Can you elaborate what is the problem with the kernel emulating
>>> the PASID cap in this consideration?
>>
>> If the kernel changes the algorithm, say it wants to do PASID, PRI,
>> something_new then it might change the layout
>>
>> We can't just have the kernel decide without also providing a way for
>> userspace to say what the right layout actually is. :\
> 
> The capability layout is only relevant to migration, right?  A variant
> driver that supports migration is a prerequisite and would also be
> responsible for exposing the PASID capability.  This isn't as disjoint
> as it's being portrayed.
> 
>>> Does it talk about a case where the devices between src/dest are
>>> different versions (but backward compatible) with different unused
>>> space layout and the kernel approach may pick up different offsets
>>> while the VMM can guarantee the same offset?
>>
>> That is also a concern where the PCI cap layout may change a bit but
>> they are still migration compatible, but my bigger worry is that the
>> kernel just lays out the fake caps in a different way because the
>> kernel changes.
> 
> Outside of migration, what does it matter if the cap layout is
> different?  A driver should never hard code the address for a
> capability.
>   

But it may store the offset of capability to make next cap access more
convenient. I noticted struct pci_dev stores the offset of PRI and PASID
cap. So if the layout of config space changed between src and dst, it may
result in problem in guest when guest driver uses the offsets to access
PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
msix). So there is already a problem even put aside the PRI and PASID cap.

#ifdef CONFIG_PCI_PRI
	u16		pri_cap;	/* PRI Capability offset */
	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
	unsigned int	pasid_required:1; /* PRG Response PASID Required */
#endif
#ifdef CONFIG_PCI_PASID
	u16		pasid_cap;	/* PASID Capability offset */
	u16		pasid_features;
#endif
#ifdef CONFIG_PCI_P2PDMA
	struct pci_p2pdma __rcu *p2pdma;
#endif
#ifdef CONFIG_PCI_DOE
	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
#endif
	u16		acs_cap;	/* ACS Capability offset */

https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350

>> At least if the VMM is doing this then the VMM can include the
>> information in its migration scheme and use it to recreate the PCI
>> layout withotu having to create a bunch of uAPI to do so.
> 
> We're again back to migration compatibility, where again the capability
> layout would be governed by the migration support in the in-kernel
> variant driver.  Once migration is involved the location of a PASID
> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
> 
> Regardless, the VMM ultimately has the authority what the guest
> sees in config space.  The VMM is not bound to expose the PASID at the
> offset provided by the kernel, or bound to expose it at all.  The
> kernel exposed PASID can simply provide an available location and set
> of enabled capabilities.  Thanks,
> 
> Alex
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-25  9:26                               ` Yi Liu
@ 2024-04-25 12:58                                 ` Alex Williamson
  2024-04-26  9:01                                   ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Williamson @ 2024-04-25 12:58 UTC (permalink / raw)
  To: Yi Liu
  Cc: Jason Gunthorpe, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On Thu, 25 Apr 2024 17:26:54 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/25 02:24, Alex Williamson wrote:
> > On Tue, 23 Apr 2024 21:12:21 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Tuesday, April 23, 2024 8:02 PM
> >>>>
> >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:  
> >>>>> I'm not sure how userspace can fully handle this w/o certain assistance
> >>>>> from the kernel.
> >>>>>
> >>>>> So I kind of agree that emulated PASID capability is probably the only
> >>>>> contract which the kernel should provide:
> >>>>>    - mapped 1:1 at the physical location, or
> >>>>>    - constructed at an offset according to DVSEC, or
> >>>>>    - constructed at an offset according to a look-up table
> >>>>>
> >>>>> The VMM always scans the vfio pci config space to expose vPASID.
> >>>>>
> >>>>> Then the remaining open is what VMM could do when a VF supports
> >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
> >>>>> of inspecting the PASID state of PF, probably the only feasible option
> >>>>> is to maintain a look-up table in VMM itself and assumes the kernel
> >>>>> always enables the PASID cap on PF.  
> >>>>
> >>>> I'm still not sure I like doing this in the kernel - we need to do the
> >>>> same sort of thing for ATS too, right?  
> >>>
> >>> VF is allowed to implement ATS.
> >>>
> >>> PRI has the same problem as PASID.  
> >>
> >> I'm surprised by this, I would have guessed ATS would be the device
> >> global one, PRI not being per-VF seems problematic??? How do you
> >> disable PRI generation to get a clean shutdown?
> >>  
> >>>> It feels simpler if the indicates if PASID and ATS can be supported
> >>>> and userspace builds the capability blocks.  
> >>>
> >>> this routes back to Alex's original question about using different
> >>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> >>
> >> I'm not sure it is different interfaces..
> >>
> >> The only reason to pass the PF's PASID cap is to give free space to
> >> the VMM. If we are saying that gaps are free space (excluding a list
> >> of bad devices) then we don't acutally need to do that anymore.  
> > 
> > Are we saying that now??  That's new.
> >   
> >> VMM will always create a synthetic PASID cap and kernel will always
> >> suppress a real one.
> >>
> >> An iommufd query will indicate if the vIOMMU can support vPASID on
> >> that device.
> >>
> >> Same for all the troublesome non-physical caps.
> >>  
> >>>> There are migration considerations too - the blocks need to be
> >>>> migrated over and end up in the same place as well..  
> >>>
> >>> Can you elaborate what is the problem with the kernel emulating
> >>> the PASID cap in this consideration?  
> >>
> >> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> >> something_new then it might change the layout
> >>
> >> We can't just have the kernel decide without also providing a way for
> >> userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?  A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.
> >   
> >>> Does it talk about a case where the devices between src/dest are
> >>> different versions (but backward compatible) with different unused
> >>> space layout and the kernel approach may pick up different offsets
> >>> while the VMM can guarantee the same offset?  
> >>
> >> That is also a concern where the PCI cap layout may change a bit but
> >> they are still migration compatible, but my bigger worry is that the
> >> kernel just lays out the fake caps in a different way because the
> >> kernel changes.  
> > 
> > Outside of migration, what does it matter if the cap layout is
    ^^^^^^^^^^^^^^^^^^^^
> > different?  A driver should never hard code the address for a
> > capability.
> >     
> 
> But it may store the offset of capability to make next cap access more
> convenient. I noticted struct pci_dev stores the offset of PRI and PASID
> cap. So if the layout of config space changed between src and dst, it may
> result in problem in guest when guest driver uses the offsets to access
> PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
> msix). So there is already a problem even put aside the PRI and PASID cap.

Yes, I had noted "outside of migration" above.  Config space must be
consistent to a running VM.  But the possibility of config space
changing like this only exists in the case where the driver supports
migration, so I think we're inventing an unrealistic concern that a
driver that supports migration would arbitrarily modify the config space
layout in order to make an argument for VMM managed layout.  Thanks,

Alex

> #ifdef CONFIG_PCI_PRI
> 	u16		pri_cap;	/* PRI Capability offset */
> 	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
> #endif
> #ifdef CONFIG_PCI_PASID
> 	u16		pasid_cap;	/* PASID Capability offset */
> 	u16		pasid_features;
> #endif
> #ifdef CONFIG_PCI_P2PDMA
> 	struct pci_p2pdma __rcu *p2pdma;
> #endif
> #ifdef CONFIG_PCI_DOE
> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
> #endif
> 	u16		acs_cap;	/* ACS Capability offset */
> 
> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
> 
> >> At least if the VMM is doing this then the VMM can include the
> >> information in its migration scheme and use it to recreate the PCI
> >> layout withotu having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
> > 
> > Regardless, the VMM ultimately has the authority what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.  Thanks,
> > 
> > Alex
> >   
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-25 12:58                                 ` Alex Williamson
@ 2024-04-26  9:01                                   ` Yi Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-26  9:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun

On 2024/4/25 20:58, Alex Williamson wrote:
> On Thu, 25 Apr 2024 17:26:54 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/4/25 02:24, Alex Williamson wrote:
>>> On Tue, 23 Apr 2024 21:12:21 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>    
>>>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>>>
>>>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>>>> from the kernel.
>>>>>>>
>>>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>>>> contract which the kernel should provide:
>>>>>>>     - mapped 1:1 at the physical location, or
>>>>>>>     - constructed at an offset according to DVSEC, or
>>>>>>>     - constructed at an offset according to a look-up table
>>>>>>>
>>>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>>>
>>>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>>>> always enables the PASID cap on PF.
>>>>>>
>>>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>>>> same sort of thing for ATS too, right?
>>>>>
>>>>> VF is allowed to implement ATS.
>>>>>
>>>>> PRI has the same problem as PASID.
>>>>
>>>> I'm surprised by this, I would have guessed ATS would be the device
>>>> global one, PRI not being per-VF seems problematic??? How do you
>>>> disable PRI generation to get a clean shutdown?
>>>>   
>>>>>> It feels simpler if the indicates if PASID and ATS can be supported
>>>>>> and userspace builds the capability blocks.
>>>>>
>>>>> this routes back to Alex's original question about using different
>>>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.
>>>>
>>>> I'm not sure it is different interfaces..
>>>>
>>>> The only reason to pass the PF's PASID cap is to give free space to
>>>> the VMM. If we are saying that gaps are free space (excluding a list
>>>> of bad devices) then we don't acutally need to do that anymore.
>>>
>>> Are we saying that now??  That's new.
>>>    
>>>> VMM will always create a synthetic PASID cap and kernel will always
>>>> suppress a real one.
>>>>
>>>> An iommufd query will indicate if the vIOMMU can support vPASID on
>>>> that device.
>>>>
>>>> Same for all the troublesome non-physical caps.
>>>>   
>>>>>> There are migration considerations too - the blocks need to be
>>>>>> migrated over and end up in the same place as well..
>>>>>
>>>>> Can you elaborate what is the problem with the kernel emulating
>>>>> the PASID cap in this consideration?
>>>>
>>>> If the kernel changes the algorithm, say it wants to do PASID, PRI,
>>>> something_new then it might change the layout
>>>>
>>>> We can't just have the kernel decide without also providing a way for
>>>> userspace to say what the right layout actually is. :\
>>>
>>> The capability layout is only relevant to migration, right?  A variant
>>> driver that supports migration is a prerequisite and would also be
>>> responsible for exposing the PASID capability.  This isn't as disjoint
>>> as it's being portrayed.
>>>    
>>>>> Does it talk about a case where the devices between src/dest are
>>>>> different versions (but backward compatible) with different unused
>>>>> space layout and the kernel approach may pick up different offsets
>>>>> while the VMM can guarantee the same offset?
>>>>
>>>> That is also a concern where the PCI cap layout may change a bit but
>>>> they are still migration compatible, but my bigger worry is that the
>>>> kernel just lays out the fake caps in a different way because the
>>>> kernel changes.
>>>
>>> Outside of migration, what does it matter if the cap layout is
>      ^^^^^^^^^^^^^^^^^^^^
>>> different?  A driver should never hard code the address for a
>>> capability.
>>>      
>>
>> But it may store the offset of capability to make next cap access more
>> convenient. I noticted struct pci_dev stores the offset of PRI and PASID
>> cap. So if the layout of config space changed between src and dst, it may
>> result in problem in guest when guest driver uses the offsets to access
>> PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
>> msix). So there is already a problem even put aside the PRI and PASID cap.
> 
> Yes, I had noted "outside of migration" above.  Config space must be
> consistent to a running VM.  But the possibility of config space
> changing like this only exists in the case where the driver supports
> migration, so I think we're inventing an unrealistic concern that a
> driver that supports migration would arbitrarily modify the config space
> layout in order to make an argument for VMM managed layout.  Thanks,

I was considering a case in which the src VM has a v1 hw device, while the
dst VM has a v2 hw device. Due to whatever reasons, the config space
layouts are different between v1 and v2 hw. Since the current vfio copies
the physical layout (except for the hidden caps) to VMM, the layout between
the src and dst VM would be different. This would result in problem since 
the cap offsets are stale. It seems to be a problem you mentioned in
another email of this thread[1]. :)

[1] "So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.
" 
https://lore.kernel.org/linux-iommu/20240424141349.376bdbf9.alex.williamson@redhat.com/


> Alex
> 
>> #ifdef CONFIG_PCI_PRI
>> 	u16		pri_cap;	/* PRI Capability offset */
>> 	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
>> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
>> #endif
>> #ifdef CONFIG_PCI_PASID
>> 	u16		pasid_cap;	/* PASID Capability offset */
>> 	u16		pasid_features;
>> #endif
>> #ifdef CONFIG_PCI_P2PDMA
>> 	struct pci_p2pdma __rcu *p2pdma;
>> #endif
>> #ifdef CONFIG_PCI_DOE
>> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
>> #endif
>> 	u16		acs_cap;	/* ACS Capability offset */
>>
>> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
>>
>>>> At least if the VMM is doing this then the VMM can include the
>>>> information in its migration scheme and use it to recreate the PCI
>>>> layout withotu having to create a bunch of uAPI to do so.
>>>
>>> We're again back to migration compatibility, where again the capability
>>> layout would be governed by the migration support in the in-kernel
>>> variant driver.  Once migration is involved the location of a PASID
>>> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
>>>
>>> Regardless, the VMM ultimately has the authority what the guest
>>> sees in config space.  The VMM is not bound to expose the PASID at the
>>> offset provided by the kernel, or bound to expose it at all.  The
>>> kernel exposed PASID can simply provide an available location and set
>>> of enabled capabilities.  Thanks,
>>>
>>> Alex
>>>    
>>
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 20:13                                 ` Alex Williamson
@ 2024-04-26 14:11                                   ` Jason Gunthorpe
  2024-04-26 20:13                                     ` Alex Williamson
  2024-04-27  5:05                                     ` Christoph Hellwig
  0 siblings, 2 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-26 14:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:

> This is kind of an absurd example to portray as a ubiquitous problem.
> Typically the config space layout is a reflection of hardware whether
> the device supports migration or not.

Er, all our HW has FW constructed config space. It changes with FW
upgrades. We change it during the life of the product. This has to be
considered..

> If a driver were to insert a
> virtual capability, then yes it would want to be consistent about it if
> it also cares about migration.  If the driver needs to change the
> location of a virtual capability, problems will arise, but that's also
> not something that every driver needs to do.

Well, mlx5 has to cope with this. It supports so many devices with so
many config space layouts :( I don't know if we can just hard wire an
offset to stick in a PASID cap and expect that to work...

> Also, how exactly does emulating the capability in the VMM solve this
> problem?  Currently QEMU migration simply applies state to an identical
> VM on the target.  QEMU doesn't modify the target VM to conform to the
> data stream.  So in either case, the problem might be more along the
> lines of how to make a V1 device from a V2 driver, which is more the
> device type/flavor/persona problem.

Yes, it doesn't solve anything, it just puts the responsibility for
something that is very complicated in userspace where there are more
options to configure and customize it to the environment.

> Currently QEMU replies on determinism that a given command line results
> in an identical machine configuration and identical devices.  State of
> that target VM is then populated, not defined by, the migration stream.

But that won't be true if the kernel is making decisions. The config
space layout depends now on the kernel driver version too.

> > I think we need to decide, either only the VMM or only the kernel
> > should do this.
> 
> What are you actually proposing?

Okay, what I'm thinking about is a text file that describes the vPCI
function configuration space to create. The community will standardize
this and VMMs will have to implement to get PASID/etc. Maybe the
community will provide a BSD licensed library to do this job.

The text file allows the operator to specify exactly the configuration
space the VFIO function should have. It would not be derived
automatically from physical. AFAIK qemu does not have this capability
currently.

This reflects my observation and discussions around the live migration
standardization. I belive we are fast reaching a point where this is
required.

Consider standards based migration between wildly different
devices. The devices will not standardize their physical config space,
but an operator could generate a consistent vPCI config space that
works with all the devices in their fleet.

Consider the usual working model of the large operators - they define
instance types with some regularity. But an instance type is fixed in
concrete once it is specified, things like the vPCI config space are
fixed.

Running Instance A on newer hardware with a changed physical config
space should continue to present Instance A's vPCI config layout
regardless. Ie Instance A might not support PASID but Instance B can
run on newer HW that does. The config space layout depends on the
requested Instance Type, not the physical layout.

The auto-configuration of the config layout from physical is a nice
feature and is excellent for development/small scale, but it shouldn't
be the only way to work.

So - if we accept that text file configuration should be something the
VMM supports then let's reconsider how to solve the PASID problem.

I'd say the way to solve it should be via a text file specifying a
full config space layout that includes the PASID cap. From the VMM
perspective this works fine, and it ports to every VMM directly via
processing the text file.

The autoconfiguration use case can be done by making a tool build the
text file by deriving it from physical, much like today. The single
instance of that tool could have device specific knowledge to avoid
quirks. This way the smarts can still be shared by all the VMMs
without going into the kernel. Special devices with hidden config
space could get special quirks or special reference text files into
the tool repo.

Serious operators doing production SRIOV/etc would negotiate the text
file with the HW vendors when they define their Instance Type. Ideally
these reference text files would be contributed to the tool repo
above. I think there would be some nice idea to define fully open
source Instance Types that include VFIO devices too.

Is it too much of a fantasy?

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 14:11                                   ` Jason Gunthorpe
@ 2024-04-26 20:13                                     ` Alex Williamson
  2024-04-28  6:19                                       ` Tian, Kevin
  2024-04-29 17:44                                       ` Jason Gunthorpe
  2024-04-27  5:05                                     ` Christoph Hellwig
  1 sibling, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-04-26 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Fri, 26 Apr 2024 11:11:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.  
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

So as I understand it, the concern is that you have firmware that
supports migration, but it also openly hostile to the fundamental
aspects of exposing a stable device ABI in support of migration.

> > If a driver were to insert a
> > virtual capability, then yes it would want to be consistent about it if
> > it also cares about migration.  If the driver needs to change the
> > location of a virtual capability, problems will arise, but that's also
> > not something that every driver needs to do.  
> 
> Well, mlx5 has to cope with this. It supports so many devices with so
> many config space layouts :( I don't know if we can just hard wire an
> offset to stick in a PASID cap and expect that to work...
> 
> > Also, how exactly does emulating the capability in the VMM solve this
> > problem?  Currently QEMU migration simply applies state to an identical
> > VM on the target.  QEMU doesn't modify the target VM to conform to the
> > data stream.  So in either case, the problem might be more along the
> > lines of how to make a V1 device from a V2 driver, which is more the
> > device type/flavor/persona problem.  
> 
> Yes, it doesn't solve anything, it just puts the responsibility for
> something that is very complicated in userspace where there are more
> options to configure and customize it to the environment.
> 
> > Currently QEMU replies on determinism that a given command line results
> > in an identical machine configuration and identical devices.  State of
> > that target VM is then populated, not defined by, the migration stream.  
> 
> But that won't be true if the kernel is making decisions. The config
> space layout depends now on the kernel driver version too.

But in the cases where we support migration there's a device specific
variant driver that supports that migration.  It's the job of that
variant driver to not only export and import the device state, but also
to provide a consistent ABI to the user, which includes the config
space layout.  I don't understand why we'd say the device programming
ABI itself falls within the purview of the device/variant driver, but
PCI config space is defined by device specific code at a higher level.

> > > I think we need to decide, either only the VMM or only the kernel
> > > should do this.  
> > 
> > What are you actually proposing?  
> 
> Okay, what I'm thinking about is a text file that describes the vPCI
> function configuration space to create. The community will standardize
> this and VMMs will have to implement to get PASID/etc. Maybe the
> community will provide a BSD licensed library to do this job.
> 
> The text file allows the operator to specify exactly the configuration
> space the VFIO function should have. It would not be derived
> automatically from physical. AFAIK qemu does not have this capability
> currently.
> 
> This reflects my observation and discussions around the live migration
> standardization. I belive we are fast reaching a point where this is
> required.
> 
> Consider standards based migration between wildly different
> devices. The devices will not standardize their physical config space,
> but an operator could generate a consistent vPCI config space that
> works with all the devices in their fleet.
> 
> Consider the usual working model of the large operators - they define
> instance types with some regularity. But an instance type is fixed in
> concrete once it is specified, things like the vPCI config space are
> fixed.
> 
> Running Instance A on newer hardware with a changed physical config
> space should continue to present Instance A's vPCI config layout
> regardless. Ie Instance A might not support PASID but Instance B can
> run on newer HW that does. The config space layout depends on the
> requested Instance Type, not the physical layout.
> 
> The auto-configuration of the config layout from physical is a nice
> feature and is excellent for development/small scale, but it shouldn't
> be the only way to work.
> 
> So - if we accept that text file configuration should be something the
> VMM supports then let's reconsider how to solve the PASID problem.
> 
> I'd say the way to solve it should be via a text file specifying a
> full config space layout that includes the PASID cap. From the VMM
> perspective this works fine, and it ports to every VMM directly via
> processing the text file.
> 
> The autoconfiguration use case can be done by making a tool build the
> text file by deriving it from physical, much like today. The single
> instance of that tool could have device specific knowledge to avoid
> quirks. This way the smarts can still be shared by all the VMMs
> without going into the kernel. Special devices with hidden config
> space could get special quirks or special reference text files into
> the tool repo.
> 
> Serious operators doing production SRIOV/etc would negotiate the text
> file with the HW vendors when they define their Instance Type. Ideally
> these reference text files would be contributed to the tool repo
> above. I think there would be some nice idea to define fully open
> source Instance Types that include VFIO devices too.

Regarding "if we accept that text file configuration should be
something the VMM supports", I'm not on board with this yet, so
applying it to PASID discussion seems premature.

We've developed variant drivers specifically to host the device specific
aspects of migration support.  The requirement of a consistent config
space layout is a problem that only exists relative to migration.  This
is an issue that I would have considered the responsibility of the
variant driver, which would likely expect a consistent interface from
the hardware/firmware.  Why does hostile firmware suddenly make it the
VMM's problem to provide a consistent ABI to the config space of the
device rather than the variant driver?

Obviously config maps are something that a VMM could do, but it also
seems to impose a non-trivial burden that every VMM requires an
implementation of a config space map and integration for each device
rather than simply expecting the exposed config space of the device to
be part of the migration ABI.  Also this solution specifically only
addresses config space compatibility without considering the more
generic issue that a variant driver can expose different device
personas.  A versioned persona and config space virtualization in the
variant driver is a much more flexible solution.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 14:11                                   ` Jason Gunthorpe
  2024-04-26 20:13                                     ` Alex Williamson
@ 2024-04-27  5:05                                     ` Christoph Hellwig
  1 sibling, 0 replies; 102+ messages in thread
From: Christoph Hellwig @ 2024-04-27  5:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Liu, Yi L, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On Fri, Apr 26, 2024 at 11:11:17AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

FYI, the same is true for just about any storage device I know.  Maybe
not all of the config space, but definitely parts of it.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 20:13                                     ` Alex Williamson
@ 2024-04-28  6:19                                       ` Tian, Kevin
  2024-04-29  7:43                                         ` Yi Liu
  2024-04-29 17:15                                         ` Jason Gunthorpe
  2024-04-29 17:44                                       ` Jason Gunthorpe
  1 sibling, 2 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-04-28  6:19 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 27, 2024 4:14 AM
> 
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> >
> > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > Typically the config space layout is a reflection of hardware whether
> > > the device supports migration or not.
> >
> > Er, all our HW has FW constructed config space. It changes with FW
> > upgrades. We change it during the life of the product. This has to be
> > considered..
> 
> So as I understand it, the concern is that you have firmware that
> supports migration, but it also openly hostile to the fundamental
> aspects of exposing a stable device ABI in support of migration.
> 
> > > If a driver were to insert a
> > > virtual capability, then yes it would want to be consistent about it if
> > > it also cares about migration.  If the driver needs to change the
> > > location of a virtual capability, problems will arise, but that's also
> > > not something that every driver needs to do.
> >
> > Well, mlx5 has to cope with this. It supports so many devices with so
> > many config space layouts :( I don't know if we can just hard wire an
> > offset to stick in a PASID cap and expect that to work...

Are those config space layout differences usually also coming with
mmio-side interface change? If yes there are more to handle for
running V1 instance on V2 device and it'd make sense to manage
everything about compatibility in one place.

If we pursue the direction deciding the vconfig layout in VMM, does
it imply that anything related to mmio layout would also be put in
VMM too?

e.g. it's not unusual to see a device mmio layout as:

	REG_BASE:  base addr of a register block in a BAR
	REG_FEAT1_OFF: feature1 regs offset in the register block
	REG_FEAT2_OFF: feature2 regs offset in the register block
	...

Driver accesses registers according to those read-only offsets.

A FW upgrade may lead to change of offsets but functions stay the
same. An instance created on an old version can be migrated to 
the new version as long as accesses to old offsets are trapped
and routed to the new offsets.

Do we envision things like above in the variant driver or in VMM?

> > >
> > > What are you actually proposing?
> >
> > Okay, what I'm thinking about is a text file that describes the vPCI
> > function configuration space to create. The community will standardize
> > this and VMMs will have to implement to get PASID/etc. Maybe the
> > community will provide a BSD licensed library to do this job.
> >
> > The text file allows the operator to specify exactly the configuration
> > space the VFIO function should have. It would not be derived
> > automatically from physical. AFAIK qemu does not have this capability
> > currently.
> >
> > This reflects my observation and discussions around the live migration
> > standardization. I belive we are fast reaching a point where this is
> > required.
> >
> > Consider standards based migration between wildly different
> > devices. The devices will not standardize their physical config space,
> > but an operator could generate a consistent vPCI config space that
> > works with all the devices in their fleet.

It's hard to believe that 'wildly different' devices only have difference
in the layout of vPCI config space. 

> >
> > Consider the usual working model of the large operators - they define
> > instance types with some regularity. But an instance type is fixed in
> > concrete once it is specified, things like the vPCI config space are
> > fixed.
> >
> > Running Instance A on newer hardware with a changed physical config
> > space should continue to present Instance A's vPCI config layout
> > regardless. Ie Instance A might not support PASID but Instance B can
> > run on newer HW that does. The config space layout depends on the
> > requested Instance Type, not the physical layout.
> >
> > The auto-configuration of the config layout from physical is a nice
> > feature and is excellent for development/small scale, but it shouldn't
> > be the only way to work.
> >
> > So - if we accept that text file configuration should be something the
> > VMM supports then let's reconsider how to solve the PASID problem.
> >
> > I'd say the way to solve it should be via a text file specifying a
> > full config space layout that includes the PASID cap. From the VMM
> > perspective this works fine, and it ports to every VMM directly via
> > processing the text file.
> >
> > The autoconfiguration use case can be done by making a tool build the
> > text file by deriving it from physical, much like today. The single
> > instance of that tool could have device specific knowledge to avoid
> > quirks. This way the smarts can still be shared by all the VMMs
> > without going into the kernel. Special devices with hidden config
> > space could get special quirks or special reference text files into
> > the tool repo.
> >
> > Serious operators doing production SRIOV/etc would negotiate the text
> > file with the HW vendors when they define their Instance Type. Ideally
> > these reference text files would be contributed to the tool repo
> > above. I think there would be some nice idea to define fully open
> > source Instance Types that include VFIO devices too.
> 
> Regarding "if we accept that text file configuration should be
> something the VMM supports", I'm not on board with this yet, so
> applying it to PASID discussion seems premature.
> 
> We've developed variant drivers specifically to host the device specific
> aspects of migration support.  The requirement of a consistent config
> space layout is a problem that only exists relative to migration.  This
> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?
> 
> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  Also this solution specifically only
> addresses config space compatibility without considering the more
> generic issue that a variant driver can expose different device
> personas.  A versioned persona and config space virtualization in the
> variant driver is a much more flexible solution.  Thanks,
> 

and looks this community lacks of a clear criteria on what burden
should be put in the kernel vs. in the VMM.

e.g. in earlier nvgrace-gpu discussion a major open was whether
the PCI bar emulation should be done by the variant driver or
by the VMM (with variant driver providing a device feature).

It ends up to be in the variant driver with one major argument
that doing so avoids the burden in various VMMs.

But now seems the 'text-file' proposal heads the opposite direction?

btw while this discussion may continue some time, I wonder whether
this vPASID reporting open can be handled separately from the
pasid attach/detach series so we can move the ball and merge
something already in agreement. anyway it's just a read-only cap so
won't affect how VFIO/IOMMUFD handles the pasid related requests.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-28  6:19                                       ` Tian, Kevin
@ 2024-04-29  7:43                                         ` Yi Liu
  2024-04-29 17:15                                         ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-04-29  7:43 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Jason Gunthorpe
  Cc: joro@8bytes.org, robin.murphy@arm.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On 2024/4/28 14:19, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Saturday, April 27, 2024 4:14 AM
>>
>> On Fri, 26 Apr 2024 11:11:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>>>
>>>> This is kind of an absurd example to portray as a ubiquitous problem.
>>>> Typically the config space layout is a reflection of hardware whether
>>>> the device supports migration or not.
>>>
>>> Er, all our HW has FW constructed config space. It changes with FW
>>> upgrades. We change it during the life of the product. This has to be
>>> considered..
>>
>> So as I understand it, the concern is that you have firmware that
>> supports migration, but it also openly hostile to the fundamental
>> aspects of exposing a stable device ABI in support of migration.
>>
>>>> If a driver were to insert a
>>>> virtual capability, then yes it would want to be consistent about it if
>>>> it also cares about migration.  If the driver needs to change the
>>>> location of a virtual capability, problems will arise, but that's also
>>>> not something that every driver needs to do.
>>>
>>> Well, mlx5 has to cope with this. It supports so many devices with so
>>> many config space layouts :( I don't know if we can just hard wire an
>>> offset to stick in a PASID cap and expect that to work...
> 
> Are those config space layout differences usually also coming with
> mmio-side interface change? If yes there are more to handle for
> running V1 instance on V2 device and it'd make sense to manage
> everything about compatibility in one place.
> 
> If we pursue the direction deciding the vconfig layout in VMM, does
> it imply that anything related to mmio layout would also be put in
> VMM too?
> 
> e.g. it's not unusual to see a device mmio layout as:
> 
> 	REG_BASE:  base addr of a register block in a BAR
> 	REG_FEAT1_OFF: feature1 regs offset in the register block
> 	REG_FEAT2_OFF: feature2 regs offset in the register block
> 	...

Should we also consider the possibility of vBAR address difference between
the src and dst? It should be impossible that the guest driver mmaps vBAR
again if it has already been loaded.

> Driver accesses registers according to those read-only offsets.
> 
> A FW upgrade may lead to change of offsets but functions stay the
> same. An instance created on an old version can be migrated to
> the new version as long as accesses to old offsets are trapped
> and routed to the new offsets.
> 
> Do we envision things like above in the variant driver or in VMM?
> 
>>>>
>>>> What are you actually proposing?
>>>
>>> Okay, what I'm thinking about is a text file that describes the vPCI
>>> function configuration space to create. The community will standardize
>>> this and VMMs will have to implement to get PASID/etc. Maybe the
>>> community will provide a BSD licensed library to do this job.
>>>
>>> The text file allows the operator to specify exactly the configuration
>>> space the VFIO function should have. It would not be derived
>>> automatically from physical. AFAIK qemu does not have this capability
>>> currently.
>>>
>>> This reflects my observation and discussions around the live migration
>>> standardization. I belive we are fast reaching a point where this is
>>> required.
>>>
>>> Consider standards based migration between wildly different
>>> devices. The devices will not standardize their physical config space,
>>> but an operator could generate a consistent vPCI config space that
>>> works with all the devices in their fleet.
> 
> It's hard to believe that 'wildly different' devices only have difference
> in the layout of vPCI config space.
> 
>>>
>>> Consider the usual working model of the large operators - they define
>>> instance types with some regularity. But an instance type is fixed in
>>> concrete once it is specified, things like the vPCI config space are
>>> fixed.
>>>
>>> Running Instance A on newer hardware with a changed physical config
>>> space should continue to present Instance A's vPCI config layout
>>> regardless. Ie Instance A might not support PASID but Instance B can
>>> run on newer HW that does. The config space layout depends on the
>>> requested Instance Type, not the physical layout.
>>>
>>> The auto-configuration of the config layout from physical is a nice
>>> feature and is excellent for development/small scale, but it shouldn't
>>> be the only way to work.
>>>
>>> So - if we accept that text file configuration should be something the
>>> VMM supports then let's reconsider how to solve the PASID problem.
>>>
>>> I'd say the way to solve it should be via a text file specifying a
>>> full config space layout that includes the PASID cap. From the VMM
>>> perspective this works fine, and it ports to every VMM directly via
>>> processing the text file.
>>>
>>> The autoconfiguration use case can be done by making a tool build the
>>> text file by deriving it from physical, much like today. The single
>>> instance of that tool could have device specific knowledge to avoid
>>> quirks. This way the smarts can still be shared by all the VMMs
>>> without going into the kernel. Special devices with hidden config
>>> space could get special quirks or special reference text files into
>>> the tool repo.
>>>
>>> Serious operators doing production SRIOV/etc would negotiate the text
>>> file with the HW vendors when they define their Instance Type. Ideally
>>> these reference text files would be contributed to the tool repo
>>> above. I think there would be some nice idea to define fully open
>>> source Instance Types that include VFIO devices too.
>>
>> Regarding "if we accept that text file configuration should be
>> something the VMM supports", I'm not on board with this yet, so
>> applying it to PASID discussion seems premature.
>>
>> We've developed variant drivers specifically to host the device specific
>> aspects of migration support.  The requirement of a consistent config
>> space layout is a problem that only exists relative to migration.  This
>> is an issue that I would have considered the responsibility of the
>> variant driver, which would likely expect a consistent interface from
>> the hardware/firmware.  Why does hostile firmware suddenly make it the
>> VMM's problem to provide a consistent ABI to the config space of the
>> device rather than the variant driver?
>>
>> Obviously config maps are something that a VMM could do, but it also
>> seems to impose a non-trivial burden that every VMM requires an
>> implementation of a config space map and integration for each device
>> rather than simply expecting the exposed config space of the device to
>> be part of the migration ABI.  Also this solution specifically only
>> addresses config space compatibility without considering the more
>> generic issue that a variant driver can expose different device
>> personas.  A versioned persona and config space virtualization in the
>> variant driver is a much more flexible solution.  Thanks,
>>
> 
> and looks this community lacks of a clear criteria on what burden
> should be put in the kernel vs. in the VMM.
> 
> e.g. in earlier nvgrace-gpu discussion a major open was whether
> the PCI bar emulation should be done by the variant driver or
> by the VMM (with variant driver providing a device feature).
> 
> It ends up to be in the variant driver with one major argument
> that doing so avoids the burden in various VMMs.
> 
> But now seems the 'text-file' proposal heads the opposite direction?
> 
> btw while this discussion may continue some time, I wonder whether
> this vPASID reporting open can be handled separately from the
> pasid attach/detach series so we can move the ball and merge
> something already in agreement. anyway it's just a read-only cap so
> won't affect how VFIO/IOMMUFD handles the pasid related requests.
> 
> Thanks
> Kevin
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-28  6:19                                       ` Tian, Kevin
  2024-04-29  7:43                                         ` Yi Liu
@ 2024-04-29 17:15                                         ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-29 17:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Sun, Apr 28, 2024 at 06:19:29AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 27, 2024 4:14 AM
> > 
> > On Fri, 26 Apr 2024 11:11:17 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > >
> > > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > > Typically the config space layout is a reflection of hardware whether
> > > > the device supports migration or not.
> > >
> > > Er, all our HW has FW constructed config space. It changes with FW
> > > upgrades. We change it during the life of the product. This has to be
> > > considered..
> > 
> > So as I understand it, the concern is that you have firmware that
> > supports migration, but it also openly hostile to the fundamental
> > aspects of exposing a stable device ABI in support of migration.
> > 
> > > > If a driver were to insert a
> > > > virtual capability, then yes it would want to be consistent about it if
> > > > it also cares about migration.  If the driver needs to change the
> > > > location of a virtual capability, problems will arise, but that's also
> > > > not something that every driver needs to do.
> > >
> > > Well, mlx5 has to cope with this. It supports so many devices with so
> > > many config space layouts :( I don't know if we can just hard wire an
> > > offset to stick in a PASID cap and expect that to work...
> 
> Are those config space layout differences usually also coming with
> mmio-side interface change? 

Not really

> If yes there are more to handle for
> running V1 instance on V2 device and it'd make sense to manage
> everything about compatibility in one place.

It's complicated :|

The config space layout can't change once the device is discovered by
the OS and I have a feeling it can't differ between VFs in a SRIOV.

So even if we did say that a device HW/FW will change it's personality
on the fly, I'm not sure we actually can and still conform to the
PCI specs?

> If we pursue the direction deciding the vconfig layout in VMM, does
> it imply that anything related to mmio layout would also be put in
> VMM too?

I'd guess no, MMIO doesn't have the OS and standards entanglements. If
the FW can reprogram the MMIO layout then when it loads the migration
state, or provisions the device, it should put the MMIO to the right
configuration.

> Do we envision things like above in the variant driver or in VMM?

In the device itself. I think it would be an extreme case for someone
to make a device that had a different MMIO layout that could be
backwards compatible to an old layout and NOT provide a way for the
device HW to go back to the old layout directly. If so that device
would have to mediate MMIO in the VMM or kernel. Depends on the
complexity I suppose where to do it.

> > > Consider standards based migration between wildly different
> > > devices. The devices will not standardize their physical config space,
> > > but an operator could generate a consistent vPCI config space that
> > > works with all the devices in their fleet.
> 
> It's hard to believe that 'wildly different' devices only have difference
> in the layout of vPCI config space. 

Well, I mean a device from vendor A and another device from vendor
B. They don't have anything in common except implementing the
standard.

The standard perscribes the MMIO layout. Both devices have a way to
accept a migration stream defined by a standard. The MMIO will be
adjusted as the standard requires after loading the migration stream.

Config space primiarily remains a problem in this imagined world.

My best view is that it needs to be solved by config space synthesis
in the hypervisor and not via HW/FW on the physical device.

> and looks this community lacks of a clear criteria on what burden
> should be put in the kernel vs. in the VMM.

Right, we've always discussed this. I've fairly consistently wanted to
push things to userspace, principally for security.

> e.g. in earlier nvgrace-gpu discussion a major open was whether
> the PCI bar emulation should be done by the variant driver or
> by the VMM (with variant driver providing a device feature).

Sort of, this was actually also about config space synthesis. If we
had this kind of scheme then perhaps the argument would be different.

> It ends up to be in the variant driver with one major argument
> that doing so avoids the burden in various VMMs.

Yes, that has always been the argument that has won out.

> btw while this discussion may continue some time, I wonder whether
> this vPASID reporting open can be handled separately from the
> pasid attach/detach series so we can move the ball and merge
> something already in agreement. anyway it's just a read-only cap so
> won't affect how VFIO/IOMMUFD handles the pasid related requests.

Yes, this makes sense to me. Ultimately we all agree the config space
will have to be synthesized.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 20:13                                     ` Alex Williamson
  2024-04-28  6:19                                       ` Tian, Kevin
@ 2024-04-29 17:44                                       ` Jason Gunthorpe
  2024-07-18 13:02                                         ` Yi Liu
  2024-07-24  2:26                                         ` Tian, Kevin
  1 sibling, 2 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-04-29 17:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > 
> > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > Typically the config space layout is a reflection of hardware whether
> > > the device supports migration or not.  
> > 
> > Er, all our HW has FW constructed config space. It changes with FW
> > upgrades. We change it during the life of the product. This has to be
> > considered..
> 
> So as I understand it, the concern is that you have firmware that
> supports migration, but it also openly hostile to the fundamental
> aspects of exposing a stable device ABI in support of migration.

Well, that makes it sound rude, but yes that is part of it.

mlx5 is tremendously FW defined. The FW can only cope with migration
in some limited cases today. Making that compatability bigger is
ongoing work.

Config space is one of the areas that has not been addressed.
Currently things are such that the FW won't support migration in
combinations that have different physical config space - so it is not
a problem.

But, in principle, it is an issue. AFAIK, the only complete solution
is for the hypervisor to fully synthesize a stable config space.

So, if we keep this in the kernel then I'd imagine the kernel will
need to grow some shared infrastructure to fully synthezise the config
space - not text file based, but basically the same as what I
described for the VMM.

> > But that won't be true if the kernel is making decisions. The config
> > space layout depends now on the kernel driver version too.
> 
> But in the cases where we support migration there's a device specific
> variant driver that supports that migration.  It's the job of that
> variant driver to not only export and import the device state, but also
> to provide a consistent ABI to the user, which includes the config
> space layout. 

Yes, we could do that, but I'm not sure how it will work in all cases.

> I don't understand why we'd say the device programming ABI itself
> falls within the purview of the device/variant driver, but PCI
> config space is defined by device specific code at a higher level.

The "device programming ABI" doesn't contain any *policy*. The layout
of the config space is like 50% policy. Especially when we start to
talk about standards defined migration. The standards will set the
"device programming ABI" and maybe even specify the migration
stream. They will not, and arguably can not, specify the config space.

Config space layout is substantially policy of the instance type. Even
little things like the vendor IDs can be meaningfully replaced in VMs.

> Regarding "if we accept that text file configuration should be
> something the VMM supports", I'm not on board with this yet, so
> applying it to PASID discussion seems premature.

Sure, I'm just explaining a way this could all fit together.

> We've developed variant drivers specifically to host the device specific
> aspects of migration support.  The requirement of a consistent config
> space layout is a problem that only exists relative to migration.  

Well, I wouldn't go quite so far. Arguably even non-migritable
instance types may want to adjust thier config space. Eg if I'm using
a DPU and I get a NVMe/Virtio PCI function I may want to scrub out
details from the config space to make it more general. Even without
migration.

This already happens today in places like VDPA which completely
replace the underlying config space in some cases.

I see it as a difference from a world of highly constrained "instance
types" and a more ad hoc world.

> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?

It is not "hostile firmware"! It accepting that a significant aspect
of the config layout is actually policy.

Plus the standards limitations that mean we can't change the config
space on the fly make it pretty much impossible for the device to
acutally do anything to help here. Software must fix the config space.

> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  

Well, the flip is true to, it is alot of burden on every variant
device driver implement and on the kernel in general to manage config
space policy on behalf of the VMM.

My point is if the VMM is already going to be forced to manage config
space policy for other good reasons, are we sure we want to put a
bunch of stuff in the kernel that sometimes won't be used?

> Also this solution specifically only addresses config space
> compatibility without considering the more generic issue that a
> variant driver can expose different device personas.  A versioned
> persona and config space virtualization in the variant driver is a
> much more flexible solution.

It is addressed, the different personas would have their own text file
maps. The target VMM would have to load the right map. Shared common
code across all the variant drivers.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-29 17:44                                       ` Jason Gunthorpe
@ 2024-07-18 13:02                                         ` Yi Liu
  2024-07-24  2:26                                         ` Tian, Kevin
  1 sibling, 0 replies; 102+ messages in thread
From: Yi Liu @ 2024-07-18 13:02 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater, Vasant Hegde

Hi Alex, Jason,

We didn't get any conclusion on the vPASID cap last time. Shall we
have more alignment on it? +Vasant as well.

On 2024/4/30 01:44, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
>> On Fri, 26 Apr 2024 11:11:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>>>
>>>> This is kind of an absurd example to portray as a ubiquitous problem.
>>>> Typically the config space layout is a reflection of hardware whether
>>>> the device supports migration or not.
>>>
>>> Er, all our HW has FW constructed config space. It changes with FW
>>> upgrades. We change it during the life of the product. This has to be
>>> considered..
>>
>> So as I understand it, the concern is that you have firmware that
>> supports migration, but it also openly hostile to the fundamental
>> aspects of exposing a stable device ABI in support of migration.
> 
> Well, that makes it sound rude, but yes that is part of it.
> 
> mlx5 is tremendously FW defined. The FW can only cope with migration
> in some limited cases today. Making that compatability bigger is
> ongoing work.
> 
> Config space is one of the areas that has not been addressed.
> Currently things are such that the FW won't support migration in
> combinations that have different physical config space - so it is not
> a problem.
> 
> But, in principle, it is an issue. AFAIK, the only complete solution
> is for the hypervisor to fully synthesize a stable config space.
> 
> So, if we keep this in the kernel then I'd imagine the kernel will
> need to grow some shared infrastructure to fully synthezise the config
> space - not text file based, but basically the same as what I
> described for the VMM.
> 
>>> But that won't be true if the kernel is making decisions. The config
>>> space layout depends now on the kernel driver version too.
>>
>> But in the cases where we support migration there's a device specific
>> variant driver that supports that migration.  It's the job of that
>> variant driver to not only export and import the device state, but also
>> to provide a consistent ABI to the user, which includes the config
>> space layout.
> 
> Yes, we could do that, but I'm not sure how it will work in all cases.
> 
>> I don't understand why we'd say the device programming ABI itself
>> falls within the purview of the device/variant driver, but PCI
>> config space is defined by device specific code at a higher level.
> 
> The "device programming ABI" doesn't contain any *policy*. The layout
> of the config space is like 50% policy. Especially when we start to
> talk about standards defined migration. The standards will set the
> "device programming ABI" and maybe even specify the migration
> stream. They will not, and arguably can not, specify the config space.
> 
> Config space layout is substantially policy of the instance type. Even
> little things like the vendor IDs can be meaningfully replaced in VMs.
> 
>> Regarding "if we accept that text file configuration should be
>> something the VMM supports", I'm not on board with this yet, so
>> applying it to PASID discussion seems premature.
> 
> Sure, I'm just explaining a way this could all fit together.
> 
>> We've developed variant drivers specifically to host the device specific
>> aspects of migration support.  The requirement of a consistent config
>> space layout is a problem that only exists relative to migration.
> 
> Well, I wouldn't go quite so far. Arguably even non-migritable
> instance types may want to adjust thier config space. Eg if I'm using
> a DPU and I get a NVMe/Virtio PCI function I may want to scrub out
> details from the config space to make it more general. Even without
> migration.
> 
> This already happens today in places like VDPA which completely
> replace the underlying config space in some cases.
> 
> I see it as a difference from a world of highly constrained "instance
> types" and a more ad hoc world.
> 
>> is an issue that I would have considered the responsibility of the
>> variant driver, which would likely expect a consistent interface from
>> the hardware/firmware.  Why does hostile firmware suddenly make it the
>> VMM's problem to provide a consistent ABI to the config space of the
>> device rather than the variant driver?
> 
> It is not "hostile firmware"! It accepting that a significant aspect
> of the config layout is actually policy.
> 
> Plus the standards limitations that mean we can't change the config
> space on the fly make it pretty much impossible for the device to
> acutally do anything to help here. Software must fix the config space.
> 
>> Obviously config maps are something that a VMM could do, but it also
>> seems to impose a non-trivial burden that every VMM requires an
>> implementation of a config space map and integration for each device
>> rather than simply expecting the exposed config space of the device to
>> be part of the migration ABI.
> 
> Well, the flip is true to, it is alot of burden on every variant
> device driver implement and on the kernel in general to manage config
> space policy on behalf of the VMM.
> 
> My point is if the VMM is already going to be forced to manage config
> space policy for other good reasons, are we sure we want to put a
> bunch of stuff in the kernel that sometimes won't be used?
> 
>> Also this solution specifically only addresses config space
>> compatibility without considering the more generic issue that a
>> variant driver can expose different device personas.  A versioned
>> persona and config space virtualization in the variant driver is a
>> much more flexible solution.
> 
> It is addressed, the different personas would have their own text file
> maps. The target VMM would have to load the right map. Shared common
> code across all the variant drivers.
> 
> Jason

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-29 17:44                                       ` Jason Gunthorpe
  2024-07-18 13:02                                         ` Yi Liu
@ 2024-07-24  2:26                                         ` Tian, Kevin
  2024-07-30 17:35                                           ` Alex Williamson
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-07-24  2:26 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 30, 2024 1:45 AM
> 
> On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> > Regarding "if we accept that text file configuration should be
> > something the VMM supports", I'm not on board with this yet, so
> > applying it to PASID discussion seems premature.
> 
> Sure, I'm just explaining a way this could all fit together.
> 

Thinking more along this direction.

I'm not sure how long it will take to standardize such text files and
share them across VMMs. We may need a way to move in steps in
Qemu to unblock the kernel development toward that end goal, e.g.
first accepting a pasid option plus user-specified offset (if offset
unspecified then auto-pick one in cap holes). Later when the text
file is ready then such per-cap options can be deprecated.

This simple way won't fix the migration issue, but at least it's on
par with physical caps (i.e. fail the migration if offset mismatched
between dest/src) and both will be fixed when the text file model
is ready.

Then look at what uAPI is required to report the vPASID cap.

In earlier discussion it's leaning toward extending GET_HW_INFO
in iommufd given both iommu/pci support are required to get
PASID working and iommu driver will not report such support until
pasid has been enabled in both iommu/pci. With that there is no
need to further report PASID in vfio-pci.

But there may be other caps which are shared between VF and
PF while having nothing to do with the iommu. e.g. the Device
Serial Number extended cap (permitted but not recommended
in VF). If there is a need to report such cap on VF which doesn't
implement it to userspace, a vfio uAPI (device_feature or a new
one dedicated to synthetical vcap) appears to be inevitable.

So I wonder whether we leave this part untouched until a real
demand comes or use vpasid to formalize that uAPI to be forward
looking. If in the end such uAPI will exist then it's a bit weird to
have PASID escaped (especially when vfio-pci already reports
PRI/ATS  which have iommu dependency too in vconfig).

In concept the Qemu logic will be clearer if any PCI caps (real
or synthesized) is always conveyed via vfio-pci while iommufd is
for identifying a viommu cap.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-07-24  2:26                                         ` Tian, Kevin
@ 2024-07-30 17:35                                           ` Alex Williamson
  2024-07-31  5:15                                             ` Tian, Kevin
  2024-08-06 14:30                                             ` Jason Gunthorpe
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-07-30 17:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Wed, 24 Jul 2024 02:26:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 30, 2024 1:45 AM
> > 
> > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:  
> > > Regarding "if we accept that text file configuration should be
> > > something the VMM supports", I'm not on board with this yet, so
> > > applying it to PASID discussion seems premature.  
> > 
> > Sure, I'm just explaining a way this could all fit together.
> >   
> 
> Thinking more along this direction.
> 
> I'm not sure how long it will take to standardize such text files and
> share them across VMMs. We may need a way to move in steps in
> Qemu to unblock the kernel development toward that end goal, e.g.
> first accepting a pasid option plus user-specified offset (if offset
> unspecified then auto-pick one in cap holes). Later when the text
> file is ready then such per-cap options can be deprecated.

Planned obsolescence is a hard sell.

> This simple way won't fix the migration issue, but at least it's on
> par with physical caps (i.e. fail the migration if offset mismatched
> between dest/src) and both will be fixed when the text file model
> is ready.
> 
> Then look at what uAPI is required to report the vPASID cap.
> 
> In earlier discussion it's leaning toward extending GET_HW_INFO
> in iommufd given both iommu/pci support are required to get
> PASID working and iommu driver will not report such support until
> pasid has been enabled in both iommu/pci. With that there is no
> need to further report PASID in vfio-pci.
> 
> But there may be other caps which are shared between VF and
> PF while having nothing to do with the iommu. e.g. the Device
> Serial Number extended cap (permitted but not recommended
> in VF). If there is a need to report such cap on VF which doesn't
> implement it to userspace, a vfio uAPI (device_feature or a new
> one dedicated to synthetical vcap) appears to be inevitable.
> 
> So I wonder whether we leave this part untouched until a real
> demand comes or use vpasid to formalize that uAPI to be forward
> looking. If in the end such uAPI will exist then it's a bit weird to
> have PASID escaped (especially when vfio-pci already reports
> PRI/ATS  which have iommu dependency too in vconfig).
> 
> In concept the Qemu logic will be clearer if any PCI caps (real
> or synthesized) is always conveyed via vfio-pci while iommufd is
> for identifying a viommu cap.

There are so many moving pieces here and the discussion trailed off a
long time ago.  I have trouble keeping all the relevant considerations
in my head, so let me try to enumerate them, please correct/add.

 - The PASID capability cannot be implemented on VFs per the PCIe spec.
   All VFs share the PF PASID configuration.  This also implies that
   the VF PASID capability is essentially emulated since the VF driver
   cannot manipulate the PF PASID directly.

 - VFIO does not currently expose the PASID capability for PFs, nor
   does anything construct a vPASID capability for VFs.

 - The PASID capability is only useful in combination with a vIOMMU
   with PASID support, which does not yet exist in QEMU.

 - Some devices are known to place registers in configuration space,
   outside of the capability chains, which historically makes it
   difficult to place a purely virtual capability without potentially
   masking such hidden registers.  Current virtual capabilities are
   placed at vendor defined fixed locations to avoid conflicts.

 - There is some expectation that otherwise compatible devices may
   not present identical capability chains, for example devices running
   different firmware or devices from different vendors implementing a
   standard register ABI (virtio) where capability chain layout is not
   standardized.

 - There have been arguments that the layout of device capabilities is
   a policy choice, where both the kernel and libvirt traditionally try
   to avoid making policy decisions.

 - Seamless live migration of devices requires that configuration space
   remains at least consistent, if not identical for much of it.
   Capability offsets cannot change during live migration.  This leads
   to the text file reference above, which is essentially just the
   notion that if the VMM defines the capability layout in config
   space, it would need to do so via a static reference, independent of
   the layout of the physical device and we might want to share that
   among multiple VMMs.

 - For a vfio-pci device to support live migration it must be enabled
   to do so by a vfio-pci variant driver.

 - We've discussed in the community and seem to have a consensus that a
   DVSEC (Designated Vendor Specific Extended Capability) could be
   defined to describe unused configuration space.  Such a DVSEC could
   be implemented natively by the device or supplied by a vfio-pci
   variant driver.  There is currently no definition of such a DVSEC.

So what are we trying to accomplish here.  PASID is the first
non-device specific virtual capability that we'd like to insert into
the VM view of the capability chain.  It won't be the last.

 - Do we push the policy of defining the capability offset to the user?

 - Do we do some hand waving that devices supporting PASID shouldn't
   have hidden registers and therefore the VMM can simply find a gap?

 - Do we ask the hardware vendor or variant driver to insert a DVSEC to
   identify available config space?

 - Do we handle this as just another device quirk, where we maintain a
   table of supported devices and vPASID offset for each?

 - Do we consider this an inflection point where the VMM entirely takes
   over the layout of the capability spaces to impose a stable
   migration layout?  On what basis do we apply that inflection?

 - Also, do we require the same policy for both standard and extended
   capability chains?

I understand the desire to make some progress, but QEMU relies on
integration with management tools, so a temporary option for a user to
specify a PASID offset in isolation sounds like a non-starter to me.

This might be a better sell if the user interface allowed fully
defining the capability chain layout from the command line and this
interface would continue to exist and supersede how the VMM might
otherwise define the capability chain when used.  A fully user defined
layout would be complicated though, so I think there would still be a
desire for QEMU to consume or define a consistent policy itself.

Even if QEMU defines the layout for a device, there may be multiple
versions of that device.  For example, maybe we just add PASID now, but
at some point we decide that we do want to replicate the PF serial
number capability.  At that point we have versions of the device which
would need to be tied to versions of the machine and maybe also
selected via a profile switch on the device command line.

If we want to simplify this, maybe we do just look at whether the
vIOMMU is configured for PASID support and if the device supports it,
then we just look for a gap and add the capability.  If we end up with
different results between source and target for migration, then
migration will fail.  Possibly we end up with a quirk table to override
the default placement of specific capabilities on specific devices.
That might evolve into a lookup for where we place all capabilities,
which essentially turns into the "file" where the VMM defines the entire
layout for some devices.

This is already TL;DR, so I'll end with that before I further drowned
the possibility of discussion.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-07-30 17:35                                           ` Alex Williamson
@ 2024-07-31  5:15                                             ` Tian, Kevin
  2024-07-31 17:04                                               ` Alex Williamson
  2024-08-06 14:30                                             ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-07-31  5:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, July 31, 2024 1:35 AM
> 
> On Wed, 24 Jul 2024 02:26:20 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 30, 2024 1:45 AM
> > >
> > > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> > > > Regarding "if we accept that text file configuration should be
> > > > something the VMM supports", I'm not on board with this yet, so
> > > > applying it to PASID discussion seems premature.
> > >
> > > Sure, I'm just explaining a way this could all fit together.
> > >
> >
> > Thinking more along this direction.
> >
> > I'm not sure how long it will take to standardize such text files and
> > share them across VMMs. We may need a way to move in steps in
> > Qemu to unblock the kernel development toward that end goal, e.g.
> > first accepting a pasid option plus user-specified offset (if offset
> > unspecified then auto-pick one in cap holes). Later when the text
> > file is ready then such per-cap options can be deprecated.
> 
> Planned obsolescence is a hard sell.
> 
> > This simple way won't fix the migration issue, but at least it's on
> > par with physical caps (i.e. fail the migration if offset mismatched
> > between dest/src) and both will be fixed when the text file model
> > is ready.
> >
> > Then look at what uAPI is required to report the vPASID cap.
> >
> > In earlier discussion it's leaning toward extending GET_HW_INFO
> > in iommufd given both iommu/pci support are required to get
> > PASID working and iommu driver will not report such support until
> > pasid has been enabled in both iommu/pci. With that there is no
> > need to further report PASID in vfio-pci.
> >
> > But there may be other caps which are shared between VF and
> > PF while having nothing to do with the iommu. e.g. the Device
> > Serial Number extended cap (permitted but not recommended
> > in VF). If there is a need to report such cap on VF which doesn't
> > implement it to userspace, a vfio uAPI (device_feature or a new
> > one dedicated to synthetical vcap) appears to be inevitable.
> >
> > So I wonder whether we leave this part untouched until a real
> > demand comes or use vpasid to formalize that uAPI to be forward
> > looking. If in the end such uAPI will exist then it's a bit weird to
> > have PASID escaped (especially when vfio-pci already reports
> > PRI/ATS  which have iommu dependency too in vconfig).
> >
> > In concept the Qemu logic will be clearer if any PCI caps (real
> > or synthesized) is always conveyed via vfio-pci while iommufd is
> > for identifying a viommu cap.
> 
> There are so many moving pieces here and the discussion trailed off a
> long time ago.  I have trouble keeping all the relevant considerations
> in my head, so let me try to enumerate them, please correct/add.

Thanks for the summary!

> 
>  - The PASID capability cannot be implemented on VFs per the PCIe spec.
>    All VFs share the PF PASID configuration.  This also implies that
>    the VF PASID capability is essentially emulated since the VF driver
>    cannot manipulate the PF PASID directly.
> 
>  - VFIO does not currently expose the PASID capability for PFs, nor
>    does anything construct a vPASID capability for VFs.
> 
>  - The PASID capability is only useful in combination with a vIOMMU
>    with PASID support, which does not yet exist in QEMU.
> 
>  - Some devices are known to place registers in configuration space,
>    outside of the capability chains, which historically makes it
>    difficult to place a purely virtual capability without potentially
>    masking such hidden registers.  Current virtual capabilities are
>    placed at vendor defined fixed locations to avoid conflicts.
> 
>  - There is some expectation that otherwise compatible devices may
>    not present identical capability chains, for example devices running
>    different firmware or devices from different vendors implementing a
>    standard register ABI (virtio) where capability chain layout is not
>    standardized.
> 
>  - There have been arguments that the layout of device capabilities is
>    a policy choice, where both the kernel and libvirt traditionally try
>    to avoid making policy decisions.
> 
>  - Seamless live migration of devices requires that configuration space
>    remains at least consistent, if not identical for much of it.

I didn't quite get it. I thought being consistent means fully identical
config space from guest p.o.v.

>    Capability offsets cannot change during live migration.  This leads
>    to the text file reference above, which is essentially just the
>    notion that if the VMM defines the capability layout in config
>    space, it would need to do so via a static reference, independent of
>    the layout of the physical device and we might want to share that
>    among multiple VMMs.
> 
>  - For a vfio-pci device to support live migration it must be enabled
>    to do so by a vfio-pci variant driver.
> 
>  - We've discussed in the community and seem to have a consensus that a
>    DVSEC (Designated Vendor Specific Extended Capability) could be
>    defined to describe unused configuration space.  Such a DVSEC could
>    be implemented natively by the device or supplied by a vfio-pci
>    variant driver.  There is currently no definition of such a DVSEC.

I'm not sure whether DVSEC is still that necessary if the direction is
to go userspace-defined layout. In a synthetic world the unused
physical space doesn't really matter.

So this consensus IMHO was better placed under the umbrella of
the other direction having the kernel define the layout.

> 
> So what are we trying to accomplish here.  PASID is the first
> non-device specific virtual capability that we'd like to insert into
> the VM view of the capability chain.  It won't be the last.
> 
>  - Do we push the policy of defining the capability offset to the user?

Looks yes as I didn't see a strong argument for the opposite way.

> 
>  - Do we do some hand waving that devices supporting PASID shouldn't
>    have hidden registers and therefore the VMM can simply find a gap?

I assume 'handwaving' doesn't mean any measure in code to actually
block those devices (as doing so likely requires certain denylist based on
device/vendor ID but then why not going a step further to also hard
code an offset?). It's more a try-and-fail model where vPASID is opted
in via a cmdline parameter then a device with hidden registers may
misbehave if the VMM happens to find a conflict gap. And the impact
is restricted only to a new setup where the user is interested in
PASID  to opt hence can afford diagnostics effort to figure out the restriction.

> 
>  - Do we ask the hardware vendor or variant driver to insert a DVSEC to
>    identify available config space?

As said I don't think it's necessary if leaving the policy to the user

> 
>  - Do we handle this as just another device quirk, where we maintain a
>    table of supported devices and vPASID offset for each?
> 
>  - Do we consider this an inflection point where the VMM entirely takes
>    over the layout of the capability spaces to impose a stable
>    migration layout?  On what basis do we apply that inflection?
> 
>  - Also, do we require the same policy for both standard and extended
>    capability chains?

suppose yes.

> 
> I understand the desire to make some progress, but QEMU relies on
> integration with management tools, so a temporary option for a user to
> specify a PASID offset in isolation sounds like a non-starter to me.
> 
> This might be a better sell if the user interface allowed fully
> defining the capability chain layout from the command line and this
> interface would continue to exist and supersede how the VMM might
> otherwise define the capability chain when used.  A fully user defined
> layout would be complicated though, so I think there would still be a
> desire for QEMU to consume or define a consistent policy itself.
> 
> Even if QEMU defines the layout for a device, there may be multiple
> versions of that device.  For example, maybe we just add PASID now, but
> at some point we decide that we do want to replicate the PF serial
> number capability.  At that point we have versions of the device which
> would need to be tied to versions of the machine and maybe also
> selected via a profile switch on the device command line.
> 
> If we want to simplify this, maybe we do just look at whether the
> vIOMMU is configured for PASID support and if the device supports it,

and this is related to the open which I raised in last mail - whether we
want to report the PASID support both in iommufd and vfio-pci uAPI.

My impression is yes as there may be requirement of exposing a virtual
capability which doesn't rely on the IOMMU.

> then we just look for a gap and add the capability.  If we end up with
> different results between source and target for migration, then
> migration will fail.  Possibly we end up with a quirk table to override
> the default placement of specific capabilities on specific devices.

emm how does a quirk table work with devices having volatile config
space layout cross FW versions? Can VMM assigned with a VF be able
to check the FW version of the PF?

> That might evolve into a lookup for where we place all capabilities,
> which essentially turns into the "file" where the VMM defines the entire
> layout for some devices.

Overall this sounds a feasible path to move forward - starting with
the VMM to find the gap automatically if a new PASID option is
opted in. Devices with hidden registers may fail. Devices with volatile
config space due to FW upgrade or cross vendors may fail to migrate.
Then evolving it to the file-based scheme, and there is time to discuss
any intermediate improvement (fixed quirks, cmdline offset, etc.) in
between.

> 
> This is already TL;DR, so I'll end with that before I further drowned
> the possibility of discussion.  Thanks,
> 
> Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-07-31  5:15                                             ` Tian, Kevin
@ 2024-07-31 17:04                                               ` Alex Williamson
  2024-08-01  7:45                                                 ` Tian, Kevin
  2024-08-06 13:54                                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-07-31 17:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Wed, 31 Jul 2024 05:15:25 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, July 31, 2024 1:35 AM
> > 
> > On Wed, 24 Jul 2024 02:26:20 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, April 30, 2024 1:45 AM
> > > >
> > > > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:  
> > > > > Regarding "if we accept that text file configuration should be
> > > > > something the VMM supports", I'm not on board with this yet, so
> > > > > applying it to PASID discussion seems premature.  
> > > >
> > > > Sure, I'm just explaining a way this could all fit together.
> > > >  
> > >
> > > Thinking more along this direction.
> > >
> > > I'm not sure how long it will take to standardize such text files and
> > > share them across VMMs. We may need a way to move in steps in
> > > Qemu to unblock the kernel development toward that end goal, e.g.
> > > first accepting a pasid option plus user-specified offset (if offset
> > > unspecified then auto-pick one in cap holes). Later when the text
> > > file is ready then such per-cap options can be deprecated.  
> > 
> > Planned obsolescence is a hard sell.
> >   
> > > This simple way won't fix the migration issue, but at least it's on
> > > par with physical caps (i.e. fail the migration if offset mismatched
> > > between dest/src) and both will be fixed when the text file model
> > > is ready.
> > >
> > > Then look at what uAPI is required to report the vPASID cap.
> > >
> > > In earlier discussion it's leaning toward extending GET_HW_INFO
> > > in iommufd given both iommu/pci support are required to get
> > > PASID working and iommu driver will not report such support until
> > > pasid has been enabled in both iommu/pci. With that there is no
> > > need to further report PASID in vfio-pci.
> > >
> > > But there may be other caps which are shared between VF and
> > > PF while having nothing to do with the iommu. e.g. the Device
> > > Serial Number extended cap (permitted but not recommended
> > > in VF). If there is a need to report such cap on VF which doesn't
> > > implement it to userspace, a vfio uAPI (device_feature or a new
> > > one dedicated to synthetical vcap) appears to be inevitable.
> > >
> > > So I wonder whether we leave this part untouched until a real
> > > demand comes or use vpasid to formalize that uAPI to be forward
> > > looking. If in the end such uAPI will exist then it's a bit weird to
> > > have PASID escaped (especially when vfio-pci already reports
> > > PRI/ATS  which have iommu dependency too in vconfig).
> > >
> > > In concept the Qemu logic will be clearer if any PCI caps (real
> > > or synthesized) is always conveyed via vfio-pci while iommufd is
> > > for identifying a viommu cap.  
> > 
> > There are so many moving pieces here and the discussion trailed off a
> > long time ago.  I have trouble keeping all the relevant considerations
> > in my head, so let me try to enumerate them, please correct/add.  
> 
> Thanks for the summary!
> 
> > 
> >  - The PASID capability cannot be implemented on VFs per the PCIe spec.
> >    All VFs share the PF PASID configuration.  This also implies that
> >    the VF PASID capability is essentially emulated since the VF driver
> >    cannot manipulate the PF PASID directly.
> > 
> >  - VFIO does not currently expose the PASID capability for PFs, nor
> >    does anything construct a vPASID capability for VFs.
> > 
> >  - The PASID capability is only useful in combination with a vIOMMU
> >    with PASID support, which does not yet exist in QEMU.
> > 
> >  - Some devices are known to place registers in configuration space,
> >    outside of the capability chains, which historically makes it
> >    difficult to place a purely virtual capability without potentially
> >    masking such hidden registers.  Current virtual capabilities are
> >    placed at vendor defined fixed locations to avoid conflicts.
> > 
> >  - There is some expectation that otherwise compatible devices may
> >    not present identical capability chains, for example devices running
> >    different firmware or devices from different vendors implementing a
> >    standard register ABI (virtio) where capability chain layout is not
> >    standardized.
> > 
> >  - There have been arguments that the layout of device capabilities is
> >    a policy choice, where both the kernel and libvirt traditionally try
> >    to avoid making policy decisions.
> > 
> >  - Seamless live migration of devices requires that configuration space
> >    remains at least consistent, if not identical for much of it.  
> 
> I didn't quite get it. I thought being consistent means fully identical
> config space from guest p.o.v.

See for example:

https://gitlab.com/qemu-project/qemu/-/commit/187716feeba406b5a3879db66a7bafd687472a1f 

The layout of config space and most of the contents therein need to be
identical, but there are arguably elements that could be volatile which
only need to be consistent.

> >    Capability offsets cannot change during live migration.  This leads
> >    to the text file reference above, which is essentially just the
> >    notion that if the VMM defines the capability layout in config
> >    space, it would need to do so via a static reference, independent of
> >    the layout of the physical device and we might want to share that
> >    among multiple VMMs.
> > 
> >  - For a vfio-pci device to support live migration it must be enabled
> >    to do so by a vfio-pci variant driver.
> > 
> >  - We've discussed in the community and seem to have a consensus that a
> >    DVSEC (Designated Vendor Specific Extended Capability) could be
> >    defined to describe unused configuration space.  Such a DVSEC could
> >    be implemented natively by the device or supplied by a vfio-pci
> >    variant driver.  There is currently no definition of such a DVSEC.  
> 
> I'm not sure whether DVSEC is still that necessary if the direction is
> to go userspace-defined layout. In a synthetic world the unused
> physical space doesn't really matter.
> 
> So this consensus IMHO was better placed under the umbrella of
> the other direction having the kernel define the layout.

I agree that we don't seem to be headed in a direction that requires
this, but I just wanted to include that there was a roughly agreed upon
way for devices and variant drivers to annotate unused config space
ranges for higher levels.  If we head in a direction where the VMM
chooses an offset for the PASID capability, we need to keep track of
whether this DVSEC comes to fruition and how that affects the offset
that QEMU might choose.

> > So what are we trying to accomplish here.  PASID is the first
> > non-device specific virtual capability that we'd like to insert into
> > the VM view of the capability chain.  It won't be the last.
> > 
> >  - Do we push the policy of defining the capability offset to the user?  
> 
> Looks yes as I didn't see a strong argument for the opposite way.

It's a policy choice though, so where and how is it implemented?  It
works fine for those of us willing to edit xml or launch VMs by command
line, but libvirt isn't going to sign up to insert a policy choice for
a device.  If we get to even higher level tools, does anything that
wants to implement PASID support required a vendor operator driver to
make such policy choices (btw, I'm just throwing out the "operator"
term as if I know what it means, I don't).

> >  - Do we do some hand waving that devices supporting PASID shouldn't
> >    have hidden registers and therefore the VMM can simply find a gap?  
> 
> I assume 'handwaving' doesn't mean any measure in code to actually
> block those devices (as doing so likely requires certain denylist based on
> device/vendor ID but then why not going a step further to also hard
> code an offset?). It's more a try-and-fail model where vPASID is opted
> in via a cmdline parameter then a device with hidden registers may
> misbehave if the VMM happens to find a conflict gap. And the impact
> is restricted only to a new setup where the user is interested in
> PASID  to opt hence can afford diagnostics effort to figure out the restriction.

If you want to hard code an offset then we're effectively introducing a
device specific quirk to enable PASID support.  I thought we wanted
this to work generically for any device exposing PASID, therefore I was
thinking more of "find a gap" as the default strategy with quirks used
to augment the resulting offset where necessary.

I'd also be careful about command line parameters.  I think we require
one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
on the vfio-pci device, instead simply enabling support when both the
vIOMMU support is enabled and the device is detected to support it.
Each command line option requires support in the upper level tools to
enable it.

> >  - Do we ask the hardware vendor or variant driver to insert a DVSEC to
> >    identify available config space?  
> 
> As said I don't think it's necessary if leaving the policy to the user

Leaving the policy to the user is essentially just kicking the can down
the road and I don't know where that policy actually gets implemented
in a cloud, production environment.  I think it would align with QEMU
practices if the user could override a default policy on the command
line, but ultimately if we want to keep the policy decision out of the
kernel then the defaults probably need to be implemented in QEMU.

> >  - Do we handle this as just another device quirk, where we maintain a
> >    table of supported devices and vPASID offset for each?
> > 
> >  - Do we consider this an inflection point where the VMM entirely takes
> >    over the layout of the capability spaces to impose a stable
> >    migration layout?  On what basis do we apply that inflection?
> > 
> >  - Also, do we require the same policy for both standard and extended
> >    capability chains?  
> 
> suppose yes.
> 
> > 
> > I understand the desire to make some progress, but QEMU relies on
> > integration with management tools, so a temporary option for a user to
> > specify a PASID offset in isolation sounds like a non-starter to me.
> > 
> > This might be a better sell if the user interface allowed fully
> > defining the capability chain layout from the command line and this
> > interface would continue to exist and supersede how the VMM might
> > otherwise define the capability chain when used.  A fully user defined
> > layout would be complicated though, so I think there would still be a
> > desire for QEMU to consume or define a consistent policy itself.
> > 
> > Even if QEMU defines the layout for a device, there may be multiple
> > versions of that device.  For example, maybe we just add PASID now, but
> > at some point we decide that we do want to replicate the PF serial
> > number capability.  At that point we have versions of the device which
> > would need to be tied to versions of the machine and maybe also
> > selected via a profile switch on the device command line.
> > 
> > If we want to simplify this, maybe we do just look at whether the
> > vIOMMU is configured for PASID support and if the device supports it,  
> 
> and this is related to the open which I raised in last mail - whether we
> want to report the PASID support both in iommufd and vfio-pci uAPI.
> 
> My impression is yes as there may be requirement of exposing a virtual
> capability which doesn't rely on the IOMMU.

What's the purpose of reporting PASID via both iommufd and vfio-pci?  I
agree that there will be capabilities related to the iommufd and
capabilities only related to the device, but I disagree that that
provides justification to report PASID via both uAPIs.  Are we also
going to ask iommufd to report that a device has an optional serial
number capability?  It clearly doesn't make sense for iommufd to be
involved with that, so why does it make sense for vfio-pci to be
involved in reporting something that is more iommufd specific?

> > then we just look for a gap and add the capability.  If we end up with
> > different results between source and target for migration, then
> > migration will fail.  Possibly we end up with a quirk table to override
> > the default placement of specific capabilities on specific devices.  
> 
> emm how does a quirk table work with devices having volatile config
> space layout cross FW versions? Can VMM assigned with a VF be able
> to check the FW version of the PF?

If the VMM can't find the same gap between source and destination then
a quirk could make sure that the PASID offset is consistent.  But also
if the VMM doesn't find the same gap then that suggests the config
space is already different and not only the offset of the PASID
capability will need to be fixed via a quirk, so then we're into
quirking the entire capability space for the device.

The VMM should not be assumed to have any additional privileges beyond
what we provide it through the vfio device and iommufd interface.
Testing anything about the PF would require access on the host that
won't work in more secure environments.  Therefore if we can't
consistently place the PASID for a device, we probably need to quirk it
based on the vendor/device IDs or sub-IDs or we need to rely on a
management implied policy such as a device profile option on the QEMU
command line or maybe different classes of the vfio-pci driver in QEMU.

> > That might evolve into a lookup for where we place all capabilities,
> > which essentially turns into the "file" where the VMM defines the entire
> > layout for some devices.  
> 
> Overall this sounds a feasible path to move forward - starting with
> the VMM to find the gap automatically if a new PASID option is
> opted in. Devices with hidden registers may fail. Devices with volatile
> config space due to FW upgrade or cross vendors may fail to migrate.
> Then evolving it to the file-based scheme, and there is time to discuss
> any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> between.

As above, let's be careful about introducing unnecessary command line
options, especially if we expect support for them in higher level
tools.  If we place the PASID somewhere that makes the device not work,
then disabling PASID on the vIOMMU should resolve that.  It won't be a
regression, it will only be an incompatibility with a new feature.
That incompatibility may require a quirk to resolve to have the PASID
placed somewhere else.  If the PASID is placed at different offsets
based on device firmware or vendor then the location of the PASID alone
isn't the only thing preventing migration and we'll need to introduce
code for the VMM to take ownership of the capability layout at that
point.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-07-31 17:04                                               ` Alex Williamson
@ 2024-08-01  7:45                                                 ` Tian, Kevin
  2024-08-02 18:25                                                   ` Alex Williamson
  2024-08-06 13:54                                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-08-01  7:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, August 1, 2024 1:05 AM
> 
> On Wed, 31 Jul 2024 05:15:25 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, July 31, 2024 1:35 AM
> > >
> > >  - Seamless live migration of devices requires that configuration space
> > >    remains at least consistent, if not identical for much of it.
> >
> > I didn't quite get it. I thought being consistent means fully identical
> > config space from guest p.o.v.
> 
> See for example:
> 
> https://gitlab.com/qemu-project/qemu/-
> /commit/187716feeba406b5a3879db66a7bafd687472a1f

Thanks!

> 
> The layout of config space and most of the contents therein need to be
> identical, but there are arguably elements that could be volatile which
> only need to be consistent.

hmm IMHO it's more that the guest doesn't care volatile content in that
field instead of the guest view being strictly consistent. Probably I
don't really understand the meaning of consistency in this context...

btw that fix claims:

  "
  Here consistency could mean that VSC format should be same on
  source and destination, however actual Vendor Specific Info may
  not be byte-to-byte identical.
  "

Does it apply to all devices supporting VSC? It's OK for NVDIA vGPU
but I'm not sure whether some vendor driver might be sensitive to
byte-to-byte consistency in VSC.

> > >
> > >  - We've discussed in the community and seem to have a consensus that a
> > >    DVSEC (Designated Vendor Specific Extended Capability) could be
> > >    defined to describe unused configuration space.  Such a DVSEC could
> > >    be implemented natively by the device or supplied by a vfio-pci
> > >    variant driver.  There is currently no definition of such a DVSEC.
> >
> > I'm not sure whether DVSEC is still that necessary if the direction is
> > to go userspace-defined layout. In a synthetic world the unused
> > physical space doesn't really matter.
> >
> > So this consensus IMHO was better placed under the umbrella of
> > the other direction having the kernel define the layout.
> 
> I agree that we don't seem to be headed in a direction that requires
> this, but I just wanted to include that there was a roughly agreed upon
> way for devices and variant drivers to annotate unused config space
> ranges for higher levels.  If we head in a direction where the VMM
> chooses an offset for the PASID capability, we need to keep track of
> whether this DVSEC comes to fruition and how that affects the offset
> that QEMU might choose.

Yes. We can keep this option in case there is a demand, especially
if the file-based synthesized scheme won't be built in one day and
we need a default policy in VMM.

> 
> > > So what are we trying to accomplish here.  PASID is the first
> > > non-device specific virtual capability that we'd like to insert into
> > > the VM view of the capability chain.  It won't be the last.
> > >
> > >  - Do we push the policy of defining the capability offset to the user?
> >
> > Looks yes as I didn't see a strong argument for the opposite way.
> 
> It's a policy choice though, so where and how is it implemented?  It
> works fine for those of us willing to edit xml or launch VMs by command
> line, but libvirt isn't going to sign up to insert a policy choice for
> a device.  If we get to even higher level tools, does anything that
> wants to implement PASID support required a vendor operator driver to
> make such policy choices (btw, I'm just throwing out the "operator"
> term as if I know what it means, I don't).

I had a rough feeling that there might be other usages requiring such
vendor plugin, e.g. provisioning VF/ADI may require vendor specific
configurations, but not really an expert in this area.

Overall I feel most of our discussions so far are about VMM-auto-
find-offset vs. file-based-policy-scheme which both belong to
user-defined policy, suggesting that we all agreed to drop the other
way having kernel define the offset (plus in-kernel quirks, etc.)?

Even the said DVSEC is to assist such user-defined direction.

> 
> > >  - Do we do some hand waving that devices supporting PASID shouldn't
> > >    have hidden registers and therefore the VMM can simply find a gap?
> >
> > I assume 'handwaving' doesn't mean any measure in code to actually
> > block those devices (as doing so likely requires certain denylist based on
> > device/vendor ID but then why not going a step further to also hard
> > code an offset?). It's more a try-and-fail model where vPASID is opted
> > in via a cmdline parameter then a device with hidden registers may
> > misbehave if the VMM happens to find a conflict gap. And the impact
> > is restricted only to a new setup where the user is interested in
> > PASID  to opt hence can afford diagnostics effort to figure out the
> restriction.
> 
> If you want to hard code an offset then we're effectively introducing a
> device specific quirk to enable PASID support.  I thought we wanted
> this to work generically for any device exposing PASID, therefore I was
> thinking more of "find a gap" as the default strategy with quirks used
> to augment the resulting offset where necessary.
> 
> I'd also be careful about command line parameters.  I think we require
> one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
> on the vfio-pci device, instead simply enabling support when both the
> vIOMMU support is enabled and the device is detected to support it.
> Each command line option requires support in the upper level tools to
> enable it.

Make sense. btw will there be a requirement that the user wants to
disable PASID even if the device supports it, e.g. for testing purpose
or to workaround a HW errata disclosed after host driver claims the
support in an old kernel?

> > >
> > > I understand the desire to make some progress, but QEMU relies on
> > > integration with management tools, so a temporary option for a user to
> > > specify a PASID offset in isolation sounds like a non-starter to me.
> > >
> > > This might be a better sell if the user interface allowed fully
> > > defining the capability chain layout from the command line and this
> > > interface would continue to exist and supersede how the VMM might
> > > otherwise define the capability chain when used.  A fully user defined
> > > layout would be complicated though, so I think there would still be a
> > > desire for QEMU to consume or define a consistent policy itself.
> > >
> > > Even if QEMU defines the layout for a device, there may be multiple
> > > versions of that device.  For example, maybe we just add PASID now, but
> > > at some point we decide that we do want to replicate the PF serial
> > > number capability.  At that point we have versions of the device which
> > > would need to be tied to versions of the machine and maybe also
> > > selected via a profile switch on the device command line.
> > >
> > > If we want to simplify this, maybe we do just look at whether the
> > > vIOMMU is configured for PASID support and if the device supports it,
> >
> > and this is related to the open which I raised in last mail - whether we
> > want to report the PASID support both in iommufd and vfio-pci uAPI.
> >
> > My impression is yes as there may be requirement of exposing a virtual
> > capability which doesn't rely on the IOMMU.
> 
> What's the purpose of reporting PASID via both iommufd and vfio-pci?  I
> agree that there will be capabilities related to the iommufd and
> capabilities only related to the device, but I disagree that that
> provides justification to report PASID via both uAPIs.  Are we also
> going to ask iommufd to report that a device has an optional serial
> number capability?  It clearly doesn't make sense for iommufd to be

Certainly no. My point was that vfio-pci/iommufd each reports its
own capability set. They may overlap but this fact just matches the
physical world.

> involved with that, so why does it make sense for vfio-pci to be
> involved in reporting something that is more iommufd specific?

It doesn't matter which one involves more. It's more akin to the
physical world.

btw vfio-pci already reports ATS/PRI which both rely on iommufd
in vconfig space. Throwing PASID alone to iommufd uAPI lacks of a
good justification for why it's special.

I envision an extension to vfio device feature or a new vfio uAPI
for reporting virtual capabilities as augment to the ones filled in
vconfig space. 

> 
> > > then we just look for a gap and add the capability.  If we end up with
> > > different results between source and target for migration, then
> > > migration will fail.  Possibly we end up with a quirk table to override
> > > the default placement of specific capabilities on specific devices.
> >
> > emm how does a quirk table work with devices having volatile config
> > space layout cross FW versions? Can VMM assigned with a VF be able
> > to check the FW version of the PF?
> 
> If the VMM can't find the same gap between source and destination then
> a quirk could make sure that the PASID offset is consistent.  But also
> if the VMM doesn't find the same gap then that suggests the config
> space is already different and not only the offset of the PASID
> capability will need to be fixed via a quirk, so then we're into
> quirking the entire capability space for the device.

yes. So the quirk table is more for fixing the functional gap (i.e. not
overlap with a hidden register) instead of for migration. As long as
a device can function correctly with it, the virtual caps fall into the
same restriction as physical caps in migration i.e. upon inconsistent
layout between src/dest we'll need separate way to synthesize the
entire space.

> 
> The VMM should not be assumed to have any additional privileges beyond
> what we provide it through the vfio device and iommufd interface.
> Testing anything about the PF would require access on the host that
> won't work in more secure environments.  Therefore if we can't
> consistently place the PASID for a device, we probably need to quirk it
> based on the vendor/device IDs or sub-IDs or we need to rely on a
> management implied policy such as a device profile option on the QEMU
> command line or maybe different classes of the vfio-pci driver in QEMU.
> 
> > > That might evolve into a lookup for where we place all capabilities,
> > > which essentially turns into the "file" where the VMM defines the entire
> > > layout for some devices.
> >
> > Overall this sounds a feasible path to move forward - starting with
> > the VMM to find the gap automatically if a new PASID option is
> > opted in. Devices with hidden registers may fail. Devices with volatile
> > config space due to FW upgrade or cross vendors may fail to migrate.
> > Then evolving it to the file-based scheme, and there is time to discuss
> > any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> > between.
> 
> As above, let's be careful about introducing unnecessary command line
> options, especially if we expect support for them in higher level
> tools.  If we place the PASID somewhere that makes the device not work,
> then disabling PASID on the vIOMMU should resolve that.  It won't be a

vIOMMU is per-platform then it applies to all devices behind, including
those which don't have a problem with auto-selected offset. Not sure
whether one would want to continue enabling PASID for other devices
or should stop immediately to find a quirk for the problematic one and
then resume.

> regression, it will only be an incompatibility with a new feature.
> That incompatibility may require a quirk to resolve to have the PASID
> placed somewhere else.  If the PASID is placed at different offsets
> based on device firmware or vendor then the location of the PASID alone
> isn't the only thing preventing migration and we'll need to introduce
> code for the VMM to take ownership of the capability layout at that
> point.  Thanks,
> 

Yes, the migration issue might be solved in a separate track as it applies
to both physical and virtual capabilities.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-01  7:45                                                 ` Tian, Kevin
@ 2024-08-02 18:25                                                   ` Alex Williamson
  2024-08-05  5:35                                                     ` Tian, Kevin
  2024-08-06 14:05                                                     ` Jason Gunthorpe
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Williamson @ 2024-08-02 18:25 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Thu, 1 Aug 2024 07:45:43 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, August 1, 2024 1:05 AM
> > 
> > On Wed, 31 Jul 2024 05:15:25 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, July 31, 2024 1:35 AM
> > > >
> > > >  - Seamless live migration of devices requires that configuration space
> > > >    remains at least consistent, if not identical for much of it.  
> > >
> > > I didn't quite get it. I thought being consistent means fully identical
> > > config space from guest p.o.v.  
> > 
> > See for example:
> > 
> > https://gitlab.com/qemu-project/qemu/-
> > /commit/187716feeba406b5a3879db66a7bafd687472a1f  
> 
> Thanks!
> 
> > 
> > The layout of config space and most of the contents therein need to be
> > identical, but there are arguably elements that could be volatile which
> > only need to be consistent.  
> 
> hmm IMHO it's more that the guest doesn't care volatile content in that
> field instead of the guest view being strictly consistent. Probably I
> don't really understand the meaning of consistency in this context...
> 
> btw that fix claims:
> 
>   "
>   Here consistency could mean that VSC format should be same on
>   source and destination, however actual Vendor Specific Info may
>   not be byte-to-byte identical.
>   "
> 
> Does it apply to all devices supporting VSC? It's OK for NVDIA vGPU
> but I'm not sure whether some vendor driver might be sensitive to
> byte-to-byte consistency in VSC.

It applies to all VSC capabilities.  The argument is that there is no
standard definition of the content here therefore QEMU has no authority
to impose a policy on the contents.
 
> > > >
> > > >  - We've discussed in the community and seem to have a consensus that a
> > > >    DVSEC (Designated Vendor Specific Extended Capability) could be
> > > >    defined to describe unused configuration space.  Such a DVSEC could
> > > >    be implemented natively by the device or supplied by a vfio-pci
> > > >    variant driver.  There is currently no definition of such a DVSEC.  
> > >
> > > I'm not sure whether DVSEC is still that necessary if the direction is
> > > to go userspace-defined layout. In a synthetic world the unused
> > > physical space doesn't really matter.
> > >
> > > So this consensus IMHO was better placed under the umbrella of
> > > the other direction having the kernel define the layout.  
> > 
> > I agree that we don't seem to be headed in a direction that requires
> > this, but I just wanted to include that there was a roughly agreed upon
> > way for devices and variant drivers to annotate unused config space
> > ranges for higher levels.  If we head in a direction where the VMM
> > chooses an offset for the PASID capability, we need to keep track of
> > whether this DVSEC comes to fruition and how that affects the offset
> > that QEMU might choose.  
> 
> Yes. We can keep this option in case there is a demand, especially
> if the file-based synthesized scheme won't be built in one day and
> we need a default policy in VMM.
> 
> >   
> > > > So what are we trying to accomplish here.  PASID is the first
> > > > non-device specific virtual capability that we'd like to insert into
> > > > the VM view of the capability chain.  It won't be the last.
> > > >
> > > >  - Do we push the policy of defining the capability offset to the user?  
> > >
> > > Looks yes as I didn't see a strong argument for the opposite way.  
> > 
> > It's a policy choice though, so where and how is it implemented?  It
> > works fine for those of us willing to edit xml or launch VMs by command
> > line, but libvirt isn't going to sign up to insert a policy choice for
> > a device.  If we get to even higher level tools, does anything that
> > wants to implement PASID support required a vendor operator driver to
> > make such policy choices (btw, I'm just throwing out the "operator"
> > term as if I know what it means, I don't).  
> 
> I had a rough feeling that there might be other usages requiring such
> vendor plugin, e.g. provisioning VF/ADI may require vendor specific
> configurations, but not really an expert in this area.
> 
> Overall I feel most of our discussions so far are about VMM-auto-
> find-offset vs. file-based-policy-scheme which both belong to
> user-defined policy, suggesting that we all agreed to drop the other
> way having kernel define the offset (plus in-kernel quirks, etc.)?
> 
> Even the said DVSEC is to assist such user-defined direction.

To me a "user defined policy" is placing an option on the command line
and requiring the user, or some higher level authority representing the
user, to provide the policy.  If it's done by the VMM then we're saying
QEMU owns the policy but it might be overridden by the user via a
command line argument or modifying the policy file consumed by QEMU.
 
> > > >  - Do we do some hand waving that devices supporting PASID shouldn't
> > > >    have hidden registers and therefore the VMM can simply find a gap?  
> > >
> > > I assume 'handwaving' doesn't mean any measure in code to actually
> > > block those devices (as doing so likely requires certain denylist based on
> > > device/vendor ID but then why not going a step further to also hard
> > > code an offset?). It's more a try-and-fail model where vPASID is opted
> > > in via a cmdline parameter then a device with hidden registers may
> > > misbehave if the VMM happens to find a conflict gap. And the impact
> > > is restricted only to a new setup where the user is interested in
> > > PASID  to opt hence can afford diagnostics effort to figure out the  
> > restriction.
> > 
> > If you want to hard code an offset then we're effectively introducing a
> > device specific quirk to enable PASID support.  I thought we wanted
> > this to work generically for any device exposing PASID, therefore I was
> > thinking more of "find a gap" as the default strategy with quirks used
> > to augment the resulting offset where necessary.
> > 
> > I'd also be careful about command line parameters.  I think we require
> > one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
> > on the vfio-pci device, instead simply enabling support when both the
> > vIOMMU support is enabled and the device is detected to support it.
> > Each command line option requires support in the upper level tools to
> > enable it.  
> 
> Make sense. btw will there be a requirement that the user wants to
> disable PASID even if the device supports it, e.g. for testing purpose
> or to workaround a HW errata disclosed after host driver claims the
> support in an old kernel?

We can make use of "experimental options", ie. "x-" prefixed options,
in QEMU for this.  These are useful for debugging but will not be
considered for support by higher level tools nor do we consider them to
have a stable ABI.

> > > > I understand the desire to make some progress, but QEMU relies on
> > > > integration with management tools, so a temporary option for a user to
> > > > specify a PASID offset in isolation sounds like a non-starter to me.
> > > >
> > > > This might be a better sell if the user interface allowed fully
> > > > defining the capability chain layout from the command line and this
> > > > interface would continue to exist and supersede how the VMM might
> > > > otherwise define the capability chain when used.  A fully user defined
> > > > layout would be complicated though, so I think there would still be a
> > > > desire for QEMU to consume or define a consistent policy itself.
> > > >
> > > > Even if QEMU defines the layout for a device, there may be multiple
> > > > versions of that device.  For example, maybe we just add PASID now, but
> > > > at some point we decide that we do want to replicate the PF serial
> > > > number capability.  At that point we have versions of the device which
> > > > would need to be tied to versions of the machine and maybe also
> > > > selected via a profile switch on the device command line.
> > > >
> > > > If we want to simplify this, maybe we do just look at whether the
> > > > vIOMMU is configured for PASID support and if the device supports it,  
> > >
> > > and this is related to the open which I raised in last mail - whether we
> > > want to report the PASID support both in iommufd and vfio-pci uAPI.
> > >
> > > My impression is yes as there may be requirement of exposing a virtual
> > > capability which doesn't rely on the IOMMU.  
> > 
> > What's the purpose of reporting PASID via both iommufd and vfio-pci?  I
> > agree that there will be capabilities related to the iommufd and
> > capabilities only related to the device, but I disagree that that
> > provides justification to report PASID via both uAPIs.  Are we also
> > going to ask iommufd to report that a device has an optional serial
> > number capability?  It clearly doesn't make sense for iommufd to be  
> 
> Certainly no. My point was that vfio-pci/iommufd each reports its
> own capability set. They may overlap but this fact just matches the
> physical world.
> 
> > involved with that, so why does it make sense for vfio-pci to be
> > involved in reporting something that is more iommufd specific?  
> 
> It doesn't matter which one involves more. It's more akin to the
> physical world.
> 
> btw vfio-pci already reports ATS/PRI which both rely on iommufd
> in vconfig space. Throwing PASID alone to iommufd uAPI lacks of a
> good justification for why it's special.
> 
> I envision an extension to vfio device feature or a new vfio uAPI
> for reporting virtual capabilities as augment to the ones filled in
> vconfig space. 

Should ATS and PRI be reported through vfio-pci or should we just turn
them off to be more like PASID?  Maybe the issue simply hasn't arisen
yet because we don't have vIOMMU support and with that support QEMU
might need to filter out those capabilities and look elsewhere.
Anyway, iommufd and vfio-pci should not duplicate each other here.

> > > > then we just look for a gap and add the capability.  If we end up with
> > > > different results between source and target for migration, then
> > > > migration will fail.  Possibly we end up with a quirk table to override
> > > > the default placement of specific capabilities on specific devices.  
> > >
> > > emm how does a quirk table work with devices having volatile config
> > > space layout cross FW versions? Can VMM assigned with a VF be able
> > > to check the FW version of the PF?  
> > 
> > If the VMM can't find the same gap between source and destination then
> > a quirk could make sure that the PASID offset is consistent.  But also
> > if the VMM doesn't find the same gap then that suggests the config
> > space is already different and not only the offset of the PASID
> > capability will need to be fixed via a quirk, so then we're into
> > quirking the entire capability space for the device.  
> 
> yes. So the quirk table is more for fixing the functional gap (i.e. not
> overlap with a hidden register) instead of for migration. As long as
> a device can function correctly with it, the virtual caps fall into the
> same restriction as physical caps in migration i.e. upon inconsistent
> layout between src/dest we'll need separate way to synthesize the
> entire space.

Yes.

> > The VMM should not be assumed to have any additional privileges beyond
> > what we provide it through the vfio device and iommufd interface.
> > Testing anything about the PF would require access on the host that
> > won't work in more secure environments.  Therefore if we can't
> > consistently place the PASID for a device, we probably need to quirk it
> > based on the vendor/device IDs or sub-IDs or we need to rely on a
> > management implied policy such as a device profile option on the QEMU
> > command line or maybe different classes of the vfio-pci driver in QEMU.
> >   
> > > > That might evolve into a lookup for where we place all capabilities,
> > > > which essentially turns into the "file" where the VMM defines the entire
> > > > layout for some devices.  
> > >
> > > Overall this sounds a feasible path to move forward - starting with
> > > the VMM to find the gap automatically if a new PASID option is
> > > opted in. Devices with hidden registers may fail. Devices with volatile
> > > config space due to FW upgrade or cross vendors may fail to migrate.
> > > Then evolving it to the file-based scheme, and there is time to discuss
> > > any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> > > between.  
> > 
> > As above, let's be careful about introducing unnecessary command line
> > options, especially if we expect support for them in higher level
> > tools.  If we place the PASID somewhere that makes the device not work,
> > then disabling PASID on the vIOMMU should resolve that.  It won't be a  
> 
> vIOMMU is per-platform then it applies to all devices behind, including
> those which don't have a problem with auto-selected offset. Not sure
> whether one would want to continue enabling PASID for other devices
> or should stop immediately to find a quirk for the problematic one and
> then resume.

I'm not sure if this is a real issue, we're talking about a VM, not a
server.  If a user wants PASID support and it's incompatible with a
device, the device can be excluded from the VM or we can have an
experimental option on the vfio-pci device in QEMU as a workaround.  I
don't think this is something we need to plumb up through the tool
stack.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-02 18:25                                                   ` Alex Williamson
@ 2024-08-05  5:35                                                     ` Tian, Kevin
  2024-08-06 14:20                                                       ` Jason Gunthorpe
  2024-08-06 14:05                                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Tian, Kevin @ 2024-08-05  5:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, August 3, 2024 2:25 AM
> 
> On Thu, 1 Aug 2024 07:45:43 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, August 1, 2024 1:05 AM
> > >
> > > On Wed, 31 Jul 2024 05:15:25 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Wednesday, July 31, 2024 1:35 AM
> > > > >
> > > > > So what are we trying to accomplish here.  PASID is the first
> > > > > non-device specific virtual capability that we'd like to insert into
> > > > > the VM view of the capability chain.  It won't be the last.
> > > > >
> > > > >  - Do we push the policy of defining the capability offset to the user?
> > > >
> > > > Looks yes as I didn't see a strong argument for the opposite way.
> > >
> > > It's a policy choice though, so where and how is it implemented?  It
> > > works fine for those of us willing to edit xml or launch VMs by command
> > > line, but libvirt isn't going to sign up to insert a policy choice for
> > > a device.  If we get to even higher level tools, does anything that
> > > wants to implement PASID support required a vendor operator driver to
> > > make such policy choices (btw, I'm just throwing out the "operator"
> > > term as if I know what it means, I don't).
> >
> > I had a rough feeling that there might be other usages requiring such
> > vendor plugin, e.g. provisioning VF/ADI may require vendor specific
> > configurations, but not really an expert in this area.
> >
> > Overall I feel most of our discussions so far are about VMM-auto-
> > find-offset vs. file-based-policy-scheme which both belong to
> > user-defined policy, suggesting that we all agreed to drop the other
> > way having kernel define the offset (plus in-kernel quirks, etc.)?
> >
> > Even the said DVSEC is to assist such user-defined direction.
> 
> To me a "user defined policy" is placing an option on the command line
> and requiring the user, or some higher level authority representing the
> user, to provide the policy.  If it's done by the VMM then we're saying
> QEMU owns the policy but it might be overridden by the user via a
> command line argument or modifying the policy file consumed by QEMU.

Okay, I see the difference here. In my reply it's clearer to say
"userspace" instead of "user". 😊

> > > involved with that, so why does it make sense for vfio-pci to be
> > > involved in reporting something that is more iommufd specific?
> >
> > It doesn't matter which one involves more. It's more akin to the
> > physical world.
> >
> > btw vfio-pci already reports ATS/PRI which both rely on iommufd
> > in vconfig space. Throwing PASID alone to iommufd uAPI lacks of a
> > good justification for why it's special.
> >
> > I envision an extension to vfio device feature or a new vfio uAPI
> > for reporting virtual capabilities as augment to the ones filled in
> > vconfig space.
> 
> Should ATS and PRI be reported through vfio-pci or should we just turn
> them off to be more like PASID?  Maybe the issue simply hasn't arisen
> yet because we don't have vIOMMU support and with that support QEMU
> might need to filter out those capabilities and look elsewhere.
> Anyway, iommufd and vfio-pci should not duplicate each other here.

If no-duplication is the agreed way, yes it's clearer to have
ATS/PRI/PASID reported consistently via iommufd hence hidden
in vfio-pci.

Given there is no vIOMMU support this change shouldn't break
any applications.

> 
> > > > > then we just look for a gap and add the capability.  If we end up with
> > > > > different results between source and target for migration, then
> > > > > migration will fail.  Possibly we end up with a quirk table to override
> > > > > the default placement of specific capabilities on specific devices.
> > > >
> > > > emm how does a quirk table work with devices having volatile config
> > > > space layout cross FW versions? Can VMM assigned with a VF be able
> > > > to check the FW version of the PF?
> > >
> > > If the VMM can't find the same gap between source and destination then
> > > a quirk could make sure that the PASID offset is consistent.  But also
> > > if the VMM doesn't find the same gap then that suggests the config
> > > space is already different and not only the offset of the PASID
> > > capability will need to be fixed via a quirk, so then we're into
> > > quirking the entire capability space for the device.
> >
> > yes. So the quirk table is more for fixing the functional gap (i.e. not
> > overlap with a hidden register) instead of for migration. As long as
> > a device can function correctly with it, the virtual caps fall into the
> > same restriction as physical caps in migration i.e. upon inconsistent
> > layout between src/dest we'll need separate way to synthesize the
> > entire space.
> 
> Yes.
> 
> > > The VMM should not be assumed to have any additional privileges
> beyond
> > > what we provide it through the vfio device and iommufd interface.
> > > Testing anything about the PF would require access on the host that
> > > won't work in more secure environments.  Therefore if we can't
> > > consistently place the PASID for a device, we probably need to quirk it
> > > based on the vendor/device IDs or sub-IDs or we need to rely on a
> > > management implied policy such as a device profile option on the QEMU
> > > command line or maybe different classes of the vfio-pci driver in QEMU.
> > >
> > > > > That might evolve into a lookup for where we place all capabilities,
> > > > > which essentially turns into the "file" where the VMM defines the
> entire
> > > > > layout for some devices.
> > > >
> > > > Overall this sounds a feasible path to move forward - starting with
> > > > the VMM to find the gap automatically if a new PASID option is
> > > > opted in. Devices with hidden registers may fail. Devices with volatile
> > > > config space due to FW upgrade or cross vendors may fail to migrate.
> > > > Then evolving it to the file-based scheme, and there is time to discuss
> > > > any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> > > > between.
> > >
> > > As above, let's be careful about introducing unnecessary command line
> > > options, especially if we expect support for them in higher level
> > > tools.  If we place the PASID somewhere that makes the device not work,
> > > then disabling PASID on the vIOMMU should resolve that.  It won't be a
> >
> > vIOMMU is per-platform then it applies to all devices behind, including
> > those which don't have a problem with auto-selected offset. Not sure
> > whether one would want to continue enabling PASID for other devices
> > or should stop immediately to find a quirk for the problematic one and
> > then resume.
> 
> I'm not sure if this is a real issue, we're talking about a VM, not a
> server.  If a user wants PASID support and it's incompatible with a
> device, the device can be excluded from the VM or we can have an
> experimental option on the vfio-pci device in QEMU as a workaround.  I
> don't think this is something we need to plumb up through the tool
> stack.  Thanks,
> 

Okay. With that I edited my earlier reply a bit by removing the note
of cmdline option, adding DVSEC possibility, and making it clear that
the PASID option is in vIOMMU:

"
Overall this sounds a feasible path to move forward - starting with
the VMM to find the gap automatically if PASID is opted in vIOMMU. 
Devices with hidden registers may fail. Devices with volatile
config space due to FW upgrade or cross vendors may fail to migrate.
Then evolving it to the file-based scheme, and there is time to discuss
any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
"

Jason, your thoughts?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-07-31 17:04                                               ` Alex Williamson
  2024-08-01  7:45                                                 ` Tian, Kevin
@ 2024-08-06 13:54                                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 13:54 UTC (permalink / raw)
  To: Alex Williamson, Vasant Hegde
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Wed, Jul 31, 2024 at 11:04:36AM -0600, Alex Williamson wrote:

> I'd also be careful about command line parameters.  I think we require
> one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
> on the vfio-pci device, instead simply enabling support when both the
> vIOMMU support is enabled and the device is detected to support it.
> Each command line option requires support in the upper level tools to
> enable it.

FWIW AMD seems to have an issue here where if they don't intend to use
PASID then they get a bunch more performance if they leave it turned
off. 

So cases like no vIOMMU, or a vIOMMU with no PASID capability, should
automatically tell the kernel to disable PASID support for the
device. These decisions needs to be made when the HWPT's are first
allocated.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-02 18:25                                                   ` Alex Williamson
  2024-08-05  5:35                                                     ` Tian, Kevin
@ 2024-08-06 14:05                                                     ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 14:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Fri, Aug 02, 2024 at 12:25:28PM -0600, Alex Williamson wrote:

> > I envision an extension to vfio device feature or a new vfio uAPI
> > for reporting virtual capabilities as augment to the ones filled in
> > vconfig space. 
> 
> Should ATS and PRI be reported through vfio-pci or should we just turn
> them off to be more like PASID?  Maybe the issue simply hasn't arisen
> yet because we don't have vIOMMU support and with that support QEMU
> might need to filter out those capabilities and look elsewhere.
> Anyway, iommufd and vfio-pci should not duplicate each other here.

ATS and PRI are strictly controlled only by the iommu driver, VFIO
must not change them dynamically.

Effectively they become set/unset based on what domains are attached
through iommufd.

PRI is turned on when any fault capable domain is attached, eg the
HWPT uses IOMMU_HWPT_FAULT_ID_VALID. I'm not sure I know the full
story here how this will work on the qemu side, but I would guess that
PRI changes in the virtual config space will result in different
nesting domains becoming attached?

ATS, at least for SMMUv3, is controlled by a bit in the vSTE from the
guest, the virtual PCI config space changes should not be reflected to
real HW. There are data integrity/security concerns here. ATS must
only be changed as part of a sequence that can flush the cache.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-05  5:35                                                     ` Tian, Kevin
@ 2024-08-06 14:20                                                       ` Jason Gunthorpe
  2024-08-14  6:38                                                         ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 14:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:

> Okay. With that I edited my earlier reply a bit by removing the note
> of cmdline option, adding DVSEC possibility, and making it clear that
> the PASID option is in vIOMMU:
> 
> "
> Overall this sounds a feasible path to move forward - starting with
> the VMM to find the gap automatically if PASID is opted in vIOMMU. 
> Devices with hidden registers may fail. Devices with volatile
> config space due to FW upgrade or cross vendors may fail to migrate.
> Then evolving it to the file-based scheme, and there is time to discuss
> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
> "
> 
> Jason, your thoughts?

This thread is big and I've read it quickly, but I could support the
above summary.

For migration, where we are today, it is completely up to the device
and it's FW to present a new config space that is stable across
migration. In practice this means the device needs to have pre-set the
PF to whatever config state it needs during FLR, and refuse to migrate
if things are not correct.

Moving away from that limitation toward a VMM created stable config
space is definitely a long term interest. I understand other VMM's are
already doing things like this.

It seems we need more time to bake on this long term direction. I
suggested a text file as something very general, but we can do other
things too.

What I see as the general industry direction is toward a very
perspective vPCI function, where you might have v1, v2, etc of a
device that present different config space/behavior/etc. I expect
standards bodies are going to define reference vPCI functions for
their standards along these lines. This would specify all the MMIO
memory layout and every bit of config space right down to each offset
and bit.

This is the sort of restriction that will be needed to allow more
generic live migration between different devices and different
FW. There is a pretty clear role here for the VMM to synthesize a
highly perscribed config space. How we get there and how the libvirt
stack should support this, I don't know.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-07-30 17:35                                           ` Alex Williamson
  2024-07-31  5:15                                             ` Tian, Kevin
@ 2024-08-06 14:30                                             ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 14:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On Tue, Jul 30, 2024 at 11:35:17AM -0600, Alex Williamson wrote:

> Even if QEMU defines the layout for a device, there may be multiple
> versions of that device.  For example, maybe we just add PASID now, but
> at some point we decide that we do want to replicate the PF serial
> number capability.  At that point we have versions of the device which
> would need to be tied to versions of the machine and maybe also
> selected via a profile switch on the device command line.

This is the larger end state I see here.

When the admin asks to install a vPCI function into a VM the admin
would specify they want, say, 'virtio-net vPCI spec 1.0'.

The text of spec 1.0 would specify each and every bit of config space
and register layout. It would specify the content of the live
migration blobs, and an exact feature set/behaviors the device must
advertise.

Even if the device can do more, a combination of the VMM and the
variant PCI driver+device FW would make it do only what spec 1.0 says
it should do.

You can imagine there would be a range of these standards available,
and large sites have a high chance to have their own private forks.

This should be viewed as totally opposite the current VFIO behavior
that intends to reflect the exact device, as-is, into the VM. I think
we can reasonably have different approaches to each.

So, how we manage this as an ecosystem, I don't know. It sure would be
nice to not need kernel changes to push through a new virtual device
spec!

But I think your summary is good.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-06 14:20                                                       ` Jason Gunthorpe
@ 2024-08-14  6:38                                                         ` Yi Liu
  2024-08-14  7:38                                                           ` Tian, Kevin
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-08-14  6:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Alex Williamson, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On 2024/8/6 22:20, Jason Gunthorpe wrote:
> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
> 
>> Okay. With that I edited my earlier reply a bit by removing the note
>> of cmdline option, adding DVSEC possibility, and making it clear that
>> the PASID option is in vIOMMU:
>>
>> "
>> Overall this sounds a feasible path to move forward - starting with
>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>> Devices with hidden registers may fail. Devices with volatile
>> config space due to FW upgrade or cross vendors may fail to migrate.
>> Then evolving it to the file-based scheme, and there is time to discuss
>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>> "
>>
>> Jason, your thoughts?
> 
> This thread is big and I've read it quickly, but I could support the
> above summary.
> 

thanks for the ideas. I think we still need a uapi to report if the device
supports PASID or not. Do we have agreement on where should this uapi be
defined? vfio or iommufd.

Besides, I've a question on how the userspace know the hidden registers
when trying to find a gap for the vPASID cap. It should only know the
standard pci caps.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-14  6:38                                                         ` Yi Liu
@ 2024-08-14  7:38                                                           ` Tian, Kevin
  2024-08-14  8:19                                                             ` Yi Liu
  2024-09-09 12:59                                                             ` Yi Liu
  0 siblings, 2 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-08-14  7:38 UTC (permalink / raw)
  To: Liu, Yi L, Jason Gunthorpe
  Cc: Alex Williamson, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, August 14, 2024 2:39 PM
> 
> On 2024/8/6 22:20, Jason Gunthorpe wrote:
> > On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
> >
> >> Okay. With that I edited my earlier reply a bit by removing the note
> >> of cmdline option, adding DVSEC possibility, and making it clear that
> >> the PASID option is in vIOMMU:
> >>
> >> "
> >> Overall this sounds a feasible path to move forward - starting with
> >> the VMM to find the gap automatically if PASID is opted in vIOMMU.
> >> Devices with hidden registers may fail. Devices with volatile
> >> config space due to FW upgrade or cross vendors may fail to migrate.
> >> Then evolving it to the file-based scheme, and there is time to discuss
> >> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
> >> "
> >>
> >> Jason, your thoughts?
> >
> > This thread is big and I've read it quickly, but I could support the
> > above summary.
> >
> 
> thanks for the ideas. I think we still need a uapi to report if the device
> supports PASID or not. Do we have agreement on where should this uapi be
> defined? vfio or iommufd.

IOMMUFD_CMD_GET_HW_INFO. 

> 
> Besides, I've a question on how the userspace know the hidden registers
> when trying to find a gap for the vPASID cap. It should only know the
> standard pci caps.
> 

for the initial implementation VMM doesn't know any hidden registers.
The user passes a new vIOMMU option to the VMM for exposing
the PASID capability in vIOMMU and in device, based on info in
IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between
existing caps to put the vPASID cap. If a device with hidden registers
doesn’t work correctly then then a quirk may be added for it.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-14  7:38                                                           ` Tian, Kevin
@ 2024-08-14  8:19                                                             ` Yi Liu
  2024-08-14  8:26                                                               ` Tian, Kevin
  2024-08-14 14:40                                                               ` Jason Gunthorpe
  2024-09-09 12:59                                                             ` Yi Liu
  1 sibling, 2 replies; 102+ messages in thread
From: Yi Liu @ 2024-08-14  8:19 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Alex Williamson, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On 2024/8/14 15:38, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Wednesday, August 14, 2024 2:39 PM
>>
>> On 2024/8/6 22:20, Jason Gunthorpe wrote:
>>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
>>>
>>>> Okay. With that I edited my earlier reply a bit by removing the note
>>>> of cmdline option, adding DVSEC possibility, and making it clear that
>>>> the PASID option is in vIOMMU:
>>>>
>>>> "
>>>> Overall this sounds a feasible path to move forward - starting with
>>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>>>> Devices with hidden registers may fail. Devices with volatile
>>>> config space due to FW upgrade or cross vendors may fail to migrate.
>>>> Then evolving it to the file-based scheme, and there is time to discuss
>>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>>>> "
>>>>
>>>> Jason, your thoughts?
>>>
>>> This thread is big and I've read it quickly, but I could support the
>>> above summary.
>>>
>>
>> thanks for the ideas. I think we still need a uapi to report if the device
>> supports PASID or not. Do we have agreement on where should this uapi be
>> defined? vfio or iommufd.
> 
> IOMMUFD_CMD_GET_HW_INFO.

I see. TBH. The existing GET_HW_INFO is for iommu hw info. Extending it to
report PASID supporting makes it cover device capability now. I may need to
add one more capability enum in struct iommu_hw_info just like the below
one. Perhaps name it as iommufd_device_capabilities. And only set the PASID 
capability in the new device capability when the device's or its PF's pasid 
cap is enabled. Does it look good?


/**
  * enum iommufd_hw_capabilities
  * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
  *                               If available, it means the following APIs
  *                               are supported:
  *
  *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
  *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
  *
  */
enum iommufd_hw_capabilities {
	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
};

>>
>> Besides, I've a question on how the userspace know the hidden registers
>> when trying to find a gap for the vPASID cap. It should only know the
>> standard pci caps.
>>
> 
> for the initial implementation VMM doesn't know any hidden registers.
> The user passes a new vIOMMU option to the VMM for exposing
> the PASID capability in vIOMMU and in device, based on info in
> IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between
> existing caps to put the vPASID cap. If a device with hidden registers
> doesn’t work correctly then then a quirk may be added for it.

I see. But we cannot know it until a guest device driver failed. This is
acceptable. right? Perhaps this should be documented somewhere to let the
user/device vendor know before safely exposing vPASID cap on a device. :)

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-14  8:19                                                             ` Yi Liu
@ 2024-08-14  8:26                                                               ` Tian, Kevin
  2024-08-14 14:40                                                               ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Tian, Kevin @ 2024-08-14  8:26 UTC (permalink / raw)
  To: Liu, Yi L, Jason Gunthorpe
  Cc: Alex Williamson, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, August 14, 2024 4:19 PM
> 
> On 2024/8/14 15:38, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Wednesday, August 14, 2024 2:39 PM
> >>
> >> On 2024/8/6 22:20, Jason Gunthorpe wrote:
> >>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
> >>>
> >>>> Okay. With that I edited my earlier reply a bit by removing the note
> >>>> of cmdline option, adding DVSEC possibility, and making it clear that
> >>>> the PASID option is in vIOMMU:
> >>>>
> >>>> "
> >>>> Overall this sounds a feasible path to move forward - starting with
> >>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
> >>>> Devices with hidden registers may fail. Devices with volatile
> >>>> config space due to FW upgrade or cross vendors may fail to migrate.
> >>>> Then evolving it to the file-based scheme, and there is time to discuss
> >>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
> >>>> "
> >>>>
> >>>> Jason, your thoughts?
> >>>
> >>> This thread is big and I've read it quickly, but I could support the
> >>> above summary.
> >>>
> >>
> >> thanks for the ideas. I think we still need a uapi to report if the device
> >> supports PASID or not. Do we have agreement on where should this uapi
> be
> >> defined? vfio or iommufd.
> >
> > IOMMUFD_CMD_GET_HW_INFO.
> 
> I see. TBH. The existing GET_HW_INFO is for iommu hw info. Extending it to
> report PASID supporting makes it cover device capability now. I may need to
> add one more capability enum in struct iommu_hw_info just like the below
> one. Perhaps name it as iommufd_device_capabilities. And only set the
> PASID
> capability in the new device capability when the device's or its PF's pasid
> cap is enabled. Does it look good?

Just add a new bit in iommufd_hw_capabilities.

Anyway PASID is meaningful only when both device and IOMMU supports
it. Here we are reporting what IOMMU capabilities meaningful to a given
device so checking both does make sense. 

> 
> 
> /**
>   * enum iommufd_hw_capabilities
>   * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for
> dirty tracking
>   *                               If available, it means the following APIs
>   *                               are supported:
>   *
>   *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>   *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>   *
>   */
> enum iommufd_hw_capabilities {
> 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
> };
> 
> >>
> >> Besides, I've a question on how the userspace know the hidden registers
> >> when trying to find a gap for the vPASID cap. It should only know the
> >> standard pci caps.
> >>
> >
> > for the initial implementation VMM doesn't know any hidden registers.
> > The user passes a new vIOMMU option to the VMM for exposing
> > the PASID capability in vIOMMU and in device, based on info in
> > IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between
> > existing caps to put the vPASID cap. If a device with hidden registers
> > doesn’t work correctly then then a quirk may be added for it.
> 
> I see. But we cannot know it until a guest device driver failed. This is
> acceptable. right? Perhaps this should be documented somewhere to let the
> user/device vendor know before safely exposing vPASID cap on a device. :)
> 

presumably in VMM.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-14  8:19                                                             ` Yi Liu
  2024-08-14  8:26                                                               ` Tian, Kevin
@ 2024-08-14 14:40                                                               ` Jason Gunthorpe
  2024-08-15  2:12                                                                 ` Yi Liu
  1 sibling, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 14:40 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:

> /**
>  * enum iommufd_hw_capabilities
>  * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>  *                               If available, it means the following APIs
>  *                               are supported:
>  *
>  *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>  *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>  *
>  */
> enum iommufd_hw_capabilities {
> 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
> };

I think it would be appropriate to add the flag here

Is it OK to rely on the PCI config space PASID enable? I see all the
drivers right now are turning on PASID support during probe if the
iommu supports it.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-14 14:40                                                               ` Jason Gunthorpe
@ 2024-08-15  2:12                                                                 ` Yi Liu
  2024-08-16  8:29                                                                   ` Vasant Hegde
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-08-15  2:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater, Suravee Suthikulpanit,
	Vasant Hegde

On 2024/8/14 22:40, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
> 
>> /**
>>   * enum iommufd_hw_capabilities
>>   * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>   *                               If available, it means the following APIs
>>   *                               are supported:
>>   *
>>   *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>   *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>   *
>>   */
>> enum iommufd_hw_capabilities {
>> 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>> };
> 
> I think it would be appropriate to add the flag here

ok.

> Is it OK to rely on the PCI config space PASID enable? I see all the
> drivers right now are turning on PASID support during probe if the
> iommu supports it.

Intel side is not ready yet as it enables pasid when the device is attached
to a non-blocking domain. I've chatted with Baolu, and he will kindly to
enable the pasid cap in the probe_device() op if both iommu and device has
this cap. After that, Intel side should be fine to rely on the PASID enable
bit in the PCI config space.

How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-15  2:12                                                                 ` Yi Liu
@ 2024-08-16  8:29                                                                   ` Vasant Hegde
  2024-08-16 11:52                                                                     ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Vasant Hegde @ 2024-08-16  8:29 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater, Suravee Suthikulpanit

Yi,


On 8/15/2024 7:42 AM, Yi Liu wrote:
> On 2024/8/14 22:40, Jason Gunthorpe wrote:
>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
>>
>>> /**
>>>   * enum iommufd_hw_capabilities
>>>   * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>>   *                               If available, it means the following APIs
>>>   *                               are supported:
>>>   *
>>>   *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>>   *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>>   *
>>>   */
>>> enum iommufd_hw_capabilities {
>>>     IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>>> };
>>
>> I think it would be appropriate to add the flag here
> 
> ok.
> 
>> Is it OK to rely on the PCI config space PASID enable? I see all the
>> drivers right now are turning on PASID support during probe if the
>> iommu supports it.
> 
> Intel side is not ready yet as it enables pasid when the device is attached
> to a non-blocking domain. I've chatted with Baolu, and he will kindly to
> enable the pasid cap in the probe_device() op if both iommu and device has
> this cap. After that, Intel side should be fine to rely on the PASID enable
> bit in the PCI config space.
> 
> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?

AMD driver currently discovers capability in probe_device() and enabled it in
attach_dev() path.

-Vasant

> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-16  8:29                                                                   ` Vasant Hegde
@ 2024-08-16 11:52                                                                     ` Yi Liu
  2024-08-16 12:01                                                                       ` Vasant Hegde
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-08-16 11:52 UTC (permalink / raw)
  To: Vasant Hegde, Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater, Suravee Suthikulpanit

On 2024/8/16 16:29, Vasant Hegde wrote:
> Yi,
> 
> 
> On 8/15/2024 7:42 AM, Yi Liu wrote:
>> On 2024/8/14 22:40, Jason Gunthorpe wrote:
>>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
>>>
>>>> /**
>>>>    * enum iommufd_hw_capabilities
>>>>    * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>>>    *                               If available, it means the following APIs
>>>>    *                               are supported:
>>>>    *
>>>>    *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>>>    *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>>>    *
>>>>    */
>>>> enum iommufd_hw_capabilities {
>>>>      IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>>>> };
>>>
>>> I think it would be appropriate to add the flag here
>>
>> ok.
>>
>>> Is it OK to rely on the PCI config space PASID enable? I see all the
>>> drivers right now are turning on PASID support during probe if the
>>> iommu supports it.
>>
>> Intel side is not ready yet as it enables pasid when the device is attached
>> to a non-blocking domain. I've chatted with Baolu, and he will kindly to
>> enable the pasid cap in the probe_device() op if both iommu and device has
>> this cap. After that, Intel side should be fine to rely on the PASID enable
>> bit in the PCI config space.
>>
>> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?
> 
> AMD driver currently discovers capability in probe_device() and enabled it in
> attach_dev() path.

I see. So AMD side also has a gap. Is it easy to make it suit Jason's
suggestion in the above?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-16 11:52                                                                     ` Yi Liu
@ 2024-08-16 12:01                                                                       ` Vasant Hegde
  2024-08-16 12:52                                                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Vasant Hegde @ 2024-08-16 12:01 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe, Baolu Lu
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater, Suravee Suthikulpanit

Yi,


On 8/16/2024 5:22 PM, Yi Liu wrote:
> On 2024/8/16 16:29, Vasant Hegde wrote:
>> Yi,
>>
>>
>> On 8/15/2024 7:42 AM, Yi Liu wrote:
>>> On 2024/8/14 22:40, Jason Gunthorpe wrote:
>>>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
>>>>
>>>>> /**
>>>>>    * enum iommufd_hw_capabilities
>>>>>    * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>>>>    *                               If available, it means the following APIs
>>>>>    *                               are supported:
>>>>>    *
>>>>>    *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>>>>    *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>>>>    *
>>>>>    */
>>>>> enum iommufd_hw_capabilities {
>>>>>      IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>>>>> };
>>>>
>>>> I think it would be appropriate to add the flag here
>>>
>>> ok.
>>>
>>>> Is it OK to rely on the PCI config space PASID enable? I see all the
>>>> drivers right now are turning on PASID support during probe if the
>>>> iommu supports it.
>>>
>>> Intel side is not ready yet as it enables pasid when the device is attached
>>> to a non-blocking domain. I've chatted with Baolu, and he will kindly to
>>> enable the pasid cap in the probe_device() op if both iommu and device has
>>> this cap. After that, Intel side should be fine to rely on the PASID enable
>>> bit in the PCI config space.
>>>
>>> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?
>>
>> AMD driver currently discovers capability in probe_device() and enabled it in
>> attach_dev() path.
> 
> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
> suggestion in the above?

We can do that. We can enable ATS, PRI and PASID capability during probe time
and keep it enabled always.

-Vasant



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-16 12:01                                                                       ` Vasant Hegde
@ 2024-08-16 12:52                                                                         ` Jason Gunthorpe
  2024-08-16 13:14                                                                           ` Baolu Lu
  2024-08-19  8:21                                                                           ` Vasant Hegde
  0 siblings, 2 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-08-16 12:52 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: Yi Liu, Baolu Lu, Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater, Suravee Suthikulpanit

On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
> > I see. So AMD side also has a gap. Is it easy to make it suit Jason's
> > suggestion in the above?
> 
> We can do that. We can enable ATS, PRI and PASID capability during probe time
> and keep it enabled always.

I don't see a downside to enabling PASID at probe time, it exists to
handshake with the device if the root complex is able to understand
PASID TLPs.

SMMU3 calls 
 arm_smmu_probe_device
  arm_smmu_enable_pasid
   pci_enable_pasid

So it looks Ok

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-16 12:52                                                                         ` Jason Gunthorpe
@ 2024-08-16 13:14                                                                           ` Baolu Lu
  2024-08-19  8:21                                                                           ` Vasant Hegde
  1 sibling, 0 replies; 102+ messages in thread
From: Baolu Lu @ 2024-08-16 13:14 UTC (permalink / raw)
  To: Jason Gunthorpe, Vasant Hegde
  Cc: baolu.lu, Yi Liu, Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater, Suravee Suthikulpanit

On 2024/8/16 20:52, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
>>> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
>>> suggestion in the above?
>> We can do that. We can enable ATS, PRI and PASID capability during probe time
>> and keep it enabled always.
> I don't see a downside to enabling PASID at probe time, it exists to
> handshake with the device if the root complex is able to understand
> PASID TLPs.
> 
> SMMU3 calls
>   arm_smmu_probe_device
>    arm_smmu_enable_pasid
>     pci_enable_pasid
> 
> So it looks Ok

I made a patch for the Intel driver.

https://lore.kernel.org/linux-iommu/20240816104945.97160-1-baolu.lu@linux.intel.com/

Thanks,
baolu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-16 12:52                                                                         ` Jason Gunthorpe
  2024-08-16 13:14                                                                           ` Baolu Lu
@ 2024-08-19  8:21                                                                           ` Vasant Hegde
  1 sibling, 0 replies; 102+ messages in thread
From: Vasant Hegde @ 2024-08-19  8:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Baolu Lu, Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater, Suravee Suthikulpanit

Hi Jason, Yi,


On 8/16/2024 6:22 PM, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
>>> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
>>> suggestion in the above?
>>
>> We can do that. We can enable ATS, PRI and PASID capability during probe time
>> and keep it enabled always.
> 
> I don't see a downside to enabling PASID at probe time, it exists to
> handshake with the device if the root complex is able to understand
> PASID TLPs.

Ack. I will put it on my list. Will make change to AMD driver soon.

-Vasant

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-08-14  7:38                                                           ` Tian, Kevin
  2024-08-14  8:19                                                             ` Yi Liu
@ 2024-09-09 12:59                                                             ` Yi Liu
  2024-09-09 13:04                                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-09-09 12:59 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Alex Williamson, joro@8bytes.org, robin.murphy@arm.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	chao.p.peng@linux.intel.com, iommu@lists.linux.dev,
	baolu.lu@linux.intel.com, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On 2024/8/14 15:38, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Wednesday, August 14, 2024 2:39 PM
>>
>> On 2024/8/6 22:20, Jason Gunthorpe wrote:
>>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
>>>
>>>> Okay. With that I edited my earlier reply a bit by removing the note
>>>> of cmdline option, adding DVSEC possibility, and making it clear that
>>>> the PASID option is in vIOMMU:
>>>>
>>>> "
>>>> Overall this sounds a feasible path to move forward - starting with
>>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>>>> Devices with hidden registers may fail. Devices with volatile
>>>> config space due to FW upgrade or cross vendors may fail to migrate.
>>>> Then evolving it to the file-based scheme, and there is time to discuss
>>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>>>> "
>>>>
>>>> Jason, your thoughts?
>>>
>>> This thread is big and I've read it quickly, but I could support the
>>> above summary.
>>>
>>
>> thanks for the ideas. I think we still need a uapi to report if the device
>> supports PASID or not. Do we have agreement on where should this uapi be
>> defined? vfio or iommufd.
> 
> IOMMUFD_CMD_GET_HW_INFO.

Hi Kevin, Jason,

In order to synthesize the vPASID cap, the VMM should get to know the
capabilities like Privilege mode, Execute permission from the physical
device's config space. We have two choices as well. vfio or iommufd.

It appears to be better reporting the capabilities via vfio uapi (e.g.
VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
PASID capabilities to userspace. Please let me know your preference. :)

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-09-09 12:59                                                             ` Yi Liu
@ 2024-09-09 13:04                                                               ` Jason Gunthorpe
  2024-09-09 13:29                                                                 ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-09-09 13:04 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:

> In order to synthesize the vPASID cap, the VMM should get to know the
> capabilities like Privilege mode, Execute permission from the physical
> device's config space. We have two choices as well. vfio or iommufd.
> 
> It appears to be better reporting the capabilities via vfio uapi (e.g.
> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
> PASID capabilities to userspace. Please let me know your preference. :)

I don't think you'd need a new data_uptr, that doesn't quite make
sense

What struct data do you imagine needing?

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-09-09 13:04                                                               ` Jason Gunthorpe
@ 2024-09-09 13:29                                                                 ` Yi Liu
  2024-09-09 13:40                                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-09-09 13:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On 2024/9/9 21:04, Jason Gunthorpe wrote:
> On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
> 
>> In order to synthesize the vPASID cap, the VMM should get to know the
>> capabilities like Privilege mode, Execute permission from the physical
>> device's config space. We have two choices as well. vfio or iommufd.
>>
>> It appears to be better reporting the capabilities via vfio uapi (e.g.
>> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
>> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
>> PASID capabilities to userspace. Please let me know your preference. :)
> 
> I don't think you'd need a new data_uptr, that doesn't quite make
> sense
> 
> What struct data do you imagine needing?

something like below.

struct iommufd_hw_info_pasid {
        __u16 capabilities;
#define IOMMUFD_PASID_CAP_EXEC     (1 << 0)
#define IOMMUFD_PASID_CAP_PRIV     (1 << 1)
        __u8 width;
        __u8 __reserved;
};

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-09-09 13:29                                                                 ` Yi Liu
@ 2024-09-09 13:40                                                                   ` Jason Gunthorpe
  2024-09-09 14:02                                                                     ` Yi Liu
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2024-09-09 13:40 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On Mon, Sep 09, 2024 at 09:29:09PM +0800, Yi Liu wrote:
> On 2024/9/9 21:04, Jason Gunthorpe wrote:
> > On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
> > 
> > > In order to synthesize the vPASID cap, the VMM should get to know the
> > > capabilities like Privilege mode, Execute permission from the physical
> > > device's config space. We have two choices as well. vfio or iommufd.
> > > 
> > > It appears to be better reporting the capabilities via vfio uapi (e.g.
> > > VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
> > > add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
> > > PASID capabilities to userspace. Please let me know your preference. :)
> > 
> > I don't think you'd need a new data_uptr, that doesn't quite make
> > sense
> > 
> > What struct data do you imagine needing?
> 
> something like below.
> 
> struct iommufd_hw_info_pasid {
>        __u16 capabilities;
> #define IOMMUFD_PASID_CAP_EXEC     (1 << 0)
> #define IOMMUFD_PASID_CAP_PRIV     (1 << 1)
>        __u8 width;
>        __u8 __reserved;
> };

I think you could just stick that in the top level GET_HW_INFO struct
if you want.

It does make a sense that an iommufd user would need to know that
information, especially width (but call it something better,
max_pasid_log2 or something) to successefully use the iommfd PASID
APIs anyhow.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-09-09 14:02                                                                     ` Yi Liu
@ 2024-09-09 13:59                                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2024-09-09 13:59 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On Mon, Sep 09, 2024 at 10:02:02PM +0800, Yi Liu wrote:

> However, I have one scalability concern. Do we only have PASID and PRI that
> needs to be synthesized by userspace? For PRI, kernel would need to report
> Outstanding Page Request Capacity to userspace. If only PASID and PRI cap,
> it's fine. But if there are more such caps, then the struct iommu_hw_info
> may grow bigger and bigger.

I don't see a particular issue with this though?

As long as it is well defined and it is clear to userspace when what
parts are valid or not it is fine to get pretty big.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-09-09 13:40                                                                   ` Jason Gunthorpe
@ 2024-09-09 14:02                                                                     ` Yi Liu
  2024-09-09 13:59                                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Yi Liu @ 2024-09-09 14:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, joro@8bytes.org,
	robin.murphy@arm.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, chao.p.peng@linux.intel.com,
	iommu@lists.linux.dev, baolu.lu@linux.intel.com, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On 2024/9/9 21:40, Jason Gunthorpe wrote:
> On Mon, Sep 09, 2024 at 09:29:09PM +0800, Yi Liu wrote:
>> On 2024/9/9 21:04, Jason Gunthorpe wrote:
>>> On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
>>>
>>>> In order to synthesize the vPASID cap, the VMM should get to know the
>>>> capabilities like Privilege mode, Execute permission from the physical
>>>> device's config space. We have two choices as well. vfio or iommufd.
>>>>
>>>> It appears to be better reporting the capabilities via vfio uapi (e.g.
>>>> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
>>>> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
>>>> PASID capabilities to userspace. Please let me know your preference. :)
>>>
>>> I don't think you'd need a new data_uptr, that doesn't quite make
>>> sense
>>>
>>> What struct data do you imagine needing?
>>
>> something like below.
>>
>> struct iommufd_hw_info_pasid {
>>         __u16 capabilities;
>> #define IOMMUFD_PASID_CAP_EXEC     (1 << 0)
>> #define IOMMUFD_PASID_CAP_PRIV     (1 << 1)
>>         __u8 width;
>>         __u8 __reserved;
>> };
> 
> I think you could just stick that in the top level GET_HW_INFO struct
> if you want.
> 
> It does make a sense that an iommufd user would need to know that
> information, especially width (but call it something better,
> max_pasid_log2 or something) to successefully use the iommfd PASID
> APIs anyhow.

I see, we may define the IOMMUFD_PASID_CAP_EXEC and IOMMUFD_PASID_CAP_PRIV
in the struct iommu_hw_info::out_capabilities, and add a max_pasid_log2 to
the struct iommu_hw_info.

However, I have one scalability concern. Do we only have PASID and PRI that
needs to be synthesized by userspace? For PRI, kernel would need to report
Outstanding Page Request Capacity to userspace. If only PASID and PRI cap,
it's fine. But if there are more such caps, then the struct iommu_hw_info
may grow bigger and bigger.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2024-09-09 13:59 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
2024-04-16 16:03   ` Alex Williamson
2024-04-18  7:02     ` Yi Liu
2024-04-18 16:23       ` Alex Williamson
2024-04-18 17:12         ` Jason Gunthorpe
2024-04-19 13:43           ` Yi Liu
2024-04-19 13:55             ` Alex Williamson
2024-04-19 14:00               ` Jason Gunthorpe
2024-04-23  7:19                 ` Yi Liu
2024-04-19 13:40         ` Yi Liu
2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
2024-04-16  9:01   ` Tian, Kevin
2024-04-16  9:24     ` Yi Liu
2024-04-16  9:47       ` Tian, Kevin
2024-04-18  7:04         ` Yi Liu
2024-04-23 12:43   ` Jason Gunthorpe
2024-04-24  0:33     ` Tian, Kevin
2024-04-24  4:48     ` Yi Liu
2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
2024-04-16  9:13   ` Tian, Kevin
2024-04-16  9:36     ` Yi Liu
2024-04-23 12:45   ` Jason Gunthorpe
2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
2024-04-16  9:40   ` Tian, Kevin
2024-04-16 17:57   ` Alex Williamson
2024-04-17  7:09     ` Tian, Kevin
2024-04-17 20:25       ` Alex Williamson
2024-04-18  0:21         ` Tian, Kevin
2024-04-18  8:23           ` Yi Liu
2024-04-18 16:34           ` Alex Williamson
2024-04-23 12:39   ` Jason Gunthorpe
2024-04-24  0:24     ` Tian, Kevin
2024-04-24 13:59       ` Jason Gunthorpe
2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
2024-04-16 17:50   ` Jason Gunthorpe
2024-04-17  7:16     ` Tian, Kevin
2024-04-17 12:20       ` Jason Gunthorpe
2024-04-17 23:02         ` Alex Williamson
2024-04-18  0:06           ` Tian, Kevin
2024-04-18  9:03             ` Yi Liu
2024-04-18 20:37               ` Alex Williamson
2024-04-19  5:52                 ` Tian, Kevin
2024-04-19 16:35                   ` Alex Williamson
2024-04-23  7:43                     ` Tian, Kevin
2024-04-23 12:01                       ` Jason Gunthorpe
2024-04-23 23:47                         ` Tian, Kevin
2024-04-24  0:12                           ` Jason Gunthorpe
2024-04-24  2:57                             ` Tian, Kevin
2024-04-24 12:29                               ` Baolu Lu
2024-04-24 14:04                               ` Jason Gunthorpe
2024-04-24  5:19                             ` Tian, Kevin
2024-04-24 14:15                               ` Jason Gunthorpe
2024-04-24 18:38                                 ` Alex Williamson
2024-04-24 18:45                                   ` Jason Gunthorpe
2024-04-24 18:24                             ` Alex Williamson
2024-04-24 18:36                               ` Jason Gunthorpe
2024-04-24 20:13                                 ` Alex Williamson
2024-04-26 14:11                                   ` Jason Gunthorpe
2024-04-26 20:13                                     ` Alex Williamson
2024-04-28  6:19                                       ` Tian, Kevin
2024-04-29  7:43                                         ` Yi Liu
2024-04-29 17:15                                         ` Jason Gunthorpe
2024-04-29 17:44                                       ` Jason Gunthorpe
2024-07-18 13:02                                         ` Yi Liu
2024-07-24  2:26                                         ` Tian, Kevin
2024-07-30 17:35                                           ` Alex Williamson
2024-07-31  5:15                                             ` Tian, Kevin
2024-07-31 17:04                                               ` Alex Williamson
2024-08-01  7:45                                                 ` Tian, Kevin
2024-08-02 18:25                                                   ` Alex Williamson
2024-08-05  5:35                                                     ` Tian, Kevin
2024-08-06 14:20                                                       ` Jason Gunthorpe
2024-08-14  6:38                                                         ` Yi Liu
2024-08-14  7:38                                                           ` Tian, Kevin
2024-08-14  8:19                                                             ` Yi Liu
2024-08-14  8:26                                                               ` Tian, Kevin
2024-08-14 14:40                                                               ` Jason Gunthorpe
2024-08-15  2:12                                                                 ` Yi Liu
2024-08-16  8:29                                                                   ` Vasant Hegde
2024-08-16 11:52                                                                     ` Yi Liu
2024-08-16 12:01                                                                       ` Vasant Hegde
2024-08-16 12:52                                                                         ` Jason Gunthorpe
2024-08-16 13:14                                                                           ` Baolu Lu
2024-08-19  8:21                                                                           ` Vasant Hegde
2024-09-09 12:59                                                             ` Yi Liu
2024-09-09 13:04                                                               ` Jason Gunthorpe
2024-09-09 13:29                                                                 ` Yi Liu
2024-09-09 13:40                                                                   ` Jason Gunthorpe
2024-09-09 14:02                                                                     ` Yi Liu
2024-09-09 13:59                                                                       ` Jason Gunthorpe
2024-08-06 14:05                                                     ` Jason Gunthorpe
2024-08-06 13:54                                                 ` Jason Gunthorpe
2024-08-06 14:30                                             ` Jason Gunthorpe
2024-04-27  5:05                                     ` Christoph Hellwig
2024-04-25  9:26                               ` Yi Liu
2024-04-25 12:58                                 ` Alex Williamson
2024-04-26  9:01                                   ` Yi Liu
2024-04-19 13:59                 ` Jason Gunthorpe
2024-04-23  7:58                   ` Yi Liu
2024-04-23 12:05                     ` Jason Gunthorpe
2024-04-19 13:34           ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).