public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device
@ 2023-04-26 14:54 Yi Liu
  2023-04-26 14:54 ` [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration Yi Liu
                   ` (10 more replies)
  0 siblings, 11 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

VFIO_DEVICE_PCI_HOT_RESET requires user to pass an array of group fds
to prove that it owns all devices affected by resetting the calling
device. While for cdev devices, user can use an iommufd-based ownership
checking model and invoke VFIO_DEVICE_PCI_HOT_RESET with a zero-length
fd array.

This series first creates iommufd_access for noiommu devices to fill the
gap for adding iommufd-based ownership checking model, then extends
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to check ownership and return the
check result and the devid of affected devices to user. In the end, extends
the VFIO_DEVICE_PCI_HOT_RESET to accept zero-length fd array for hot-reset
with cdev devices.

The new hot reset method and updated _INFO ioctl are tested with the
below qemu:

https://github.com/yiliu1765/qemu/tree/iommufd_rfcv4.mig.reset.v4_var3
(requires to test with the cdev kernel)

Change log:

v4:
 - Rename the patch series subject
 - Patch 01 is moved from the cdev series
 - Patch 02, 06 are new per review comments in v3
 - Patch 03/04/05/07/08/09 are from v3 with updates

v3: https://lore.kernel.org/kvm/20230401144429.88673-1-yi.l.liu@intel.com/
 - Remove the new _INFO ioctl of v2, extend the existing _INFO ioctl to
   report devid (Alex)
 - Add r-b from Jason
 - Add t-b from Terrence Xu and Yanting Jiang (mainly regression test)

v2: https://lore.kernel.org/kvm/20230327093458.44939-1-yi.l.liu@intel.com/
 - Split the patch 03 of v1 to be 03, 04 and 05 of v2 (Jaon)
 - Add r-b from Kevin and Jason
 - Add patch 10 to introduce a new _INFO ioctl for the usage of device
   fd passing usage in cdev path (Jason, Alex)

v1: https://lore.kernel.org/kvm/20230316124156.12064-1-yi.l.liu@intel.com/

Regards,
	Yi Liu

Yi Liu (9):
  vfio: Determine noiommu in vfio_device registration
  vfio-iommufd: Create iommufd_access for noiommu devices
  vfio/pci: Update comment around group_fd get in
    vfio_pci_ioctl_pci_hot_reset()
  vfio/pci: Move the existing hot reset logic to be a helper
  vfio: Mark cdev usage in vfio_device
  iommufd: Reserved -1 in the iommufd xarray
  vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for
    vfio_device
  vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device
    cdev
  vfio/pci: Allow passing zero-length fd array in
    VFIO_DEVICE_PCI_HOT_RESET

 drivers/iommu/iommufd/device.c   |  24 ++++
 drivers/iommu/iommufd/main.c     |   5 +-
 drivers/vfio/iommufd.c           |  48 +++++--
 drivers/vfio/pci/vfio_pci_core.c | 223 +++++++++++++++++++++++++------
 drivers/vfio/vfio.h              |   7 +-
 drivers/vfio/vfio_main.c         |   4 +
 include/linux/iommufd.h          |   6 +
 include/linux/vfio.h             |  22 +++
 include/uapi/linux/vfio.h        |  61 ++++++++-
 9 files changed, 347 insertions(+), 53 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:36   ` Tian, Kevin
  2023-04-26 14:54 ` [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices Yi Liu
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

This adds a noiommu flag in vfio_device, hence caller of the
vfio_device_is_noiommu() just refers to the flag for noiommu
check.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/iommufd.c   | 4 ++--
 drivers/vfio/vfio.h      | 7 ++++---
 drivers/vfio/vfio_main.c | 4 ++++
 include/linux/vfio.h     | 1 +
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 88b00c501015..895852ad37ed 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -18,7 +18,7 @@ int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
 
 	lockdep_assert_held(&vdev->dev_set->lock);
 
-	if (vfio_device_is_noiommu(vdev)) {
+	if (vdev->noiommu) {
 		if (!capable(CAP_SYS_RAWIO))
 			return -EPERM;
 
@@ -59,7 +59,7 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
 {
 	lockdep_assert_held(&vdev->dev_set->lock);
 
-	if (vfio_device_is_noiommu(vdev))
+	if (vdev->noiommu)
 		return;
 
 	if (vdev->ops->unbind_iommufd)
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 7b19c621e0e6..1ddf43863ad6 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -88,10 +88,11 @@ bool vfio_device_has_container(struct vfio_device *device);
 int __init vfio_group_init(void);
 void vfio_group_cleanup(void);
 
-static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
+static inline int vfio_device_set_noiommu(struct vfio_device *device)
 {
-	return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
-	       vdev->group->type == VFIO_NO_IOMMU;
+	device->noiommu = IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
+			  device->group->type == VFIO_NO_IOMMU;
+	return 0;
 }
 
 #if IS_ENABLED(CONFIG_VFIO_CONTAINER)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 89497c933490..09be9df2ceca 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -276,6 +276,10 @@ static int __vfio_register_dev(struct vfio_device *device,
 	if (ret)
 		return ret;
 
+	ret = vfio_device_set_noiommu(device);
+	if (ret)
+		goto err_out;
+
 	ret = device_add(&device->device);
 	if (ret)
 		goto err_out;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 2c137ea94a3e..4ee613924435 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -62,6 +62,7 @@ struct vfio_device {
 	struct iommufd_device *iommufd_device;
 	bool iommufd_attached;
 #endif
+	bool noiommu;
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
  2023-04-26 14:54 ` [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:39   ` Tian, Kevin
  2023-04-26 14:54 ` [PATCH v4 3/9] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

This binds noiommu device to iommufd and creates iommufd_access for this
bond. This is useful for adding an iommufd-based device ownership check
for VFIO_DEVICE_PCI_HOT_RESET since this model requires all the other
affected devices bound to the same iommufd as the device to be reset.
For noiommu devices, there is no backend iommu, so create iommufd_access
instead of iommufd_device.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/iommufd.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 895852ad37ed..ca29c4feded3 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -29,7 +29,8 @@ int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
 		 */
 		if (!iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id))
 			return -EPERM;
-		return 0;
+
+		return vfio_iommufd_emulated_bind(vdev, ictx, &device_id);
 	}
 
 	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
@@ -59,8 +60,10 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
 {
 	lockdep_assert_held(&vdev->dev_set->lock);
 
-	if (vdev->noiommu)
+	if (vdev->noiommu) {
+		vfio_iommufd_emulated_unbind(vdev);
 		return;
+	}
 
 	if (vdev->ops->unbind_iommufd)
 		vdev->ops->unbind_iommufd(vdev);
@@ -110,10 +113,14 @@ int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_attach_ioas);
 
 /*
- * The emulated standard ops mean that vfio_device is going to use the
- * "mdev path" and will call vfio_pin_pages()/vfio_dma_rw(). Drivers using this
- * ops set should call vfio_register_emulated_iommu_dev(). Drivers that do
- * not call vfio_pin_pages()/vfio_dma_rw() have no need to provide dma_unmap.
+ * The emulated standard ops can be used by below usages:
+ * 1) The vfio_device that is going to use the "mdev path" and will call
+ *    vfio_pin_pages()/vfio_dma_rw().  Such drivers using should call
+ *    vfio_register_emulated_iommu_dev().  Drivers that do not call
+ *    vfio_pin_pages()/vfio_dma_rw() have no need to provide dma_unmap.
+ * 2) The noiommu device which doesn't have backend iommu but creating
+ *    an iommufd_access allows generating a dev_id for it. noiommu device
+ *    is not allowed to do map/unmap so this becomes a nop.
  */
 
 static void vfio_emulated_unmap(void *data, unsigned long iova,
@@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data, unsigned long iova,
 {
 	struct vfio_device *vdev = data;
 
-	if (vdev->ops->dma_unmap)
+	/* noiommu devices cannot do map/unmap */
+	if (vdev->noiommu && vdev->ops->dma_unmap)
 		vdev->ops->dma_unmap(vdev, iova, length);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 3/9] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset()
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
  2023-04-26 14:54 ` [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration Yi Liu
  2023-04-26 14:54 ` [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-26 14:54 ` [PATCH v4 4/9] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

this suits more on what the code does.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..f824de4dbf27 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1308,9 +1308,8 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	/*
-	 * For each group_fd, get the group through the vfio external user
-	 * interface and store the group and iommu ID.  This ensures the group
-	 * is held across the reset.
+	 * Get the group file for each fd to ensure the group is held across
+	 * the reset
 	 */
 	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
 		struct file *file = fget(group_fds[file_idx]);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 4/9] vfio/pci: Move the existing hot reset logic to be a helper
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (2 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 3/9] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:39   ` Tian, Kevin
  2023-04-26 14:54 ` [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device Yi Liu
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

This prepares to add another method for hot reset. The major hot reset logic
are moved to vfio_pci_ioctl_pci_hot_reset_groups().

No functional change is intended.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 55 +++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f824de4dbf27..39e7823088e7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1255,29 +1255,16 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	return ret;
 }
 
-static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
-					struct vfio_pci_hot_reset __user *arg)
+static int
+vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
+				    int array_count, bool slot,
+				    struct vfio_pci_hot_reset __user *arg)
 {
-	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
-	struct vfio_pci_hot_reset hdr;
 	int32_t *group_fds;
 	struct file **files;
 	struct vfio_pci_group_info info;
-	bool slot = false;
 	int file_idx, count = 0, ret = 0;
 
-	if (copy_from_user(&hdr, arg, minsz))
-		return -EFAULT;
-
-	if (hdr.argsz < minsz || hdr.flags)
-		return -EINVAL;
-
-	/* Can we do a slot or bus reset or neither? */
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return -ENODEV;
-
 	/*
 	 * We can't let userspace give us an arbitrarily large buffer to copy,
 	 * so verify how many we think there could be.  Note groups can have
@@ -1289,11 +1276,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return ret;
 
 	/* Somewhere between 1 and count is OK */
-	if (!hdr.count || hdr.count > count)
+	if (!array_count || array_count > count)
 		return -EINVAL;
 
-	group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
-	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
+	group_fds = kcalloc(array_count, sizeof(*group_fds), GFP_KERNEL);
+	files = kcalloc(array_count, sizeof(*files), GFP_KERNEL);
 	if (!group_fds || !files) {
 		kfree(group_fds);
 		kfree(files);
@@ -1301,7 +1288,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	if (copy_from_user(group_fds, arg->group_fds,
-			   hdr.count * sizeof(*group_fds))) {
+			   array_count * sizeof(*group_fds))) {
 		kfree(group_fds);
 		kfree(files);
 		return -EFAULT;
@@ -1311,7 +1298,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	 * Get the group file for each fd to ensure the group is held across
 	 * the reset
 	 */
-	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
+	for (file_idx = 0; file_idx < array_count; file_idx++) {
 		struct file *file = fget(group_fds[file_idx]);
 
 		if (!file) {
@@ -1335,7 +1322,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	if (ret)
 		goto hot_reset_release;
 
-	info.count = hdr.count;
+	info.count = array_count;
 	info.files = files;
 
 	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
@@ -1348,6 +1335,28 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
+	struct vfio_pci_hot_reset hdr;
+	bool slot = false;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz || hdr.flags)
+		return -EINVAL;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
+}
+
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				    struct vfio_device_ioeventfd __user *arg)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (3 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 4/9] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:40   ` Tian, Kevin
  2023-04-27 18:43   ` Alex Williamson
  2023-04-26 14:54 ` [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray Yi Liu
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

Use it to differentiate whether to report group_id or dev_id in revised
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. Though it is not set at this
moment introducing it now allows us to get hot reset ready for cdev.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/vfio.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 4ee613924435..298f4ef16be7 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -63,6 +63,7 @@ struct vfio_device {
 	bool iommufd_attached;
 #endif
 	bool noiommu;
+	bool cdev_opened;
 };
 
 /**
@@ -140,6 +141,12 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
 #endif
 
+static inline bool vfio_device_cdev_opened(struct vfio_device *device)
+{
+	lockdep_assert_held(&device->dev_set->lock);
+	return device->cdev_opened;
+}
+
 /**
  * struct vfio_migration_ops - VFIO bus device driver migration callbacks
  *
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (4 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:41   ` Tian, Kevin
  2023-04-26 14:54 ` [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

VFIO needs two reserved values. 0 is already reserved by initializing
xarray with XA_FLAGS_ALLOC1. This reserves -1 by limiting the xa alloc
range.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 3fbe636c3d8a..51b27c96c52f 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -28,6 +28,9 @@ struct iommufd_object_ops {
 static const struct iommufd_object_ops iommufd_object_ops[];
 static struct miscdevice vfio_misc_dev;
 
+/* -1 is reserved */
+#define iommufd_xa_limit_32b XA_LIMIT(0, (-2U))
+
 struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 					     size_t size,
 					     enum iommufd_object_type type)
@@ -50,7 +53,7 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 	 * before calling iommufd_object_finalize().
 	 */
 	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
-		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
+		      iommufd_xa_limit_32b, GFP_KERNEL_ACCOUNT);
 	if (rc)
 		goto out_free;
 	return obj;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (5 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:45   ` Tian, Kevin
  2023-04-26 14:54 ` [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev Yi Liu
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

This is needed by the vfio-pci driver to report affected devices in the
hot reset for a given device.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c | 24 ++++++++++++++++++++++++
 drivers/vfio/iommufd.c         | 24 ++++++++++++++++++++++++
 include/linux/iommufd.h        |  6 ++++++
 include/linux/vfio.h           | 14 ++++++++++++++
 4 files changed, 68 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 4b565a5b51da..40b52f1a071d 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -131,6 +131,18 @@ void iommufd_device_unbind(struct iommufd_device *idev)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
 
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
+{
+	return idev->ictx;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
+
+u32 iommufd_device_to_id(struct iommufd_device *idev)
+{
+	return idev->obj.id;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, IOMMUFD);
+
 static int iommufd_device_setup_msi(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt,
 				    phys_addr_t sw_msi_start)
@@ -480,6 +492,18 @@ void iommufd_access_destroy(struct iommufd_access *access)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_destroy, IOMMUFD);
 
+struct iommufd_ctx *iommufd_access_to_ictx(struct iommufd_access *access)
+{
+	return access->ictx;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_to_ictx, IOMMUFD);
+
+u32 iommufd_access_to_id(struct iommufd_access *access)
+{
+	return access->obj.id;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_to_id, IOMMUFD);
+
 int iommufd_access_attach(struct iommufd_access *access, u32 ioas_id)
 {
 	struct iommufd_ioas *new_ioas;
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index ca29c4feded3..54ac7be49b80 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -69,6 +69,30 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
 		vdev->ops->unbind_iommufd(vdev);
 }
 
+struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev)
+{
+	if (vdev->iommufd_device)
+		return iommufd_device_to_ictx(vdev->iommufd_device);
+	if (vdev->iommufd_access)
+		return iommufd_access_to_ictx(vdev->iommufd_access);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ictx);
+
+/*
+ * Return devid for devices that have been bound with iommufd,
+ * returns 0 if not bound yet.
+ */
+u32 vfio_iommufd_physical_devid(struct vfio_device *vdev)
+{
+	if (WARN_ON(!vdev->iommufd_device && !vdev->iommufd_access))
+		return 0;
+	if (vdev->iommufd_device)
+		return iommufd_device_to_id(vdev->iommufd_device);
+	else
+		return iommufd_access_to_id(vdev->iommufd_access);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);
 /*
  * The physical standard ops mean that the iommufd_device is bound to the
  * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1129a36a74c4..68cd65274e28 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -24,6 +24,9 @@ void iommufd_device_unbind(struct iommufd_device *idev);
 int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
 void iommufd_device_detach(struct iommufd_device *idev);
 
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
+u32 iommufd_device_to_id(struct iommufd_device *idev);
+
 struct iommufd_access_ops {
 	u8 needs_pin_pages : 1;
 	void (*unmap)(void *data, unsigned long iova, unsigned long length);
@@ -45,6 +48,9 @@ iommufd_access_create(struct iommufd_ctx *ictx,
 void iommufd_access_destroy(struct iommufd_access *access);
 int iommufd_access_attach(struct iommufd_access *access, u32 ioas_id);
 
+struct iommufd_ctx *iommufd_access_to_ictx(struct iommufd_access *access);
+u32 iommufd_access_to_id(struct iommufd_access *access);
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 298f4ef16be7..cead7aebd527 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -116,6 +116,8 @@ struct vfio_device_ops {
 };
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev);
+u32 vfio_iommufd_physical_devid(struct vfio_device *vdev);
 int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
@@ -125,6 +127,18 @@ int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
 int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 #else
+static inline struct iommufd_ctx *
+vfio_iommufd_physical_ictx(struct vfio_device *vdev)
+{
+	return NULL;
+}
+
+static inline u32
+vfio_iommufd_physical_devid(struct vfio_device *vdev)
+{
+	return 0;
+}
+
 #define vfio_iommufd_physical_bind                                      \
 	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
 		  u32 *out_device_id)) NULL)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (6 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:51   ` Tian, Kevin
  2023-04-27 20:04   ` Alex Williamson
  2023-04-26 14:54 ` [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

This makes VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl to use the bound
iommufd of the cdev device to check the ownership of the other affected
devices and set a flag to tell user if the cdev device is resettable
with a zero-length fd array.

For each of the affected devices, if it is bound to the iommufd of the
cdev device, _INFO reports a valid dev_id > 0; if it is not opened by
the calling user, but it is in the iommu_group of a device that is bound
to the iommufd of the cdev device, reports dev_id == 0; If the device is
un-owned device, configured within a different iommufd, or opened outside
of the vfio device cdev API, the _INFO ioctl shall report dev_id==-1 for
such affected devices. dev_id >=0 doesn't block hot-reset, while
dev_id == -1 will block hot-reset.

This adds flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID to tell the user
dev_id is returned and adds flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE to
tell user if the cdev device is resettable or not.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 101 ++++++++++++++++++++++++++++---
 include/uapi/linux/vfio.h        |  39 +++++++++++-
 2 files changed, 132 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 39e7823088e7..43858d471447 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -766,6 +766,51 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	return 0;
 }
 
+static struct vfio_device *
+vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
+			       struct pci_dev *pdev)
+{
+	struct vfio_device *cur;
+
+	lockdep_assert_held(&dev_set->lock);
+
+	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
+		if (cur->dev == &pdev->dev)
+			return cur;
+	return NULL;
+}
+
+/*
+ * Check if a given iommu_group has been bound to an iommufd within a
+ * devset.  Returns true if there is device in the devset which is in
+ * the input iommu_group and meanwhile bound to the input iommufd.
+ * Otherwise, returns false.
+ */
+static bool
+vfio_devset_iommufd_has_group(struct vfio_device_set *dev_set,
+			      struct iommufd_ctx *iommufd,
+			      struct iommu_group *iommu_group)
+{
+	struct vfio_device *cur;
+	struct iommu_group *grp;
+	bool found = false;
+
+	lockdep_assert_held(&dev_set->lock);
+
+	list_for_each_entry(cur, &dev_set->device_list, dev_set_list) {
+		grp = iommu_group_get(cur->dev);
+		if (!grp)
+			continue;
+		iommu_group_put(grp);
+		if (iommu_group == grp &&
+		    iommufd == vfio_iommufd_physical_ictx(cur)) {
+			found = true;
+			break;
+		}
+	}
+	return found;
+}
+
 static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 {
 	(*(int *)data)++;
@@ -776,13 +821,20 @@ struct vfio_pci_fill_info {
 	int max;
 	int cur;
 	struct vfio_pci_dependent_device *devices;
+	struct vfio_device *vdev;
+	bool devid;
+	bool resettable;
 };
 
 static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_pci_fill_info *fill = data;
+	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(fill->vdev);
+	struct vfio_device_set *dev_set = fill->vdev->dev_set;
 	struct iommu_group *iommu_group;
 
+	lockdep_assert_held(&dev_set->lock);
+
 	if (fill->cur == fill->max)
 		return -EAGAIN; /* Something changed, try again */
 
@@ -790,7 +842,34 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 	if (!iommu_group)
 		return -EPERM; /* Cannot reset non-isolated devices */
 
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	if (fill->devid) {
+		struct vfio_device *vdev;
+
+		/*
+		 * Report devid for the affected devices:
+		 * - valid devid > 0 for the devices that are bound with
+		 *   the iommufd of the calling device.
+		 * - devid == 0 for the devices that have not been opened
+		 *   but have same group with one of the devices bound to
+		 *   the iommufd of the calling device.
+		 * - devid == -1 for others, and clear resettable flag.
+		 */
+		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
+		if (vdev && iommufd == vfio_iommufd_physical_ictx(vdev)) {
+			fill->devices[fill->cur].dev_id =
+						vfio_iommufd_physical_devid(vdev);
+			if (unlikely(!fill->devices[fill->cur].dev_id))
+				return -EINVAL;
+		} else if (vfio_devset_iommufd_has_group(dev_set, iommufd,
+							 iommu_group)) {
+			fill->devices[fill->cur].dev_id = VFIO_PCI_DEVID_NONBLOCKING;
+		} else {
+			fill->devices[fill->cur].dev_id = VFIO_PCI_DEVID_BLOCKING;
+			fill->resettable = false;
+		}
+	} else {
+		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	}
 	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
 	fill->devices[fill->cur].bus = pdev->bus->number;
 	fill->devices[fill->cur].devfn = pdev->devfn;
@@ -1229,17 +1308,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 		return -ENOMEM;
 
 	fill.devices = devices;
+	fill.vdev = &vdev->vdev;
 
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	fill.devid = fill.resettable = vfio_device_cdev_opened(&vdev->vdev);
 	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
 					    &fill, slot);
+	mutex_unlock(&vdev->vdev.dev_set->lock);
 
 	/*
 	 * If a device was removed between counting and filling, we may come up
 	 * short of fill.max.  If a device was added, we'll have a return of
 	 * -EAGAIN above.
 	 */
-	if (!ret)
+	if (!ret) {
 		hdr.count = fill.cur;
+		if (fill.devid) {
+			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
+			if (fill.resettable)
+				hdr.flags |= VFIO_PCI_HOT_RESET_FLAG_RESETTABLE;
+		}
+	}
 
 reset_info_exit:
 	if (copy_to_user(arg, &hdr, minsz))
@@ -2335,12 +2424,10 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
 static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
 {
 	struct vfio_device_set *dev_set = data;
-	struct vfio_device *cur;
 
-	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
-		if (cur->dev == &pdev->dev)
-			return 0;
-	return -EBUSY;
+	lockdep_assert_held(&dev_set->lock);
+
+	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
 }
 
 /*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0552e8dcf0cb..4b4e2c28984b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -650,11 +650,46 @@ enum {
  * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
  *					      struct vfio_pci_hot_reset_info)
  *
+ * This command is used to query the affected devices in the hot reset for
+ * a given device.
+ *
+ * This command always reports the segment, bus, and devfn information for
+ * each affected device, and selectively reports the group_id or dev_id per
+ * the way how the calling device is opened.
+ *
+ *	- If the calling device is opened via the traditional group/container
+ *	  API, group_id is reported.  User should check if it has owned all
+ *	  the affected devices and provides a set of group fds to prove the
+ *	  ownership in VFIO_DEVICE_PCI_HOT_RESET ioctl.
+ *
+ *	- If the calling device is opened as a cdev, dev_id is reported.
+ *	  Flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID would be set.  Flag
+ *	  VFIO_PCI_HOT_RESET_FLAG_RESETTABLE would be set per the ownership
+ *	  of the other affected devices.  If it is set, the user could invoke
+ *	  VFIO_DEVICE_PCI_HOT_RESET with a zero-length fd array.  Kernel
+ *	  set this flag when all the affected devices are owned by the user.
+ *	  This flag is available only VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID
+ *	  is set, otherwise ignored.  For a given affected device, it is owned
+ *	  if it suits one of the below cases:
+ *		1) bound to the same iommufd_ctx with the calling device
+ *		2) has not been bound to iommufd_ctx, but it is within the
+ *		   iommu_group of an owned device.
+ *	  For 1), the dev_id > 0, for 2) dev_id == 0. Otherwise, dev_id == -1.
+ *
+ * If the affected devices of a calling device span into multiple iommufds
+ * or opened by different APIs (group/container or cdev), hot-reset on
+ * this device would be rejected.
+ *
  * Return: 0 on success, -errno on failure:
  *	-enospc = insufficient buffer, -enodev = unsupported for device.
  */
 struct vfio_pci_dependent_device {
-	__u32	group_id;
+	union {
+		__u32   group_id;
+		__u32	dev_id;
+#define VFIO_PCI_DEVID_NONBLOCKING	0
+#define VFIO_PCI_DEVID_BLOCKING	-1
+	};
 	__u16	segment;
 	__u8	bus;
 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
@@ -663,6 +698,8 @@ struct vfio_pci_dependent_device {
 struct vfio_pci_hot_reset_info {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
+#define VFIO_PCI_HOT_RESET_FLAG_RESETTABLE	(1 << 1)
 	__u32	count;
 	struct vfio_pci_dependent_device	devices[];
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (7 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev Yi Liu
@ 2023-04-26 14:54 ` Yi Liu
  2023-04-27  6:54   ` Tian, Kevin
  2023-04-27 21:55   ` Alex Williamson
  2023-04-26 15:07 ` [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Liu, Yi L
  2023-04-28  9:28 ` Jiang, Yanting
  10 siblings, 2 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-26 14:54 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, cohuck, eric.auger, nicolinc, kvm, mjrosato,
	chao.p.peng, yi.l.liu, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

This is the way user to invoke hot-reset for the devices opened by cdev
interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE
in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing
hot-reset for cdev devices.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 66 +++++++++++++++++++++++++++-----
 include/uapi/linux/vfio.h        | 22 +++++++++++
 2 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 43858d471447..f70e3b948b16 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 struct vfio_pci_group_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups);
+				      struct vfio_pci_group_info *groups,
+				      struct iommufd_ctx *iommufd_ctx);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1364,8 +1365,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 	if (ret)
 		return ret;
 
-	/* Somewhere between 1 and count is OK */
-	if (!array_count || array_count > count)
+	if (array_count > count)
 		return -EINVAL;
 
 	group_fds = kcalloc(array_count, sizeof(*group_fds), GFP_KERNEL);
@@ -1414,7 +1414,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 	info.count = array_count;
 	info.files = files;
 
-	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
+	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
 
 hot_reset_release:
 	for (file_idx--; file_idx >= 0; file_idx--)
@@ -1429,6 +1429,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 {
 	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
 	struct vfio_pci_hot_reset hdr;
+	struct iommufd_ctx *iommufd;
 	bool slot = false;
 
 	if (copy_from_user(&hdr, arg, minsz))
@@ -1443,7 +1444,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	else if (pci_probe_reset_bus(vdev->pdev->bus))
 		return -ENODEV;
 
-	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
+	if (hdr.count)
+		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
+
+	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+
+	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
 }
 
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
@@ -2415,6 +2421,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
 {
 	unsigned int i;
 
+	if (!groups)
+		return false;
+
 	for (i = 0; i < groups->count; i++)
 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
 			return true;
@@ -2488,13 +2497,38 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
+static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
+				    struct iommufd_ctx *iommufd_ctx)
+{
+	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	struct iommu_group *iommu_group;
+
+	if (!iommufd_ctx)
+		return false;
+
+	if (iommufd == iommufd_ctx)
+		return true;
+
+	iommu_group = iommu_group_get(vdev->vdev.dev);
+	if (!iommu_group)
+		return false;
+
+	/*
+	 * Try to check if any device within iommu_group is bound with
+	 * the input iommufd_ctx.
+	 */
+	return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
+					     iommufd_ctx, iommu_group);
+}
+
 /*
  * We need to get memory_lock for each device, but devices can share mmap_lock,
  * therefore we need to zap and hold the vma_lock for each device, and only then
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups)
+				      struct vfio_pci_group_info *groups,
+				      struct iommufd_ctx *iommufd_ctx)
 {
 	struct vfio_pci_core_device *cur_mem;
 	struct vfio_pci_core_device *cur_vma;
@@ -2525,10 +2559,24 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
-		 * Test whether all the affected devices are contained by the
-		 * set of groups provided by the user.
+		 * Test whether all the affected devices can be reset by the
+		 * user.
+		 *
+		 * If user provides a set of groups, all the opened devices
+		 * in the dev_set should be contained by the set of groups
+		 * provided by the user.
+		 *
+		 * If user provides a zero-length group fd array, then all
+		 * the affected devices must be bound to same iommufd_ctx as
+		 * the input iommufd_ctx.  If there is device that has not
+		 * been bound to iommufd_ctx yet, shall check if there is any
+		 * device within its iommu_group that has been bound to the
+		 * input iommufd_ctx.
+		 *
+		 * Otherwise, reset is not allowed.
 		 */
-		if (!vfio_dev_in_groups(cur_vma, groups)) {
+		if (!vfio_dev_in_groups(cur_vma, groups) &&
+		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 4b4e2c28984b..1241d02d8701 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -710,6 +710,28 @@ struct vfio_pci_hot_reset_info {
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)
  *
+ * Userspace requests hot reset for the devices it operates.  Due to the
+ * underlying topology, multiple devices can be affected in the reset
+ * while some might be opened by another user.  To avoid interference
+ * the calling user must ensure all affected devices are owned by itself.
+ * The ownership proof needs to refer the output of
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.  Ownership can be proved as:
+ *
+ *   1) An array of group fds - This is used for the devices opened via
+ *				the group/container interface.
+ *   2) A zero-length array - This is used for the devices opened via
+ *			      the cdev interface.  User should check the
+ *			      flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID
+ *			      and flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE
+ *			      before using this method.
+ *
+ * In case a non void group fd array is passed, the devices affected by
+ * the reset must belong to those opened VFIO groups.  In case a zero
+ * length array is passed, the other devices affected by the reset, if
+ * any, must be either bound to the same iommufd as this VFIO device or
+ * in the same iommu_group with a device that does.  Either of the two
+ * methods is applied to check the feasibility of the hot reset.
+ *
  * Return: 0 on success, -errno on failure.
  */
 struct vfio_pci_hot_reset {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (8 preceding siblings ...)
  2023-04-26 14:54 ` [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
@ 2023-04-26 15:07 ` Liu, Yi L
  2023-04-28  9:28 ` Jiang, Yanting
  10 siblings, 0 replies; 50+ messages in thread
From: Liu, Yi L @ 2023-04-26 15:07 UTC (permalink / raw)
  To: alex.williamson@redhat.com, jgg@nvidia.com, Tian, Kevin
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> 
> VFIO_DEVICE_PCI_HOT_RESET requires user to pass an array of group fds
> to prove that it owns all devices affected by resetting the calling
> device. While for cdev devices, user can use an iommufd-based ownership
> checking model and invoke VFIO_DEVICE_PCI_HOT_RESET with a zero-length
> fd array.
> 
> This series first creates iommufd_access for noiommu devices to fill the
> gap for adding iommufd-based ownership checking model, then extends
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to check ownership and return the
> check result and the devid of affected devices to user. In the end, extends
> the VFIO_DEVICE_PCI_HOT_RESET to accept zero-length fd array for hot-reset
> with cdev devices.
> 
> The new hot reset method and updated _INFO ioctl are tested with the
> below qemu:
> 
> https://github.com/yiliu1765/qemu/tree/iommufd_rfcv4.mig.reset.v4_var3
> (requires to test with the cdev kernel)

The cdev kernel is below branch, this series is part of below branch.

https://github.com/yiliu1765/iommufd/tree/vfio_device_cdev_v10

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration
  2023-04-26 14:54 ` [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration Yi Liu
@ 2023-04-27  6:36   ` Tian, Kevin
  2023-04-27  7:05     ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:36 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> 
> -static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> +static inline int vfio_device_set_noiommu(struct vfio_device *device)
>  {
> -	return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> -	       vdev->group->type == VFIO_NO_IOMMU;
> +	device->noiommu = IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> +			  device->group->type == VFIO_NO_IOMMU;
> +	return 0;

Just void. this can't fail.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-26 14:54 ` [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices Yi Liu
@ 2023-04-27  6:39   ` Tian, Kevin
  2023-04-27  6:59     ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:39 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> @@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data,
> unsigned long iova,
>  {
>  	struct vfio_device *vdev = data;
> 
> -	if (vdev->ops->dma_unmap)
> +	/* noiommu devices cannot do map/unmap */
> +	if (vdev->noiommu && vdev->ops->dma_unmap)
>  		vdev->ops->dma_unmap(vdev, iova, length);

Is it necessary? All mdev devices implementing @dma_unmap won't
set noiommu flag.

Instead in the future if we allow noiommu userspace to pin pages
we'd need similar logic too.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 4/9] vfio/pci: Move the existing hot reset logic to be a helper
  2023-04-26 14:54 ` [PATCH v4 4/9] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
@ 2023-04-27  6:39   ` Tian, Kevin
  0 siblings, 0 replies; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:39 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> 
> This prepares to add another method for hot reset. The major hot reset logic
> are moved to vfio_pci_ioctl_pci_hot_reset_groups().
> 
> No functional change is intended.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device
  2023-04-26 14:54 ` [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device Yi Liu
@ 2023-04-27  6:40   ` Tian, Kevin
  2023-04-27 18:43   ` Alex Williamson
  1 sibling, 0 replies; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:40 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> 
> Use it to differentiate whether to report group_id or dev_id in revised
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. Though it is not set at this
> moment introducing it now allows us to get hot reset ready for cdev.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray
  2023-04-26 14:54 ` [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray Yi Liu
@ 2023-04-27  6:41   ` Tian, Kevin
  2023-04-27  7:09     ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:41 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> 
> VFIO needs two reserved values. 0 is already reserved by initializing
> xarray with XA_FLAGS_ALLOC1. This reserves -1 by limiting the xa alloc
> range.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommufd/main.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommufd/main.c
> b/drivers/iommu/iommufd/main.c
> index 3fbe636c3d8a..51b27c96c52f 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -28,6 +28,9 @@ struct iommufd_object_ops {
>  static const struct iommufd_object_ops iommufd_object_ops[];
>  static struct miscdevice vfio_misc_dev;
> 
> +/* -1 is reserved */
> +#define iommufd_xa_limit_32b XA_LIMIT(0, (-2U))
> +
>  struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
>  					     size_t size,
>  					     enum iommufd_object_type type)
> @@ -50,7 +53,7 @@ struct iommufd_object *_iommufd_object_alloc(struct
> iommufd_ctx *ictx,
>  	 * before calling iommufd_object_finalize().
>  	 */
>  	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
> -		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
> +		      iommufd_xa_limit_32b, GFP_KERNEL_ACCOUNT);

Just direct use XA_LIMIT() here.

btw do we need a contract so vfio can learn 0 and -1 are reserved or
fine to have a fixed assumption in later patches?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-26 14:54 ` [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
@ 2023-04-27  6:45   ` Tian, Kevin
  2023-04-27  7:15     ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:45 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: mjrosato@linux.ibm.com, jasowang@redhat.com, Hao, Xudong,
	Duan, Zhenzhong, peterx@redhat.com, Xu, Terrence,
	chao.p.peng@linux.intel.com, linux-s390@vger.kernel.org,
	Liu, Yi L, kvm@vger.kernel.org, lulu@redhat.com, Jiang, Yanting,
	joro@8bytes.org, nicolinc@nvidia.com, Zhao, Yan Y,
	intel-gfx@lists.freedesktop.org, eric.auger@redhat.com,
	intel-gvt-dev@lists.freedesktop.org, yi.y.sun@linux.intel.com,
	cohuck@redhat.com, shameerali.kolothum.thodi@huawei.com,
	suravee.suthikulpanit@amd.com, robin.murphy@arm.com

> From: Yi Liu
> Sent: Wednesday, April 26, 2023 10:54 PM
> +
> +/*
> + * Return devid for devices that have been bound with iommufd,
> + * returns 0 if not bound yet.
> + */
> +u32 vfio_iommufd_physical_devid(struct vfio_device *vdev)
> +{
> +     if (WARN_ON(!vdev->iommufd_device && !vdev->iommufd_access))
> +             return 0;

is WARN_ON too restrictive?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
  2023-04-26 14:54 ` [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev Yi Liu
@ 2023-04-27  6:51   ` Tian, Kevin
  2023-04-27 20:04   ` Alex Williamson
  1 sibling, 0 replies; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:51 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
> +
> +/*
> + * Check if a given iommu_group has been bound to an iommufd within a
> + * devset.  Returns true if there is device in the devset which is in
> + * the input iommu_group and meanwhile bound to the input iommufd.
> + * Otherwise, returns false.
> + */
> +static bool
> +vfio_devset_iommufd_has_group(struct vfio_device_set *dev_set,
> +			      struct iommufd_ctx *iommufd,
> +			      struct iommu_group *iommu_group)

this function name reads confusing. Probably just use
vfio_dev_in_iommufd_ctx() from next patch which sounds clearer
and open-code this into that helper.

> 
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->devid) {
> +		struct vfio_device *vdev;
> +
> +		/*
> +		 * Report devid for the affected devices:
> +		 * - valid devid > 0 for the devices that are bound with
> +		 *   the iommufd of the calling device.

">0" is inaccurate. All values except 0 and -1 are valid.

> +		 * - devid == 0 for the devices that have not been opened
> +		 *   but have same group with one of the devices bound to
> +		 *   the iommufd of the calling device.
> +		 * - devid == -1 for others, and clear resettable flag.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> +		if (vdev && iommufd == vfio_iommufd_physical_ictx(vdev)) {
> +			fill->devices[fill->cur].dev_id =
> +
> 	vfio_iommufd_physical_devid(vdev);
> +			if (unlikely(!fill->devices[fill->cur].dev_id))
> +				return -EINVAL;
> +		} else if (vfio_devset_iommufd_has_group(dev_set, iommufd,
> +							 iommu_group)) {
> +			fill->devices[fill->cur].dev_id =
> VFIO_PCI_DEVID_NONBLOCKING;
> +		} else {
> +			fill->devices[fill->cur].dev_id =
> VFIO_PCI_DEVID_BLOCKING;
> +			fill->resettable = false;

NONBLOCKING/BLOCKING is unclear in the context of devid.

since 0 and -1 are talked in all other places and above comment probably
it's clear enough to use plain values?

> +		}
> +	} else {
> +		fill->devices[fill->cur].group_id =
> iommu_group_id(iommu_group);
> +	}

move iommu_group_get/put() into 'else' branch.

when calling vfio_dev_in_iommufd_ctx() in the devid branch the group
will be only used in 'else' in this function.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-26 14:54 ` [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
@ 2023-04-27  6:54   ` Tian, Kevin
  2023-04-27  7:02     ` Liu, Yi L
  2023-04-27 21:55   ` Alex Williamson
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2023-04-27  6:54 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 26, 2023 10:54 PM
>
> +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> +				    struct iommufd_ctx *iommufd_ctx)
> +{
> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev-
> >vdev);
> +	struct iommu_group *iommu_group;
> +
> +	if (!iommufd_ctx)
> +		return false;
> +
> +	if (iommufd == iommufd_ctx)
> +		return true;
> +
> +	iommu_group = iommu_group_get(vdev->vdev.dev);
> +	if (!iommu_group)
> +		return false;
> +
> +	/*
> +	 * Try to check if any device within iommu_group is bound with
> +	 * the input iommufd_ctx.
> +	 */
> +	return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
> +					     iommufd_ctx, iommu_group);

iommu_group is not put.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-27  6:39   ` Tian, Kevin
@ 2023-04-27  6:59     ` Liu, Yi L
  2023-04-27 18:32       ` Alex Williamson
  0 siblings, 1 reply; 50+ messages in thread
From: Liu, Yi L @ 2023-04-27  6:59 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, April 27, 2023 2:39 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 26, 2023 10:54 PM
> > @@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data,
> > unsigned long iova,
> >  {
> >  	struct vfio_device *vdev = data;
> >
> > -	if (vdev->ops->dma_unmap)
> > +	/* noiommu devices cannot do map/unmap */
> > +	if (vdev->noiommu && vdev->ops->dma_unmap)
> >  		vdev->ops->dma_unmap(vdev, iova, length);
> 
> Is it necessary? All mdev devices implementing @dma_unmap won't
> set noiommu flag.

Hmmm. Yes, and all the devices set noiommu is not implementing @dma_unmap
as far as I see. Maybe this noiommu check can be removed.

> 
> Instead in the future if we allow noiommu userspace to pin pages
> we'd need similar logic too.

I'm not quite sure about it so far. For mdev devices, the device driver
may use vfio_pin_pages/vfio_dma_rw () to pin page. Hence such drivers
need to listen to dma_unmap() event. But for noiommu users, does the
device driver also participate in the page pin? At least for vfio-pci driver,
it does not, or maybe it will in the future when enabling noiommu
userspace to pin pages. It looks to me such userspace should order
the DMA before calling ioctl to unpin page instead of letting device
driver listen to unmap.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-27  6:54   ` Tian, Kevin
@ 2023-04-27  7:02     ` Liu, Yi L
  0 siblings, 0 replies; 50+ messages in thread
From: Liu, Yi L @ 2023-04-27  7:02 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, April 27, 2023 2:54 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 26, 2023 10:54 PM
> >
> > +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> > +				    struct iommufd_ctx *iommufd_ctx)
> > +{
> > +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev-
> > >vdev);
> > +	struct iommu_group *iommu_group;
> > +
> > +	if (!iommufd_ctx)
> > +		return false;
> > +
> > +	if (iommufd == iommufd_ctx)
> > +		return true;
> > +
> > +	iommu_group = iommu_group_get(vdev->vdev.dev);
> > +	if (!iommu_group)
> > +		return false;
> > +
> > +	/*
> > +	 * Try to check if any device within iommu_group is bound with
> > +	 * the input iommufd_ctx.
> > +	 */
> > +	return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
> > +					     iommufd_ctx, iommu_group);
> 
> iommu_group is not put.

Oops. Yes.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration
  2023-04-27  6:36   ` Tian, Kevin
@ 2023-04-27  7:05     ` Liu, Yi L
  2023-04-27 18:35       ` Alex Williamson
  0 siblings, 1 reply; 50+ messages in thread
From: Liu, Yi L @ 2023-04-27  7:05 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, April 27, 2023 2:36 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 26, 2023 10:54 PM
> >
> > -static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > +static inline int vfio_device_set_noiommu(struct vfio_device *device)
> >  {
> > -	return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > -	       vdev->group->type == VFIO_NO_IOMMU;
> > +	device->noiommu = IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > +			  device->group->type == VFIO_NO_IOMMU;
> > +	return 0;
> 
> Just void. this can't fail.

Hmmm. Yes, before below commit, it cannot fail. Maybe this can be
converted to int later.

https://lore.kernel.org/kvm/20230426150321.454465-22-yi.l.liu@intel.com/T/#u

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray
  2023-04-27  6:41   ` Tian, Kevin
@ 2023-04-27  7:09     ` Liu, Yi L
  2023-04-27 11:55       ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Liu, Yi L @ 2023-04-27  7:09 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, April 27, 2023 2:42 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 26, 2023 10:54 PM
> >
> > VFIO needs two reserved values. 0 is already reserved by initializing
> > xarray with XA_FLAGS_ALLOC1. This reserves -1 by limiting the xa alloc
> > range.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/iommufd/main.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/iommufd/main.c
> > b/drivers/iommu/iommufd/main.c
> > index 3fbe636c3d8a..51b27c96c52f 100644
> > --- a/drivers/iommu/iommufd/main.c
> > +++ b/drivers/iommu/iommufd/main.c
> > @@ -28,6 +28,9 @@ struct iommufd_object_ops {
> >  static const struct iommufd_object_ops iommufd_object_ops[];
> >  static struct miscdevice vfio_misc_dev;
> >
> > +/* -1 is reserved */
> > +#define iommufd_xa_limit_32b XA_LIMIT(0, (-2U))
> > +
> >  struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
> >  					     size_t size,
> >  					     enum iommufd_object_type type)
> > @@ -50,7 +53,7 @@ struct iommufd_object *_iommufd_object_alloc(struct
> > iommufd_ctx *ictx,
> >  	 * before calling iommufd_object_finalize().
> >  	 */
> >  	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
> > -		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
> > +		      iommufd_xa_limit_32b, GFP_KERNEL_ACCOUNT);
> 
> Just direct use XA_LIMIT() here.

Ok.

> btw do we need a contract so vfio can learn 0 and -1 are reserved or
> fine to have a fixed assumption in later patches?

I doubt how to do it. ☹ @Jason? What about your opinion?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-27  6:45   ` Tian, Kevin
@ 2023-04-27  7:15     ` Liu, Yi L
  0 siblings, 0 replies; 50+ messages in thread
From: Liu, Yi L @ 2023-04-27  7:15 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson@redhat.com, jgg@nvidia.com
  Cc: mjrosato@linux.ibm.com, jasowang@redhat.com, Hao, Xudong,
	Duan, Zhenzhong, peterx@redhat.com, Xu, Terrence,
	chao.p.peng@linux.intel.com, linux-s390@vger.kernel.org,
	kvm@vger.kernel.org, lulu@redhat.com, Jiang, Yanting,
	joro@8bytes.org, nicolinc@nvidia.com, Zhao, Yan Y,
	intel-gfx@lists.freedesktop.org, eric.auger@redhat.com,
	intel-gvt-dev@lists.freedesktop.org, yi.y.sun@linux.intel.com,
	cohuck@redhat.com, shameerali.kolothum.thodi@huawei.com,
	suravee.suthikulpanit@amd.com, robin.murphy@arm.com

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, April 27, 2023 2:46 PM
> 
> > From: Yi Liu
> > Sent: Wednesday, April 26, 2023 10:54 PM
> > +
> > +/*
> > + * Return devid for devices that have been bound with iommufd,
> > + * returns 0 if not bound yet.
> > + */
> > +u32 vfio_iommufd_physical_devid(struct vfio_device *vdev)
> > +{
> > +	if (WARN_ON(!vdev->iommufd_device && !vdev->iommufd_access))
> > +		return 0;
> 
> is WARN_ON too restrictive?

This originated from a comment from Eric[1]. At that time, this helper is
void type, hence there is no message when there is no devid. Now, this returns
0 if the device is not bound. Maybe checking it in the caller and warn on
there?

[1] https://lore.kernel.org/kvm/702c2883-1d51-b609-1e99-337295e6e307@redhat.com/

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray
  2023-04-27  7:09     ` Liu, Yi L
@ 2023-04-27 11:55       ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2023-04-27 11:55 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, alex.williamson@redhat.com, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Thu, Apr 27, 2023 at 07:09:38AM +0000, Liu, Yi L wrote:
> > > @@ -50,7 +53,7 @@ struct iommufd_object *_iommufd_object_alloc(struct
> > > iommufd_ctx *ictx,
> > >  	 * before calling iommufd_object_finalize().
> > >  	 */
> > >  	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
> > > -		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
> > > +		      iommufd_xa_limit_32b, GFP_KERNEL_ACCOUNT);
> > 
> > Just direct use XA_LIMIT() here.
> 
> Ok.
> 
> > btw do we need a contract so vfio can learn 0 and -1 are reserved or
> > fine to have a fixed assumption in later patches?
> 
> I doubt how to do it. ☹ @Jason? What about your opinion?

It is probably fine to use xa_limit_31b and just say that -ve values
are reserved in a comment near the define for 0

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-27  6:59     ` Liu, Yi L
@ 2023-04-27 18:32       ` Alex Williamson
  2023-04-28  6:21         ` Yi Liu
  2023-04-28 16:13         ` Yi Liu
  0 siblings, 2 replies; 50+ messages in thread
From: Alex Williamson @ 2023-04-27 18:32 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, jgg@nvidia.com, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Thu, 27 Apr 2023 06:59:17 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Thursday, April 27, 2023 2:39 PM
> >   
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Wednesday, April 26, 2023 10:54 PM
> > > @@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data,
> > > unsigned long iova,
> > >  {
> > >  	struct vfio_device *vdev = data;
> > >
> > > -	if (vdev->ops->dma_unmap)
> > > +	/* noiommu devices cannot do map/unmap */
> > > +	if (vdev->noiommu && vdev->ops->dma_unmap)
> > >  		vdev->ops->dma_unmap(vdev, iova, length);  
> > 
> > Is it necessary? All mdev devices implementing @dma_unmap won't
> > set noiommu flag.  
> 
> Hmmm. Yes, and all the devices set noiommu is not implementing @dma_unmap
> as far as I see. Maybe this noiommu check can be removed.

Not to mention that the polarity of the noiommu test is backwards here!
This also seems to be the only performance path where noiommu is tested
and therefore I believe the only actual justification of the previous
patch.
 
> > Instead in the future if we allow noiommu userspace to pin pages
> > we'd need similar logic too.  
> 
> I'm not quite sure about it so far. For mdev devices, the device driver
> may use vfio_pin_pages/vfio_dma_rw () to pin page. Hence such drivers
> need to listen to dma_unmap() event. But for noiommu users, does the
> device driver also participate in the page pin? At least for vfio-pci driver,
> it does not, or maybe it will in the future when enabling noiommu
> userspace to pin pages. It looks to me such userspace should order
> the DMA before calling ioctl to unpin page instead of letting device
> driver listen to unmap.

Whoa, noiommu is inherently unsafe an only meant to expose the vfio
device interface for userspace drivers that are going to do unsafe
things regardless.  Enabling noiommu to work with mdev, pin pages, or
anything else should not be on our agenda.  Userspaces relying on niommu
get the minimum viable interface and must impose a minuscule
incremental maintenance burden.  The only reason we're spending so much
effort on it here is to make iommufd noiommu support equivalent to
group/container noiommu support.  We should stop at that.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration
  2023-04-27  7:05     ` Liu, Yi L
@ 2023-04-27 18:35       ` Alex Williamson
  0 siblings, 0 replies; 50+ messages in thread
From: Alex Williamson @ 2023-04-27 18:35 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, jgg@nvidia.com, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Thu, 27 Apr 2023 07:05:37 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Thursday, April 27, 2023 2:36 PM
> >   
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Wednesday, April 26, 2023 10:54 PM
> > >
> > > -static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > +static inline int vfio_device_set_noiommu(struct vfio_device *device)
> > >  {
> > > -	return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > -	       vdev->group->type == VFIO_NO_IOMMU;
> > > +	device->noiommu = IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > +			  device->group->type == VFIO_NO_IOMMU;
> > > +	return 0;  
> > 
> > Just void. this can't fail.  
> 
> Hmmm. Yes, before below commit, it cannot fail. Maybe this can be
> converted to int later.
> 
> https://lore.kernel.org/kvm/20230426150321.454465-22-yi.l.liu@intel.com/T/#u

AFAICT with the comments on the next patch, this change is not at all
justified within this series and should be dropped.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device
  2023-04-26 14:54 ` [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device Yi Liu
  2023-04-27  6:40   ` Tian, Kevin
@ 2023-04-27 18:43   ` Alex Williamson
  2023-04-28  6:42     ` Yi Liu
  1 sibling, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2023-04-27 18:43 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, cohuck, eric.auger, nicolinc,
	kvm, mjrosato, chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

On Wed, 26 Apr 2023 07:54:15 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> Use it to differentiate whether to report group_id or dev_id in revised
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. Though it is not set at this
> moment introducing it now allows us to get hot reset ready for cdev.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/linux/vfio.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 4ee613924435..298f4ef16be7 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -63,6 +63,7 @@ struct vfio_device {
>  	bool iommufd_attached;
>  #endif
>  	bool noiommu;
> +	bool cdev_opened;
>  };
>  
>  /**
> @@ -140,6 +141,12 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
>  	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
>  #endif
>  
> +static inline bool vfio_device_cdev_opened(struct vfio_device *device)
> +{
> +	lockdep_assert_held(&device->dev_set->lock);
> +	return device->cdev_opened;
> +}

The lockdep test doesn't make much sense here, the method of opening
the device can't change from an ioctl called on the device, which is
the only path using this.  If there needs to be a placeholder for
future code, it should probably statically return false here and we can
save adding the structure field and locking checks based on the
semantics of how the field is actually used later.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
  2023-04-26 14:54 ` [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev Yi Liu
  2023-04-27  6:51   ` Tian, Kevin
@ 2023-04-27 20:04   ` Alex Williamson
  2023-04-27 20:15     ` Alex Williamson
  1 sibling, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2023-04-27 20:04 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, cohuck, eric.auger, nicolinc,
	kvm, mjrosato, chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

On Wed, 26 Apr 2023 07:54:18 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> This makes VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl to use the bound
> iommufd of the cdev device to check the ownership of the other affected
> devices and set a flag to tell user if the cdev device is resettable
> with a zero-length fd array.
> 
> For each of the affected devices, if it is bound to the iommufd of the
> cdev device, _INFO reports a valid dev_id > 0; if it is not opened by
> the calling user, but it is in the iommu_group of a device that is bound
> to the iommufd of the cdev device, reports dev_id == 0; If the device is
> un-owned device, configured within a different iommufd, or opened outside
> of the vfio device cdev API, the _INFO ioctl shall report dev_id==-1 for
> such affected devices. dev_id >=0 doesn't block hot-reset, while
> dev_id == -1 will block hot-reset.
> 
> This adds flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID to tell the user
> dev_id is returned and adds flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE to
> tell user if the cdev device is resettable or not.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 101 ++++++++++++++++++++++++++++---
>  include/uapi/linux/vfio.h        |  39 +++++++++++-
>  2 files changed, 132 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 39e7823088e7..43858d471447 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -766,6 +766,51 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	return 0;
>  }
>  
> +static struct vfio_device *
> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> +			       struct pci_dev *pdev)
> +{
> +	struct vfio_device *cur;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		if (cur->dev == &pdev->dev)
> +			return cur;
> +	return NULL;
> +}

Couldn't this just as easily take a struct device arg and live in
vfio/vfio_main.?

> +
> +/*
> + * Check if a given iommu_group has been bound to an iommufd within a
> + * devset.  Returns true if there is device in the devset which is in
> + * the input iommu_group and meanwhile bound to the input iommufd.
> + * Otherwise, returns false.
> + */
> +static bool
> +vfio_devset_iommufd_has_group(struct vfio_device_set *dev_set,
> +			      struct iommufd_ctx *iommufd,
> +			      struct iommu_group *iommu_group)
> +{
> +	struct vfio_device *cur;
> +	struct iommu_group *grp;
> +	bool found = false;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list) {
> +		grp = iommu_group_get(cur->dev);
> +		if (!grp)
> +			continue;
> +		iommu_group_put(grp);
> +		if (iommu_group == grp &&
> +		    iommufd == vfio_iommufd_physical_ictx(cur)) {
> +			found = true;
> +			break;
> +		}
> +	}
> +	return found;
> +}

And should this live in vfio/iommufd.c?  I'd change the variables to
vdev and group for consistency elsewhere (yeah, I see cur from removed
code below).  We also don't need the found variable, we can simply
return true from within the loop and false outside of the loop.  The
group variable could also be scoped within the loop.

> +
>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  {
>  	(*(int *)data)++;
> @@ -776,13 +821,20 @@ struct vfio_pci_fill_info {
>  	int max;
>  	int cur;
>  	struct vfio_pci_dependent_device *devices;
> +	struct vfio_device *vdev;
> +	bool devid;
> +	bool resettable;

See other current threads on list about using bitfields.

>  };
>  
>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_pci_fill_info *fill = data;
> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(fill->vdev);
> +	struct vfio_device_set *dev_set = fill->vdev->dev_set;

Curious that we didn't added iommufd and dev_set fields to
vfio_pci_fill_info instead.  Both vars can be scoped within the devid
branch below.

>  	struct iommu_group *iommu_group;
>  
> +	lockdep_assert_held(&dev_set->lock);
> +
>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
>  
> @@ -790,7 +842,34 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
>  
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->devid) {
> +		struct vfio_device *vdev;
> +
> +		/*
> +		 * Report devid for the affected devices:
> +		 * - valid devid > 0 for the devices that are bound with
> +		 *   the iommufd of the calling device.
> +		 * - devid == 0 for the devices that have not been opened
> +		 *   but have same group with one of the devices bound to
> +		 *   the iommufd of the calling device.
> +		 * - devid == -1 for others, and clear resettable flag.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> +		if (vdev && iommufd == vfio_iommufd_physical_ictx(vdev)) {
> +			fill->devices[fill->cur].dev_id =
> +						vfio_iommufd_physical_devid(vdev);
> +			if (unlikely(!fill->devices[fill->cur].dev_id))
> +				return -EINVAL;

This looks more like a WARN_ON, it requires an inconsistent kernel
state, right?

> +		} else if (vfio_devset_iommufd_has_group(dev_set, iommufd,
> +							 iommu_group)) {
> +			fill->devices[fill->cur].dev_id = VFIO_PCI_DEVID_NONBLOCKING;
> +		} else {
> +			fill->devices[fill->cur].dev_id = VFIO_PCI_DEVID_BLOCKING;
> +			fill->resettable = false;
> +		}
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;
> @@ -1229,17 +1308,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  		return -ENOMEM;
>  
>  	fill.devices = devices;
> +	fill.vdev = &vdev->vdev;
>  
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	fill.devid = fill.resettable = vfio_device_cdev_opened(&vdev->vdev);
>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>  					    &fill, slot);
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>  
>  	/*
>  	 * If a device was removed between counting and filling, we may come up
>  	 * short of fill.max.  If a device was added, we'll have a return of
>  	 * -EAGAIN above.
>  	 */
> -	if (!ret)
> +	if (!ret) {
>  		hdr.count = fill.cur;
> +		if (fill.devid) {
> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;

hdr.flags is cleared early in the function, this should also mask in
DEV_ID for future proofing.

Note this implementation doesn't allow flags to be returned w/o a fully
sized return structure, as suggested might be a reason to maintain the
redundancy between the below flag and the devid semantics.

> +			if (fill.resettable)
> +				hdr.flags |= VFIO_PCI_HOT_RESET_FLAG_RESETTABLE;
> +		}
> +	}
>  
>  reset_info_exit:
>  	if (copy_to_user(arg, &hdr, minsz))
> @@ -2335,12 +2424,10 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_device_set *dev_set = data;
> -	struct vfio_device *cur;
>  
> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> -		if (cur->dev == &pdev->dev)
> -			return 0;
> -	return -EBUSY;
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>  }
>  
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0552e8dcf0cb..4b4e2c28984b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,46 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device.
> + *
> + * This command always reports the segment, bus, and devfn information for
> + * each affected device, and selectively reports the group_id or dev_id per
> + * the way how the calling device is opened.
> + *
> + *	- If the calling device is opened via the traditional group/container
> + *	  API, group_id is reported.  User should check if it has owned all
> + *	  the affected devices and provides a set of group fds to prove the
> + *	  ownership in VFIO_DEVICE_PCI_HOT_RESET ioctl.
> + *
> + *	- If the calling device is opened as a cdev, dev_id is reported.
> + *	  Flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID would be set.  Flag

s/would be set/is set to indicate this data type/

> + *	  VFIO_PCI_HOT_RESET_FLAG_RESETTABLE would be set per the ownership

I think we need to work on this flag name, see below.

> + *	  of the other affected devices.  If it is set, the user could invoke
> + *	  VFIO_DEVICE_PCI_HOT_RESET with a zero-length fd array.  Kernel

We don't have that support yet.

> + *	  set this flag when all the affected devices are owned by the user.
> + *	  This flag is available only VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID
> + *	  is set, otherwise ignored.  For a given affected device, it is owned

s/ignored/reserved/

> + *	  if it suits one of the below cases:

"...it is considered owned by this interface if it meets the following
conditions:"

> + *		1) bound to the same iommufd_ctx with the calling device

"1) Has a valid devid within the iommufd_ctx of the calling device.
Ownership cannot be determined across separate iommufd_ctx and the cdev
calling conventions do not support a proof-of-ownership model as
provided in the legacy group interface.  In this case a valid devid
with value greater than zero is provided in the return structure."


> + *		2) has not been bound to iommufd_ctx, but it is within the
> + *		   iommu_group of an owned device.

"2) Does not have a valid devid within iommufd_ctx of the calling
device, but belongs to the same IOMMU group as the calling device.
This provides implicit ownership for devices within the same DMA
isolation context.  In this case the invalid devid value of zero is
provided in the return structure."

> + *	  For 1), the dev_id > 0, for 2) dev_id == 0. Otherwise, dev_id == -1.

"A devid value of -1 is provided in the return structure for devices
where ownership is not available.  Such devices prevent use of
VFIO_DEVICE_PCI_HOT_RESET outside of proof-of-ownership calling
conventions (ie. via legacy group accessed devices)."

> + *
> + * If the affected devices of a calling device span into multiple iommufds
> + * or opened by different APIs (group/container or cdev), hot-reset on
> + * this device would be rejected.

I believe this is already covered in the wording suggestions above.

> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32   group_id;
> +		__u32	dev_id;
> +#define VFIO_PCI_DEVID_NONBLOCKING	0
> +#define VFIO_PCI_DEVID_BLOCKING	-1

The above description seems like it's leaning towards OWNED rather than
BLOCKING.

> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -663,6 +698,8 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
> +#define VFIO_PCI_HOT_RESET_FLAG_RESETTABLE	(1 << 1)

Maybe:

VFIO_PCI_HOT_RESET_FLAG_DEV_ID

and
 
VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED

I think we want to make the naming of the flag clearly specific to
DEV_ID and perhaps avoid "INFO said this was resettable, but HOT_RESET
failed" sorts of expectations.  Thanks,

Alex

>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
  2023-04-27 20:04   ` Alex Williamson
@ 2023-04-27 20:15     ` Alex Williamson
  2023-05-08 15:32       ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2023-04-27 20:15 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, cohuck, eric.auger, nicolinc,
	kvm, mjrosato, chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

On Thu, 27 Apr 2023 14:04:05 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 26 Apr 2023 07:54:18 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > This makes VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl to use the bound
> > iommufd of the cdev device to check the ownership of the other affected
> > devices and set a flag to tell user if the cdev device is resettable
> > with a zero-length fd array.
> > 
> > For each of the affected devices, if it is bound to the iommufd of the
> > cdev device, _INFO reports a valid dev_id > 0; if it is not opened by
> > the calling user, but it is in the iommu_group of a device that is bound
> > to the iommufd of the cdev device, reports dev_id == 0; If the device is
> > un-owned device, configured within a different iommufd, or opened outside
> > of the vfio device cdev API, the _INFO ioctl shall report dev_id==-1 for
> > such affected devices. dev_id >=0 doesn't block hot-reset, while
> > dev_id == -1 will block hot-reset.
> > 
> > This adds flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID to tell the user
> > dev_id is returned and adds flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE to
> > tell user if the cdev device is resettable or not.
> > 
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 101 ++++++++++++++++++++++++++++---
> >  include/uapi/linux/vfio.h        |  39 +++++++++++-
> >  2 files changed, 132 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 39e7823088e7..43858d471447 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -766,6 +766,51 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
> >  	return 0;
> >  }
> >  
> > +static struct vfio_device *
> > +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> > +			       struct pci_dev *pdev)
> > +{
> > +	struct vfio_device *cur;
> > +
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > +		if (cur->dev == &pdev->dev)
> > +			return cur;
> > +	return NULL;
> > +}  
> 
> Couldn't this just as easily take a struct device arg and live in
> vfio/vfio_main.?
> 
> > +
> > +/*
> > + * Check if a given iommu_group has been bound to an iommufd within a
> > + * devset.  Returns true if there is device in the devset which is in
> > + * the input iommu_group and meanwhile bound to the input iommufd.
> > + * Otherwise, returns false.
> > + */
> > +static bool
> > +vfio_devset_iommufd_has_group(struct vfio_device_set *dev_set,
> > +			      struct iommufd_ctx *iommufd,
> > +			      struct iommu_group *iommu_group)
> > +{
> > +	struct vfio_device *cur;
> > +	struct iommu_group *grp;
> > +	bool found = false;
> > +
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list) {
> > +		grp = iommu_group_get(cur->dev);
> > +		if (!grp)
> > +			continue;
> > +		iommu_group_put(grp);
> > +		if (iommu_group == grp &&
> > +		    iommufd == vfio_iommufd_physical_ictx(cur)) {
> > +			found = true;
> > +			break;
> > +		}
> > +	}
> > +	return found;
> > +}  
> 
> And should this live in vfio/iommufd.c?  I'd change the variables to
> vdev and group for consistency elsewhere (yeah, I see cur from removed
> code below).  We also don't need the found variable, we can simply
> return true from within the loop and false outside of the loop.  The
> group variable could also be scoped within the loop.
> 
> > +
> >  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	(*(int *)data)++;
> > @@ -776,13 +821,20 @@ struct vfio_pci_fill_info {
> >  	int max;
> >  	int cur;
> >  	struct vfio_pci_dependent_device *devices;
> > +	struct vfio_device *vdev;
> > +	bool devid;
> > +	bool resettable;  
> 
> See other current threads on list about using bitfields.
> 
> >  };
> >  
> >  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_pci_fill_info *fill = data;
> > +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(fill->vdev);
> > +	struct vfio_device_set *dev_set = fill->vdev->dev_set;  
> 
> Curious that we didn't added iommufd and dev_set fields to
> vfio_pci_fill_info instead.  Both vars can be scoped within the devid
> branch below.
> 
> >  	struct iommu_group *iommu_group;
> >  
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> >  	if (fill->cur == fill->max)
> >  		return -EAGAIN; /* Something changed, try again */
> >  
> > @@ -790,7 +842,34 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  	if (!iommu_group)
> >  		return -EPERM; /* Cannot reset non-isolated devices */
> >  
> > -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > +	if (fill->devid) {
> > +		struct vfio_device *vdev;
> > +
> > +		/*
> > +		 * Report devid for the affected devices:
> > +		 * - valid devid > 0 for the devices that are bound with
> > +		 *   the iommufd of the calling device.
> > +		 * - devid == 0 for the devices that have not been opened
> > +		 *   but have same group with one of the devices bound to
> > +		 *   the iommufd of the calling device.
> > +		 * - devid == -1 for others, and clear resettable flag.
> > +		 */
> > +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> > +		if (vdev && iommufd == vfio_iommufd_physical_ictx(vdev)) {
> > +			fill->devices[fill->cur].dev_id =
> > +						vfio_iommufd_physical_devid(vdev);
> > +			if (unlikely(!fill->devices[fill->cur].dev_id))
> > +				return -EINVAL;  
> 
> This looks more like a WARN_ON, it requires an inconsistent kernel
> state, right?
> 
> > +		} else if (vfio_devset_iommufd_has_group(dev_set, iommufd,
> > +							 iommu_group)) {
> > +			fill->devices[fill->cur].dev_id = VFIO_PCI_DEVID_NONBLOCKING;
> > +		} else {
> > +			fill->devices[fill->cur].dev_id = VFIO_PCI_DEVID_BLOCKING;
> > +			fill->resettable = false;
> > +		}
> > +	} else {
> > +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > +	}
> >  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
> >  	fill->devices[fill->cur].bus = pdev->bus->number;
> >  	fill->devices[fill->cur].devfn = pdev->devfn;
> > @@ -1229,17 +1308,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
> >  		return -ENOMEM;
> >  
> >  	fill.devices = devices;
> > +	fill.vdev = &vdev->vdev;
> >  
> > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > +	fill.devid = fill.resettable = vfio_device_cdev_opened(&vdev->vdev);
> >  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
> >  					    &fill, slot);
> > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> >  
> >  	/*
> >  	 * If a device was removed between counting and filling, we may come up
> >  	 * short of fill.max.  If a device was added, we'll have a return of
> >  	 * -EAGAIN above.
> >  	 */
> > -	if (!ret)
> > +	if (!ret) {
> >  		hdr.count = fill.cur;
> > +		if (fill.devid) {
> > +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;  
> 
> hdr.flags is cleared early in the function, this should also mask in
> DEV_ID for future proofing.
> 
> Note this implementation doesn't allow flags to be returned w/o a fully
> sized return structure, as suggested might be a reason to maintain the
> redundancy between the below flag and the devid semantics.
> 
> > +			if (fill.resettable)
> > +				hdr.flags |= VFIO_PCI_HOT_RESET_FLAG_RESETTABLE;
> > +		}
> > +	}
> >  
> >  reset_info_exit:
> >  	if (copy_to_user(arg, &hdr, minsz))
> > @@ -2335,12 +2424,10 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> >  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_device_set *dev_set = data;
> > -	struct vfio_device *cur;
> >  
> > -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > -		if (cur->dev == &pdev->dev)
> > -			return 0;
> > -	return -EBUSY;
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
> >  }
> >  
> >  /*
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 0552e8dcf0cb..4b4e2c28984b 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -650,11 +650,46 @@ enum {
> >   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> >   *					      struct vfio_pci_hot_reset_info)
> >   *
> > + * This command is used to query the affected devices in the hot reset for
> > + * a given device.
> > + *
> > + * This command always reports the segment, bus, and devfn information for
> > + * each affected device, and selectively reports the group_id or dev_id per
> > + * the way how the calling device is opened.
> > + *
> > + *	- If the calling device is opened via the traditional group/container
> > + *	  API, group_id is reported.  User should check if it has owned all
> > + *	  the affected devices and provides a set of group fds to prove the
> > + *	  ownership in VFIO_DEVICE_PCI_HOT_RESET ioctl.
> > + *
> > + *	- If the calling device is opened as a cdev, dev_id is reported.
> > + *	  Flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID would be set.  Flag  
> 
> s/would be set/is set to indicate this data type/
> 
> > + *	  VFIO_PCI_HOT_RESET_FLAG_RESETTABLE would be set per the ownership  
> 
> I think we need to work on this flag name, see below.
> 
> > + *	  of the other affected devices.  If it is set, the user could invoke
> > + *	  VFIO_DEVICE_PCI_HOT_RESET with a zero-length fd array.  Kernel  
> 
> We don't have that support yet.
> 
> > + *	  set this flag when all the affected devices are owned by the user.
> > + *	  This flag is available only VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID
> > + *	  is set, otherwise ignored.  For a given affected device, it is owned  
> 
> s/ignored/reserved/
> 
> > + *	  if it suits one of the below cases:  
> 
> "...it is considered owned by this interface if it meets the following
> conditions:"
> 
> > + *		1) bound to the same iommufd_ctx with the calling device  
> 
> "1) Has a valid devid within the iommufd_ctx of the calling device.
> Ownership cannot be determined across separate iommufd_ctx and the cdev
> calling conventions do not support a proof-of-ownership model as
> provided in the legacy group interface.  In this case a valid devid
> with value greater than zero is provided in the return structure."
> 
> 
> > + *		2) has not been bound to iommufd_ctx, but it is within the
> > + *		   iommu_group of an owned device.  
> 
> "2) Does not have a valid devid within iommufd_ctx of the calling
> device, but belongs to the same IOMMU group as the calling device.
> This provides implicit ownership for devices within the same DMA
> isolation context.  In this case the invalid devid value of zero is
> provided in the return structure."
> 
> > + *	  For 1), the dev_id > 0, for 2) dev_id == 0. Otherwise, dev_id == -1.  
> 
> "A devid value of -1 is provided in the return structure for devices
> where ownership is not available.  Such devices prevent use of
> VFIO_DEVICE_PCI_HOT_RESET outside of proof-of-ownership calling
> conventions (ie. via legacy group accessed devices)."
> 
> > + *
> > + * If the affected devices of a calling device span into multiple iommufds
> > + * or opened by different APIs (group/container or cdev), hot-reset on
> > + * this device would be rejected.  
> 
> I believe this is already covered in the wording suggestions above.
> 
> > + *
> >   * Return: 0 on success, -errno on failure:
> >   *	-enospc = insufficient buffer, -enodev = unsupported for device.
> >   */
> >  struct vfio_pci_dependent_device {
> > -	__u32	group_id;
> > +	union {
> > +		__u32   group_id;
> > +		__u32	dev_id;
> > +#define VFIO_PCI_DEVID_NONBLOCKING	0
> > +#define VFIO_PCI_DEVID_BLOCKING	-1  
> 
> The above description seems like it's leaning towards OWNED rather than
> BLOCKING.

Also these should be defined relative to something defined in IOMMUFD
rather than inventing values here.  We can't have the valid devid
number space owned by IOMMUFD conflict with these definitions.  Thanks,

Alex

> 
> > +	};
> >  	__u16	segment;
> >  	__u8	bus;
> >  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> > @@ -663,6 +698,8 @@ struct vfio_pci_dependent_device {
> >  struct vfio_pci_hot_reset_info {
> >  	__u32	argsz;
> >  	__u32	flags;
> > +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
> > +#define VFIO_PCI_HOT_RESET_FLAG_RESETTABLE	(1 << 1)  
> 
> Maybe:
> 
> VFIO_PCI_HOT_RESET_FLAG_DEV_ID
> 
> and
>  
> VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED
> 
> I think we want to make the naming of the flag clearly specific to
> DEV_ID and perhaps avoid "INFO said this was resettable, but HOT_RESET
> failed" sorts of expectations.  Thanks,
> 
> Alex
> 
> >  	__u32	count;
> >  	struct vfio_pci_dependent_device	devices[];
> >  };  
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-26 14:54 ` [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
  2023-04-27  6:54   ` Tian, Kevin
@ 2023-04-27 21:55   ` Alex Williamson
  2023-05-02 12:55     ` Liu, Yi L
  1 sibling, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2023-04-27 21:55 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, cohuck, eric.auger, nicolinc,
	kvm, mjrosato, chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

On Wed, 26 Apr 2023 07:54:19 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> This is the way user to invoke hot-reset for the devices opened by cdev
> interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE
> in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing
> hot-reset for cdev devices.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 66 +++++++++++++++++++++++++++-----
>  include/uapi/linux/vfio.h        | 22 +++++++++++
>  2 files changed, 79 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 43858d471447..f70e3b948b16 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  struct vfio_pci_group_info;
>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups);
> +				      struct vfio_pci_group_info *groups,
> +				      struct iommufd_ctx *iommufd_ctx);
>  
>  /*
>   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> @@ -1364,8 +1365,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	if (ret)
>  		return ret;
>  
> -	/* Somewhere between 1 and count is OK */
> -	if (!array_count || array_count > count)
> +	if (array_count > count)
>  		return -EINVAL;

Doesn't this need a || vfio_device_cdev_opened(vdev) test as well?
It's invalid to pass fds for a cdev device.  Presumably it would fail
later collecting group fds as well, but might as well enforce the
semantics early.

>  
>  	group_fds = kcalloc(array_count, sizeof(*group_fds), GFP_KERNEL);
> @@ -1414,7 +1414,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	info.count = array_count;
>  	info.files = files;
>  
> -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
>  
>  hot_reset_release:
>  	for (file_idx--; file_idx >= 0; file_idx--)
> @@ -1429,6 +1429,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  {
>  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
>  	struct vfio_pci_hot_reset hdr;
> +	struct iommufd_ctx *iommufd;
>  	bool slot = false;
>  
>  	if (copy_from_user(&hdr, arg, minsz))
> @@ -1443,7 +1444,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	else if (pci_probe_reset_bus(vdev->pdev->bus))
>  		return -ENODEV;
>  
> -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
> +	if (hdr.count)
> +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
> +
> +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +
> +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);

Why did we need to store iommufd in a variable?

>  }
>  
>  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> @@ -2415,6 +2421,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
>  {
>  	unsigned int i;
>  
> +	if (!groups)
> +		return false;
> +
>  	for (i = 0; i < groups->count; i++)
>  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
>  			return true;
> @@ -2488,13 +2497,38 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
>  	return ret;
>  }
>  
> +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> +				    struct iommufd_ctx *iommufd_ctx)
> +{
> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	struct iommu_group *iommu_group;
> +
> +	if (!iommufd_ctx)
> +		return false;
> +
> +	if (iommufd == iommufd_ctx)
> +		return true;
> +
> +	iommu_group = iommu_group_get(vdev->vdev.dev);
> +	if (!iommu_group)
> +		return false;
> +
> +	/*
> +	 * Try to check if any device within iommu_group is bound with
> +	 * the input iommufd_ctx.
> +	 */
> +	return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
> +					     iommufd_ctx, iommu_group);
> +}

This last test makes this not do what the function name suggests it
does.  If it were true, the device is not in the iommufd_ctx, it simply
cannot be within another iommu ctx.

> +
>  /*
>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>   * therefore we need to zap and hold the vma_lock for each device, and only then
>   * get each memory_lock.
>   */
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups)
> +				      struct vfio_pci_group_info *groups,
> +				      struct iommufd_ctx *iommufd_ctx)
>  {
>  	struct vfio_pci_core_device *cur_mem;
>  	struct vfio_pci_core_device *cur_vma;
> @@ -2525,10 +2559,24 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
> -		 * Test whether all the affected devices are contained by the
> -		 * set of groups provided by the user.
> +		 * Test whether all the affected devices can be reset by the
> +		 * user.
> +		 *
> +		 * If user provides a set of groups, all the opened devices
> +		 * in the dev_set should be contained by the set of groups
> +		 * provided by the user.
> +		 *
> +		 * If user provides a zero-length group fd array, then all
> +		 * the affected devices must be bound to same iommufd_ctx as
> +		 * the input iommufd_ctx.  If there is device that has not
> +		 * been bound to iommufd_ctx yet, shall check if there is any
> +		 * device within its iommu_group that has been bound to the
> +		 * input iommufd_ctx.
> +		 *
> +		 * Otherwise, reset is not allowed.
>  		 */
> -		if (!vfio_dev_in_groups(cur_vma, groups)) {
> +		if (!vfio_dev_in_groups(cur_vma, groups) &&
> +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx)) {


Rather than mangling vfio_dev_in_groups() and inventing
vfio_dev_in_iommufd_ctx() that doesn't do what it implies, how about:

bool vfio_device_owned(struct vfio_device *vdev,
		       struct vfio_pci_group_info *groups,
		       struct iommufd_ctx *iommufd_ctx)
{
	struct iommu_group *group;

	WARN_ON(!!groups == !!iommufd_ctx);

	if (groups)
		return vfio_dev_in_groups(vdev, groups));

	if (vfio_iommufd_physical_ictx(vdev) == iommufd_ctx)
		return true;

	group = iommu_group_get(vdev->dev);
	if (group)
		return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
						     iommufd_ctx, group);
	return false;
}

Seems like such a function would live in vfio_main.c

>  			ret = -EINVAL;
>  			goto err_undo;
>  		}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 4b4e2c28984b..1241d02d8701 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -710,6 +710,28 @@ struct vfio_pci_hot_reset_info {
>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
>   *				    struct vfio_pci_hot_reset)
>   *
> + * Userspace requests hot reset for the devices it operates.  Due to the
> + * underlying topology, multiple devices can be affected in the reset
> + * while some might be opened by another user.  To avoid interference
> + * the calling user must ensure all affected devices are owned by itself.
> + * The ownership proof needs to refer the output of
> + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.  Ownership can be proved as:
> + *
> + *   1) An array of group fds - This is used for the devices opened via
> + *				the group/container interface.
> + *   2) A zero-length array - This is used for the devices opened via
> + *			      the cdev interface.  User should check the
> + *			      flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID
> + *			      and flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE
> + *			      before using this method.
> + *
> + * In case a non void group fd array is passed, the devices affected by
> + * the reset must belong to those opened VFIO groups.  In case a zero
> + * length array is passed, the other devices affected by the reset, if
> + * any, must be either bound to the same iommufd as this VFIO device or
> + * in the same iommu_group with a device that does.  Either of the two
> + * methods is applied to check the feasibility of the hot reset.

This should probably just refer to the concept of ownership described
in the INFO ioctl and clarify that cdev opened device must exclusively
provide an empty array and group opened devices must exclusively use an
array of group fds for proof of ownership.  Mixed access to devices
between cdev and legacy groups are not supported by this interface.
Thanks,

Alex

> + *
>   * Return: 0 on success, -errno on failure.
>   */
>  struct vfio_pci_hot_reset {


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-27 18:32       ` Alex Williamson
@ 2023-04-28  6:21         ` Yi Liu
  2023-04-28  7:00           ` Tian, Kevin
  2023-04-28 12:07           ` Jason Gunthorpe
  2023-04-28 16:13         ` Yi Liu
  1 sibling, 2 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-28  6:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, jgg@nvidia.com, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On 2023/4/28 02:32, Alex Williamson wrote:
> On Thu, 27 Apr 2023 06:59:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
>>> From: Tian, Kevin <kevin.tian@intel.com>
>>> Sent: Thursday, April 27, 2023 2:39 PM
>>>    
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Wednesday, April 26, 2023 10:54 PM
>>>> @@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data,
>>>> unsigned long iova,
>>>>   {
>>>>   	struct vfio_device *vdev = data;
>>>>
>>>> -	if (vdev->ops->dma_unmap)
>>>> +	/* noiommu devices cannot do map/unmap */
>>>> +	if (vdev->noiommu && vdev->ops->dma_unmap)
>>>>   		vdev->ops->dma_unmap(vdev, iova, length);
>>>
>>> Is it necessary? All mdev devices implementing @dma_unmap won't
>>> set noiommu flag.
>>
>> Hmmm. Yes, and all the devices set noiommu is not implementing @dma_unmap
>> as far as I see. Maybe this noiommu check can be removed.
> 
> Not to mention that the polarity of the noiommu test is backwards here!
> This also seems to be the only performance path where noiommu is tested
> and therefore I believe the only actual justification of the previous
> patch.

but this patch needs to use vfio_iommufd_emulated_bind() and
vfio_iommufd_emulated_unbind() for the noiommu devices when binding
to iommufd. So needs to check noiommu in the vfio_iommufd_bind()
and vfio_iommu_unbind() as well.

>>> Instead in the future if we allow noiommu userspace to pin pages
>>> we'd need similar logic too.
>>
>> I'm not quite sure about it so far. For mdev devices, the device driver
>> may use vfio_pin_pages/vfio_dma_rw () to pin page. Hence such drivers
>> need to listen to dma_unmap() event. But for noiommu users, does the
>> device driver also participate in the page pin? At least for vfio-pci driver,
>> it does not, or maybe it will in the future when enabling noiommu
>> userspace to pin pages. It looks to me such userspace should order
>> the DMA before calling ioctl to unpin page instead of letting device
>> driver listen to unmap.
> 
> Whoa, noiommu is inherently unsafe an only meant to expose the vfio
> device interface for userspace drivers that are going to do unsafe
> things regardless.  Enabling noiommu to work with mdev, pin pages, or
> anything else should not be on our agenda.

One clarification. I think the idea from Jason is to make noiommu
userspace to be able to pin page. [1]. But this is just a potential
benefit of creating iommufd_access for noiommu devices. There is no
intention to make noiommu devices to work with mdev.

[1] https://lore.kernel.org/kvm/ZD1MCc6fD+oisjki@nvidia.com/#t

> Userspaces relying on niommu
> get the minimum viable interface and must impose a minuscule
> incremental maintenance burden.  The only reason we're spending so much
> effort on it here is to make iommufd noiommu support equivalent to
> group/container noiommu support.  We should stop at that.  Thanks,

yes. This is why this patch is to bind noiommu devices to iommufd
and create iommufd_access. Otherwise, noiommu devices would have
trouble in the hot-reset path when iommufd-based ownership check
model is applied, and this is the only model for cdev. So binding
noiommu devices to iommufd is necessary for support noiommu in the
cdev interface.

If this above makes sense. Then back to the question of noiommu
check in vfio_emulated_unmap(). At first, I intend to have such
a check to avoid calling dma_unmap callback for noiommu devices.
But per Kevin's comment and your above statement on mdev and noiommu,
so in reality, noiommu device drivers won't implement dma_unmap
callback. So it is probably fine to remove the noiommu check in
vfio_emulated_unmap().

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device
  2023-04-27 18:43   ` Alex Williamson
@ 2023-04-28  6:42     ` Yi Liu
  0 siblings, 0 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-28  6:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, kevin.tian, joro, robin.murphy, cohuck, eric.auger, nicolinc,
	kvm, mjrosato, chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	intel-gvt-dev, intel-gfx, linux-s390, xudong.hao, yan.y.zhao,
	terrence.xu, yanting.jiang, zhenzhong.duan

On 2023/4/28 02:43, Alex Williamson wrote:
> On Wed, 26 Apr 2023 07:54:15 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> Use it to differentiate whether to report group_id or dev_id in revised
>> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. Though it is not set at this
>> moment introducing it now allows us to get hot reset ready for cdev.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   include/linux/vfio.h | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 4ee613924435..298f4ef16be7 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -63,6 +63,7 @@ struct vfio_device {
>>   	bool iommufd_attached;
>>   #endif
>>   	bool noiommu;
>> +	bool cdev_opened;
>>   };
>>   
>>   /**
>> @@ -140,6 +141,12 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
>>   	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
>>   #endif
>>   
>> +static inline bool vfio_device_cdev_opened(struct vfio_device *device)
>> +{
>> +	lockdep_assert_held(&device->dev_set->lock);
>> +	return device->cdev_opened;
>> +}
> 
> The lockdep test doesn't make much sense here, the method of opening
> the device can't change from an ioctl called on the device, which is
> the only path using this.  

yes, it is only used by the ioctl paths so far. And this flag should be
static in other user-space trigger-able paths with device fd, like the
read/write/mmap. Do you think a comment would help to avoid abuse in other
paths or the name of it implies it should be used after a device is opened?

> If there needs to be a placeholder for
> future code, it should probably statically return false here and we can
> save adding the structure field and locking checks based on the
> semantics of how the field is actually used later.  Thanks,

Thanks for this suggestion. I was also wondering how to handle it when
Eric questioned better to add cdev_opened when it is really used.[1]

[1] 
https://lore.kernel.org/kvm/DS0PR11MB752914607E1DC9CFBFC2EF7AC3609@DS0PR11MB7529.namprd11.prod.outlook.com/

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-28  6:21         ` Yi Liu
@ 2023-04-28  7:00           ` Tian, Kevin
  2023-04-28  7:04             ` Yi Liu
  2023-04-28 12:07           ` Jason Gunthorpe
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2023-04-28  7:00 UTC (permalink / raw)
  To: Liu, Yi L, Alex Williamson
  Cc: jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 28, 2023 2:21 PM
> 
> On 2023/4/28 02:32, Alex Williamson wrote:
> > On Thu, 27 Apr 2023 06:59:17 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >
> >>> From: Tian, Kevin <kevin.tian@intel.com>
> >>> Sent: Thursday, April 27, 2023 2:39 PM
> >>>
> >>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>> Sent: Wednesday, April 26, 2023 10:54 PM
> >>>> @@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data,
> >>>> unsigned long iova,
> >>>>   {
> >>>>   	struct vfio_device *vdev = data;
> >>>>
> >>>> -	if (vdev->ops->dma_unmap)
> >>>> +	/* noiommu devices cannot do map/unmap */
> >>>> +	if (vdev->noiommu && vdev->ops->dma_unmap)
> >>>>   		vdev->ops->dma_unmap(vdev, iova, length);
> >>>
> >>> Is it necessary? All mdev devices implementing @dma_unmap won't
> >>> set noiommu flag.
> >>
> >> Hmmm. Yes, and all the devices set noiommu is not implementing
> @dma_unmap
> >> as far as I see. Maybe this noiommu check can be removed.
> >
> > Not to mention that the polarity of the noiommu test is backwards here!
> > This also seems to be the only performance path where noiommu is tested
> > and therefore I believe the only actual justification of the previous
> > patch.
> 
> but this patch needs to use vfio_iommufd_emulated_bind() and
> vfio_iommufd_emulated_unbind() for the noiommu devices when binding
> to iommufd. So needs to check noiommu in the vfio_iommufd_bind()
> and vfio_iommu_unbind() as well.
> 

You can continue to use vfio_device_is_noiommu() in this patch. It's not
big deal to drop it from this series and then add back in cdev series.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-28  7:00           ` Tian, Kevin
@ 2023-04-28  7:04             ` Yi Liu
  0 siblings, 0 replies; 50+ messages in thread
From: Yi Liu @ 2023-04-28  7:04 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: jgg@nvidia.com, joro@8bytes.org, robin.murphy@arm.com,
	cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On 2023/4/28 15:00, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, April 28, 2023 2:21 PM
>>
>> On 2023/4/28 02:32, Alex Williamson wrote:
>>> On Thu, 27 Apr 2023 06:59:17 +0000
>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>
>>>>> From: Tian, Kevin <kevin.tian@intel.com>
>>>>> Sent: Thursday, April 27, 2023 2:39 PM
>>>>>
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Sent: Wednesday, April 26, 2023 10:54 PM
>>>>>> @@ -121,7 +128,8 @@ static void vfio_emulated_unmap(void *data,
>>>>>> unsigned long iova,
>>>>>>    {
>>>>>>    	struct vfio_device *vdev = data;
>>>>>>
>>>>>> -	if (vdev->ops->dma_unmap)
>>>>>> +	/* noiommu devices cannot do map/unmap */
>>>>>> +	if (vdev->noiommu && vdev->ops->dma_unmap)
>>>>>>    		vdev->ops->dma_unmap(vdev, iova, length);
>>>>>
>>>>> Is it necessary? All mdev devices implementing @dma_unmap won't
>>>>> set noiommu flag.
>>>>
>>>> Hmmm. Yes, and all the devices set noiommu is not implementing
>> @dma_unmap
>>>> as far as I see. Maybe this noiommu check can be removed.
>>>
>>> Not to mention that the polarity of the noiommu test is backwards here!
>>> This also seems to be the only performance path where noiommu is tested
>>> and therefore I believe the only actual justification of the previous
>>> patch.
>>
>> but this patch needs to use vfio_iommufd_emulated_bind() and
>> vfio_iommufd_emulated_unbind() for the noiommu devices when binding
>> to iommufd. So needs to check noiommu in the vfio_iommufd_bind()
>> and vfio_iommu_unbind() as well.
>>
> 
> You can continue to use vfio_device_is_noiommu() in this patch. It's not
> big deal to drop it from this series and then add back in cdev series.

yes.:-) patch 01 of this series was added more per the cdev series reviews.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device
  2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
                   ` (9 preceding siblings ...)
  2023-04-26 15:07 ` [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Liu, Yi L
@ 2023-04-28  9:28 ` Jiang, Yanting
  10 siblings, 0 replies; 50+ messages in thread
From: Jiang, Yanting @ 2023-04-28  9:28 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson@redhat.com, jgg@nvidia.com,
	Tian, Kevin
  Cc: joro@8bytes.org, robin.murphy@arm.com, cohuck@redhat.com,
	eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org,
	mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com,
	yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Duan, Zhenzhong

> VFIO_DEVICE_PCI_HOT_RESET requires user to pass an array of group fds to
> prove that it owns all devices affected by resetting the calling device. While for
> cdev devices, user can use an iommufd-based ownership checking model and
> invoke VFIO_DEVICE_PCI_HOT_RESET with a zero-length fd array.
> 
> This series first creates iommufd_access for noiommu devices to fill the gap for
> adding iommufd-based ownership checking model, then extends
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to check ownership and return the
> check result and the devid of affected devices to user. In the end, extends the
> VFIO_DEVICE_PCI_HOT_RESET to accept zero-length fd array for hot-reset with
> cdev devices.
> 
> The new hot reset method and updated _INFO ioctl are tested with the below
> qemu:
> 
> https://github.com/yiliu1765/qemu/tree/iommufd_rfcv4.mig.reset.v4_var3
> (requires to test with the cdev kernel)
> 

Tested NIC passthrough on Intel platform.
Result looks good hence,
Tested-by: Yanting Jiang <yanting.jiang@intel.com>

Thanks,
Yanting


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-28  6:21         ` Yi Liu
  2023-04-28  7:00           ` Tian, Kevin
@ 2023-04-28 12:07           ` Jason Gunthorpe
  2023-04-28 16:07             ` Yi Liu
  1 sibling, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2023-04-28 12:07 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Fri, Apr 28, 2023 at 02:21:26PM +0800, Yi Liu wrote:

> but this patch needs to use vfio_iommufd_emulated_bind() and
> vfio_iommufd_emulated_unbind() for the noiommu devices when binding
> to iommufd. So needs to check noiommu in the vfio_iommufd_bind()
> and vfio_iommu_unbind() as well.

I'm not sure this is strictly necessary, it just needs an access

The emulated stuff is for mdev only, it should not be confused with
no-iommu

Eg if you had a no_iommu_access value to store the access it would be
fine and could serve as the 'this is no_iommu' flag

The only purpose of the access at this moment is to get an iommufdctx
id to return to userspace.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-28 12:07           ` Jason Gunthorpe
@ 2023-04-28 16:07             ` Yi Liu
  2023-05-02 18:12               ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-28 16:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On 2023/4/28 20:07, Jason Gunthorpe wrote:
> On Fri, Apr 28, 2023 at 02:21:26PM +0800, Yi Liu wrote:
> 
>> but this patch needs to use vfio_iommufd_emulated_bind() and
>> vfio_iommufd_emulated_unbind() for the noiommu devices when binding
>> to iommufd. So needs to check noiommu in the vfio_iommufd_bind()
>> and vfio_iommu_unbind() as well.
> 
> I'm not sure this is strictly necessary, it just needs an access
> 
> The emulated stuff is for mdev only, it should not be confused with
> no-iommu

hmmm. I guess the confusion is due to the reuse of 
vfio_iommufd_emulated_bind().

> 
> Eg if you had a no_iommu_access value to store the access it would be
> fine and could serve as the 'this is no_iommu' flag

So this no_iommu_access shall be created per iommufd bind, and call the
iommufd_access_create() with iommufd_access_ops. is it? If so, this is
not 100% the same with no_iommu flag as this flag is static after device
registration.

> 
> The only purpose of the access at this moment is to get an iommufdctx
> id to return to userspace.

yes. and it is the iommufd_access ID to be returned to userspace.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-27 18:32       ` Alex Williamson
  2023-04-28  6:21         ` Yi Liu
@ 2023-04-28 16:13         ` Yi Liu
  2023-05-02 18:22           ` Jason Gunthorpe
  1 sibling, 1 reply; 50+ messages in thread
From: Yi Liu @ 2023-04-28 16:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, jgg@nvidia.com, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On 2023/4/28 02:32, Alex Williamson wrote:
> On Thu, 27 Apr 2023 06:59:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
[...]
>>
>> I'm not quite sure about it so far. For mdev devices, the device driver
>> may use vfio_pin_pages/vfio_dma_rw () to pin page. Hence such drivers
>> need to listen to dma_unmap() event. But for noiommu users, does the
>> device driver also participate in the page pin? At least for vfio-pci driver,
>> it does not, or maybe it will in the future when enabling noiommu
>> userspace to pin pages. It looks to me such userspace should order
>> the DMA before calling ioctl to unpin page instead of letting device
>> driver listen to unmap.
> 
> Whoa, noiommu is inherently unsafe an only meant to expose the vfio
> device interface for userspace drivers that are going to do unsafe
> things regardless.  Enabling noiommu to work with mdev, pin pages, or
> anything else should not be on our agenda.  Userspaces relying on niommu
> get the minimum viable interface and must impose a minuscule
> incremental maintenance burden.  The only reason we're spending so much
> effort on it here is to make iommufd noiommu support equivalent to
> group/container noiommu support.  We should stop at that.  Thanks,

btw. I asked a question in [1] to check if we should allow attach/detach
on noiommu devices. Jason has replied it. If in future noiommu userspace
can pin page, then such userspace will need to attach/detach ioas. So I
made cdev series[2] to allow attach ioas on noiommu devices. Supporting
it from cdev day-1 may avoid probing if attach/detach is supported or
not for specific devices when adding pin page for noiommu userspace.

But now, I think such a support will not in plan, is it? If so, will it
be better to disallow attach/detach on noiommu devices in patch [2]?

[1] https://lore.kernel.org/kvm/ZEa+khH0tUFStRMW@nvidia.com/
[2] https://lore.kernel.org/kvm/20230426150321.454465-21-yi.l.liu@intel.com/

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-27 21:55   ` Alex Williamson
@ 2023-05-02 12:55     ` Liu, Yi L
  0 siblings, 0 replies; 50+ messages in thread
From: Liu, Yi L @ 2023-05-02 12:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg@nvidia.com, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 28, 2023 5:55 AM
> 
> On Wed, 26 Apr 2023 07:54:19 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > This is the way user to invoke hot-reset for the devices opened by cdev
> > interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE
> > in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing
> > hot-reset for cdev devices.
> >
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 66 +++++++++++++++++++++++++++-----
> >  include/uapi/linux/vfio.h        | 22 +++++++++++
> >  2 files changed, 79 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 43858d471447..f70e3b948b16 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device
> *vdev)
> >  struct vfio_pci_group_info;
> >  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> >  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> > -				      struct vfio_pci_group_info *groups);
> > +				      struct vfio_pci_group_info *groups,
> > +				      struct iommufd_ctx *iommufd_ctx);
> >
> >  /*
> >   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> > @@ -1364,8 +1365,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct
> vfio_pci_core_device *vdev,
> >  	if (ret)
> >  		return ret;
> >
> > -	/* Somewhere between 1 and count is OK */
> > -	if (!array_count || array_count > count)
> > +	if (array_count > count)
> >  		return -EINVAL;
> 
> Doesn't this need a || vfio_device_cdev_opened(vdev) test as well?
> It's invalid to pass fds for a cdev device.  Presumably it would fail
> later collecting group fds as well, but might as well enforce the
> semantics early.

Yes, it is.

> 
> >
> >  	group_fds = kcalloc(array_count, sizeof(*group_fds), GFP_KERNEL);
> > @@ -1414,7 +1414,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct
> vfio_pci_core_device *vdev,
> >  	info.count = array_count;
> >  	info.files = files;
> >
> > -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> > +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
> >
> >  hot_reset_release:
> >  	for (file_idx--; file_idx >= 0; file_idx--)
> > @@ -1429,6 +1429,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  {
> >  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> >  	struct vfio_pci_hot_reset hdr;
> > +	struct iommufd_ctx *iommufd;
> >  	bool slot = false;
> >
> >  	if (copy_from_user(&hdr, arg, minsz))
> > @@ -1443,7 +1444,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	else if (pci_probe_reset_bus(vdev->pdev->bus))
> >  		return -ENODEV;
> >
> > -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
> > +	if (hdr.count)
> > +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, hdr.count, slot, arg);
> > +
> > +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +
> > +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
> 
> Why did we need to store iommufd in a variable?

will remove it.

> >  }
> >
> >  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> > @@ -2415,6 +2421,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device
> *vdev,
> >  {
> >  	unsigned int i;
> >
> > +	if (!groups)
> > +		return false;
> > +
> >  	for (i = 0; i < groups->count; i++)
> >  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> >  			return true;
> > @@ -2488,13 +2497,38 @@ static int vfio_pci_dev_set_pm_runtime_get(struct
> vfio_device_set *dev_set)
> >  	return ret;
> >  }
> >
> > +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> > +				    struct iommufd_ctx *iommufd_ctx)
> > +{
> > +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +	struct iommu_group *iommu_group;
> > +
> > +	if (!iommufd_ctx)
> > +		return false;
> > +
> > +	if (iommufd == iommufd_ctx)
> > +		return true;
> > +
> > +	iommu_group = iommu_group_get(vdev->vdev.dev);
> > +	if (!iommu_group)
> > +		return false;
> > +
> > +	/*
> > +	 * Try to check if any device within iommu_group is bound with
> > +	 * the input iommufd_ctx.
> > +	 */
> > +	return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
> > +					     iommufd_ctx, iommu_group);
> > +}
> 
> This last test makes this not do what the function name suggests it
> does.  If it were true, the device is not in the iommufd_ctx, it simply
> cannot be within another iommu ctx.

Yes. it actually means not possible to be in another iommufd_ctx.

> 
> > +
> >  /*
> >   * We need to get memory_lock for each device, but devices can share mmap_lock,
> >   * therefore we need to zap and hold the vma_lock for each device, and only then
> >   * get each memory_lock.
> >   */
> >  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> > -				      struct vfio_pci_group_info *groups)
> > +				      struct vfio_pci_group_info *groups,
> > +				      struct iommufd_ctx *iommufd_ctx)
> >  {
> >  	struct vfio_pci_core_device *cur_mem;
> >  	struct vfio_pci_core_device *cur_vma;
> > @@ -2525,10 +2559,24 @@ static int vfio_pci_dev_set_hot_reset(struct
> vfio_device_set *dev_set,
> >
> >  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
> >  		/*
> > -		 * Test whether all the affected devices are contained by the
> > -		 * set of groups provided by the user.
> > +		 * Test whether all the affected devices can be reset by the
> > +		 * user.
> > +		 *
> > +		 * If user provides a set of groups, all the opened devices
> > +		 * in the dev_set should be contained by the set of groups
> > +		 * provided by the user.
> > +		 *
> > +		 * If user provides a zero-length group fd array, then all
> > +		 * the affected devices must be bound to same iommufd_ctx as
> > +		 * the input iommufd_ctx.  If there is device that has not
> > +		 * been bound to iommufd_ctx yet, shall check if there is any
> > +		 * device within its iommu_group that has been bound to the
> > +		 * input iommufd_ctx.
> > +		 *
> > +		 * Otherwise, reset is not allowed.
> >  		 */
> > -		if (!vfio_dev_in_groups(cur_vma, groups)) {
> > +		if (!vfio_dev_in_groups(cur_vma, groups) &&
> > +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx)) {
> 
> 
> Rather than mangling vfio_dev_in_groups() and inventing
> vfio_dev_in_iommufd_ctx() that doesn't do what it implies, how about:
> 
> bool vfio_device_owned(struct vfio_device *vdev,
> 		       struct vfio_pci_group_info *groups,
> 		       struct iommufd_ctx *iommufd_ctx)
> {
> 	struct iommu_group *group;
> 
> 	WARN_ON(!!groups == !!iommufd_ctx);
> 
> 	if (groups)
> 		return vfio_dev_in_groups(vdev, groups));
> 
> 	if (vfio_iommufd_physical_ictx(vdev) == iommufd_ctx)
> 		return true;
> 
> 	group = iommu_group_get(vdev->dev);
> 	if (group)
> 		return vfio_devset_iommufd_has_group(vdev->vdev.dev_set,
> 						     iommufd_ctx, group);
> 	return false;
> }

Will follow above suggestion.

> Seems like such a function would live in vfio_main.c

It may require to make the struct vfio_pci_group_info visible outside
of vfio-pci. This seems to be strange to make vfio_main.c to refer pci
specific structure.

> 
> >  			ret = -EINVAL;
> >  			goto err_undo;
> >  		}
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 4b4e2c28984b..1241d02d8701 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -710,6 +710,28 @@ struct vfio_pci_hot_reset_info {
> >   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
> >   *				    struct vfio_pci_hot_reset)
> >   *
> > + * Userspace requests hot reset for the devices it operates.  Due to the
> > + * underlying topology, multiple devices can be affected in the reset
> > + * while some might be opened by another user.  To avoid interference
> > + * the calling user must ensure all affected devices are owned by itself.
> > + * The ownership proof needs to refer the output of
> > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.  Ownership can be proved as:
> > + *
> > + *   1) An array of group fds - This is used for the devices opened via
> > + *				the group/container interface.
> > + *   2) A zero-length array - This is used for the devices opened via
> > + *			      the cdev interface.  User should check the
> > + *			      flag VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID
> > + *			      and flag VFIO_PCI_HOT_RESET_FLAG_RESETTABLE
> > + *			      before using this method.
> > + *
> > + * In case a non void group fd array is passed, the devices affected by
> > + * the reset must belong to those opened VFIO groups.  In case a zero
> > + * length array is passed, the other devices affected by the reset, if
> > + * any, must be either bound to the same iommufd as this VFIO device or
> > + * in the same iommu_group with a device that does.  Either of the two
> > + * methods is applied to check the feasibility of the hot reset.
> 
> This should probably just refer to the concept of ownership described
> in the INFO ioctl and clarify that cdev opened device must exclusively
> provide an empty array and group opened devices must exclusively use an
> array of group fds for proof of ownership.  Mixed access to devices
> between cdev and legacy groups are not supported by this interface.
> Thanks,

Sure. Will make it in next version.

Regards,
Yi Liu

> 
> Alex
> 
> > + *
> >   * Return: 0 on success, -errno on failure.
> >   */
> >  struct vfio_pci_hot_reset {


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-28 16:07             ` Yi Liu
@ 2023-05-02 18:12               ` Jason Gunthorpe
  2023-05-03  9:48                 ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2023-05-02 18:12 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Sat, Apr 29, 2023 at 12:07:24AM +0800, Yi Liu wrote:
> > The emulated stuff is for mdev only, it should not be confused with
> > no-iommu
> 
> hmmm. I guess the confusion is due to the reuse of
> vfio_iommufd_emulated_bind().

This is probabl y not a good direction

> > Eg if you had a no_iommu_access value to store the access it would be
> > fine and could serve as the 'this is no_iommu' flag
> 
> So this no_iommu_access shall be created per iommufd bind, and call the
> iommufd_access_create() with iommufd_access_ops. is it? If so, this is
> not 100% the same with no_iommu flag as this flag is static after device
> registration.

Something like that, yes

I don't think it is any real difference with the current flag, both
are determined at the first ioctl when the iommufd is presented and
both would state permanently until the fd close

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-04-28 16:13         ` Yi Liu
@ 2023-05-02 18:22           ` Jason Gunthorpe
  2023-05-03  9:57             ` Liu, Yi L
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2023-05-02 18:22 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Sat, Apr 29, 2023 at 12:13:39AM +0800, Yi Liu wrote:

> > Whoa, noiommu is inherently unsafe an only meant to expose the vfio
> > device interface for userspace drivers that are going to do unsafe
> > things regardless.  Enabling noiommu to work with mdev, pin pages, or
> > anything else should not be on our agenda.  Userspaces relying on niommu
> > get the minimum viable interface and must impose a minuscule
> > incremental maintenance burden.  The only reason we're spending so much
> > effort on it here is to make iommufd noiommu support equivalent to
> > group/container noiommu support.  We should stop at that.  Thanks,
> 
> btw. I asked a question in [1] to check if we should allow attach/detach
> on noiommu devices. Jason has replied it. If in future noiommu userspace
> can pin page, then such userspace will need to attach/detach ioas. So I
> made cdev series[2] to allow attach ioas on noiommu devices. Supporting
> it from cdev day-1 may avoid probing if attach/detach is supported or
> not for specific devices when adding pin page for noiommu userspace.
> 
> But now, I think such a support will not in plan, is it? If so, will it
> be better to disallow attach/detach on noiommu devices in patch [2]?
> 
> [1] https://lore.kernel.org/kvm/ZEa+khH0tUFStRMW@nvidia.com/
> [2] https://lore.kernel.org/kvm/20230426150321.454465-21-yi.l.liu@intel.com/

If we block it then userspace has to act quite differently, I think we
should keep it.

My general idea to complete the no-iommu feature is to add a new IOCTL
to VFIO that is 'pin iova and return dma addr' that no-iommu userspace
would call instead of trying to abuse mlock and /proc/ to do it. That
ioctl would use the IOAS attached to the access just like a mdev would
do, so it has a real IOVA, but it is not a mdev.

unmap callback just does nothing, as Alex says it is all still totally
unsafe.

This just allows it use the mm a little more properly and safely (eg
mlock() doesn't set things like page_maybe_dma_pinned(), proc doesn't
reject things like DAX and it currently doesn't make an adjustment for
the PCI offset stuff..) So it would make DPDK a little more robust,
portable and make the whole VFIO no-iommu feature much easier to use.

To do that we need an iommufd access, an access ID and we need to link
the current IOAS to the special access, like mdev, but in any mdev
code paths.

That creating the access ID solves the reset problem as well is a nice
side effect and is the only part of this you should focus on for now..

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-05-02 18:12               ` Jason Gunthorpe
@ 2023-05-03  9:48                 ` Liu, Yi L
  2023-05-03 19:42                   ` Jason Gunthorpe
  2023-05-08 15:46                   ` Liu, Yi L
  0 siblings, 2 replies; 50+ messages in thread
From: Liu, Yi L @ 2023-05-03  9:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, May 3, 2023 2:12 AM
> 
> On Sat, Apr 29, 2023 at 12:07:24AM +0800, Yi Liu wrote:
> > > The emulated stuff is for mdev only, it should not be confused with
> > > no-iommu
> >
> > hmmm. I guess the confusion is due to the reuse of
> > vfio_iommufd_emulated_bind().
> 
> This is probabl y not a good direction

I see. But if not reusing, then there may be a few code duplications.
I'm fine to add separate _bind/unbind() functions for noiommu devices
if Alex and you prefer it.

> > > Eg if you had a no_iommu_access value to store the access it would be
> > > fine and could serve as the 'this is no_iommu' flag
> >
> > So this no_iommu_access shall be created per iommufd bind, and call the
> > iommufd_access_create() with iommufd_access_ops. is it? If so, this is
> > not 100% the same with no_iommu flag as this flag is static after device
> > registration.
> 
> Something like that, yes
> 
> I don't think it is any real difference with the current flag, both
> are determined at the first ioctl when the iommufd is presented and
> both would state permanently until the fd close

Well, noiommu flag would be static from registration till unregistration.:-)
While no_iommu_access's life circle is between the bind and fd close. But
given that the major usage of it is during the duration between fd is bound
to iommufd and closed, so it's still possible to let no_iommu_access serve
as noiommu flag. 😊

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-05-02 18:22           ` Jason Gunthorpe
@ 2023-05-03  9:57             ` Liu, Yi L
  2023-05-03 19:41               ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Liu, Yi L @ 2023-05-03  9:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, May 3, 2023 2:22 AM
> 
> On Sat, Apr 29, 2023 at 12:13:39AM +0800, Yi Liu wrote:
> 
> > > Whoa, noiommu is inherently unsafe an only meant to expose the vfio
> > > device interface for userspace drivers that are going to do unsafe
> > > things regardless.  Enabling noiommu to work with mdev, pin pages, or
> > > anything else should not be on our agenda.  Userspaces relying on niommu
> > > get the minimum viable interface and must impose a minuscule
> > > incremental maintenance burden.  The only reason we're spending so much
> > > effort on it here is to make iommufd noiommu support equivalent to
> > > group/container noiommu support.  We should stop at that.  Thanks,
> >
> > btw. I asked a question in [1] to check if we should allow attach/detach
> > on noiommu devices. Jason has replied it. If in future noiommu userspace
> > can pin page, then such userspace will need to attach/detach ioas. So I
> > made cdev series[2] to allow attach ioas on noiommu devices. Supporting
> > it from cdev day-1 may avoid probing if attach/detach is supported or
> > not for specific devices when adding pin page for noiommu userspace.
> >
> > But now, I think such a support will not in plan, is it? If so, will it
> > be better to disallow attach/detach on noiommu devices in patch [2]?
> >
> > [1] https://lore.kernel.org/kvm/ZEa+khH0tUFStRMW@nvidia.com/
> > [2] https://lore.kernel.org/kvm/20230426150321.454465-21-yi.l.liu@intel.com/
>
> If we block it then userspace has to act quite differently, I think we
> should keep it.

Maybe kernel can simply fail the attach/detach if it happens on noiommu
devices, and noiommu userspace should just know it would fail. @Alex,
how about your opinion?

> My general idea to complete the no-iommu feature is to add a new IOCTL
> to VFIO that is 'pin iova and return dma addr' that no-iommu userspace
> would call instead of trying to abuse mlock and /proc/ to do it. That
> ioctl would use the IOAS attached to the access just like a mdev would
> do, so it has a real IOVA, but it is not a mdev.

This new ioctl may be IOMMUFD ioctl since its input is the IOAS and
addr, nothing related to the device. Is it?

> unmap callback just does nothing, as Alex says it is all still totally
> unsafe.

Sure. That's also why I added a noiommu test to avoid calling
unmap callback although it seems not possible to have unmap
callback as mdev drivers would implement it.

> 
> This just allows it use the mm a little more properly and safely (eg
> mlock() doesn't set things like page_maybe_dma_pinned(), proc doesn't
> reject things like DAX and it currently doesn't make an adjustment for
> the PCI offset stuff..) So it would make DPDK a little more robust,
> portable and make the whole VFIO no-iommu feature much easier to use.

Thanks for the explanation.

> To do that we need an iommufd access, an access ID and we need to link
> the current IOAS to the special access, like mdev, but in any mdev
> code paths.
> 
> That creating the access ID solves the reset problem as well is a nice
> side effect and is the only part of this you should focus on for now..

Yes. I get this part. We only need access ID so far to fix the noiommu
gap in hot-reset.

Regards,
Yi Liu
 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-05-03  9:57             ` Liu, Yi L
@ 2023-05-03 19:41               ` Jason Gunthorpe
  2023-05-03 22:49                 ` Alex Williamson
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2023-05-03 19:41 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> > My general idea to complete the no-iommu feature is to add a new IOCTL
> > to VFIO that is 'pin iova and return dma addr' that no-iommu userspace
> > would call instead of trying to abuse mlock and /proc/ to do it. That
> > ioctl would use the IOAS attached to the access just like a mdev would
> > do, so it has a real IOVA, but it is not a mdev.
> 
> This new ioctl may be IOMMUFD ioctl since its input is the IOAS and
> addr, nothing related to the device. Is it?

No, definately a VFIO special ioctl for VFIO no-iommu mode.

> Sure. That's also why I added a noiommu test to avoid calling
> unmap callback although it seems not possible to have unmap
> callback as mdev drivers would implement it.

Just have a special noiommu ops and use an empty unmap function and
pass that to the special noiommu access creation.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-05-03  9:48                 ` Liu, Yi L
@ 2023-05-03 19:42                   ` Jason Gunthorpe
  2023-05-08 15:46                   ` Liu, Yi L
  1 sibling, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2023-05-03 19:42 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Wed, May 03, 2023 at 09:48:36AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 3, 2023 2:12 AM
> > 
> > On Sat, Apr 29, 2023 at 12:07:24AM +0800, Yi Liu wrote:
> > > > The emulated stuff is for mdev only, it should not be confused with
> > > > no-iommu
> > >
> > > hmmm. I guess the confusion is due to the reuse of
> > > vfio_iommufd_emulated_bind().
> > 
> > This is probabl y not a good direction
> 
> I see. But if not reusing, then there may be a few code duplications.
> I'm fine to add separate _bind/unbind() functions for noiommu devices
> if Alex and you prefer it.

I think you will find there is minimal duplication

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-05-03 19:41               ` Jason Gunthorpe
@ 2023-05-03 22:49                 ` Alex Williamson
  0 siblings, 0 replies; 50+ messages in thread
From: Alex Williamson @ 2023-05-03 22:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, joro@8bytes.org, robin.murphy@arm.com,
	cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com,
	kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Wed, 3 May 2023 16:41:52 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> > > My general idea to complete the no-iommu feature is to add a new IOCTL
> > > to VFIO that is 'pin iova and return dma addr' that no-iommu userspace
> > > would call instead of trying to abuse mlock and /proc/ to do it. That
> > > ioctl would use the IOAS attached to the access just like a mdev would
> > > do, so it has a real IOVA, but it is not a mdev.  
> > 
> > This new ioctl may be IOMMUFD ioctl since its input is the IOAS and
> > addr, nothing related to the device. Is it?  
> 
> No, definately a VFIO special ioctl for VFIO no-iommu mode.

This seems like brushing off the dirty work to vfio.  Userspace drivers
relying on no-iommu are in pretty questionable territory already, do we
care if they don't have a good means to pin pages and derive the DMA
address of those pinned pages?  As I noted earlier, I'm really not
interested in expanding no-iommu, it's less important now than when it
was added (we have vIOMMUs in VMs now and more platforms with IOMMUs)
and we shouldn't be encouraging its use by further developing the
interface.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
  2023-04-27 20:15     ` Alex Williamson
@ 2023-05-08 15:32       ` Liu, Yi L
  2023-05-08 20:29         ` Alex Williamson
  0 siblings, 1 reply; 50+ messages in thread
From: Liu, Yi L @ 2023-05-08 15:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg@nvidia.com, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 28, 2023 4:16 AM
>
> > > + *
> > >   * Return: 0 on success, -errno on failure:
> > >   *	-enospc = insufficient buffer, -enodev = unsupported for device.
> > >   */
> > >  struct vfio_pci_dependent_device {
> > > -	__u32	group_id;
> > > +	union {
> > > +		__u32   group_id;
> > > +		__u32	dev_id;
> > > +#define VFIO_PCI_DEVID_NONBLOCKING	0
> > > +#define VFIO_PCI_DEVID_BLOCKING	-1
> >
> > The above description seems like it's leaning towards OWNED rather than
> > BLOCKING.
> 
> Also these should be defined relative to something defined in IOMMUFD
> rather than inventing values here.  We can't have the valid devid
> number space owned by IOMMUFD conflict with these definitions.  Thanks,

Jason has proposed to reserve all negative IDs and 0 in iommufd. In that case,
can vfio define the numbers now?

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices
  2023-05-03  9:48                 ` Liu, Yi L
  2023-05-03 19:42                   ` Jason Gunthorpe
@ 2023-05-08 15:46                   ` Liu, Yi L
  1 sibling, 0 replies; 50+ messages in thread
From: Liu, Yi L @ 2023-05-08 15:46 UTC (permalink / raw)
  To: Liu, Yi L, Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, May 3, 2023 5:49 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 3, 2023 2:12 AM
> >
> > On Sat, Apr 29, 2023 at 12:07:24AM +0800, Yi Liu wrote:
> > > > The emulated stuff is for mdev only, it should not be confused with
> > > > no-iommu
> > >
> > > hmmm. I guess the confusion is due to the reuse of
> > > vfio_iommufd_emulated_bind().
> >
> > This is probabl y not a good direction
> 
> I see. But if not reusing, then there may be a few code duplications.
> I'm fine to add separate _bind/unbind() functions for noiommu devices
> if Alex and you prefer it.
> 
> > > > Eg if you had a no_iommu_access value to store the access it would be
> > > > fine and could serve as the 'this is no_iommu' flag
> > >
> > > So this no_iommu_access shall be created per iommufd bind, and call the
> > > iommufd_access_create() with iommufd_access_ops. is it? If so, this is
> > > not 100% the same with no_iommu flag as this flag is static after device
> > > registration.
> >
> > Something like that, yes
> >
> > I don't think it is any real difference with the current flag, both
> > are determined at the first ioctl when the iommufd is presented and
> > both would state permanently until the fd close
> 
> Well, noiommu flag would be static from registration till unregistration.:-)
> While no_iommu_access's life circle is between the bind and fd close. But
> given that the major usage of it is during the duration between fd is bound
> to iommufd and closed, so it's still possible to let no_iommu_access serve
> as noiommu flag. 😊

Hi Jason,

I found another reason to use noiommu flag here.

Existing vfio will fail the vfio_device registration if there is no iommu_group
and neither CONFIG_VFIO_NOIOMMU and vfio_noiommu is set. But such
logic is compiled out when !CONFIG_VFIO_GROUP.

So cdev path needs to check noiommu explicitly. Just like below code.
It is called by vfio_device registration. If iommu_group is null, and
noiommu is not enabled, then it failed, hence vfio_device registration
failed. As we have such a check for noiommu at registration, so it seems
more reasonable to record this result in a flag instead of using
no_iommu_access. Is it?

+static inline int vfio_device_set_noiommu(struct vfio_device *device)
+{
+	struct iommu_group *iommu_group;
+
+	device->noiommu = false;
+
+	iommu_group = iommu_group_get(device->dev);
+	if (!iommu_group) {
+		if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
+			return -EINVAL;
+		device->noiommu = true;
+	} else {
+		iommu_group_put(iommu_group);
+	}
+
+	return 0;
+}

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
  2023-05-08 15:32       ` Liu, Yi L
@ 2023-05-08 20:29         ` Alex Williamson
  0 siblings, 0 replies; 50+ messages in thread
From: Alex Williamson @ 2023-05-08 20:29 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: jgg@nvidia.com, Tian, Kevin, joro@8bytes.org,
	robin.murphy@arm.com, cohuck@redhat.com, eric.auger@redhat.com,
	nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com, yi.y.sun@linux.intel.com,
	peterx@redhat.com, jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com, lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, linux-s390@vger.kernel.org,
	Hao, Xudong, Zhao, Yan Y, Xu, Terrence, Jiang, Yanting,
	Duan, Zhenzhong

On Mon, 8 May 2023 15:32:44 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 28, 2023 4:16 AM
> >  
> > > > + *
> > > >   * Return: 0 on success, -errno on failure:
> > > >   *	-enospc = insufficient buffer, -enodev = unsupported for device.
> > > >   */
> > > >  struct vfio_pci_dependent_device {
> > > > -	__u32	group_id;
> > > > +	union {
> > > > +		__u32   group_id;
> > > > +		__u32	dev_id;
> > > > +#define VFIO_PCI_DEVID_NONBLOCKING	0
> > > > +#define VFIO_PCI_DEVID_BLOCKING	-1  
> > >
> > > The above description seems like it's leaning towards OWNED rather than
> > > BLOCKING.  
> > 
> > Also these should be defined relative to something defined in IOMMUFD
> > rather than inventing values here.  We can't have the valid devid
> > number space owned by IOMMUFD conflict with these definitions.  Thanks,  
> 
> Jason has proposed to reserve all negative IDs and 0 in iommufd. In that case,
> can vfio define the numbers now?

Ok, as long as it's guaranteed that we're overlapping invalid dev-ids,
as specified by IOMMUFD, then the mapping of specific invalid dev-ids
to error values here is interface specific and can be defined here.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2023-05-08 20:30 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-26 14:54 [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Yi Liu
2023-04-26 14:54 ` [PATCH v4 1/9] vfio: Determine noiommu in vfio_device registration Yi Liu
2023-04-27  6:36   ` Tian, Kevin
2023-04-27  7:05     ` Liu, Yi L
2023-04-27 18:35       ` Alex Williamson
2023-04-26 14:54 ` [PATCH v4 2/9] vfio-iommufd: Create iommufd_access for noiommu devices Yi Liu
2023-04-27  6:39   ` Tian, Kevin
2023-04-27  6:59     ` Liu, Yi L
2023-04-27 18:32       ` Alex Williamson
2023-04-28  6:21         ` Yi Liu
2023-04-28  7:00           ` Tian, Kevin
2023-04-28  7:04             ` Yi Liu
2023-04-28 12:07           ` Jason Gunthorpe
2023-04-28 16:07             ` Yi Liu
2023-05-02 18:12               ` Jason Gunthorpe
2023-05-03  9:48                 ` Liu, Yi L
2023-05-03 19:42                   ` Jason Gunthorpe
2023-05-08 15:46                   ` Liu, Yi L
2023-04-28 16:13         ` Yi Liu
2023-05-02 18:22           ` Jason Gunthorpe
2023-05-03  9:57             ` Liu, Yi L
2023-05-03 19:41               ` Jason Gunthorpe
2023-05-03 22:49                 ` Alex Williamson
2023-04-26 14:54 ` [PATCH v4 3/9] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
2023-04-26 14:54 ` [PATCH v4 4/9] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
2023-04-27  6:39   ` Tian, Kevin
2023-04-26 14:54 ` [PATCH v4 5/9] vfio: Mark cdev usage in vfio_device Yi Liu
2023-04-27  6:40   ` Tian, Kevin
2023-04-27 18:43   ` Alex Williamson
2023-04-28  6:42     ` Yi Liu
2023-04-26 14:54 ` [PATCH v4 6/9] iommufd: Reserved -1 in the iommufd xarray Yi Liu
2023-04-27  6:41   ` Tian, Kevin
2023-04-27  7:09     ` Liu, Yi L
2023-04-27 11:55       ` Jason Gunthorpe
2023-04-26 14:54 ` [PATCH v4 7/9] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
2023-04-27  6:45   ` Tian, Kevin
2023-04-27  7:15     ` Liu, Yi L
2023-04-26 14:54 ` [PATCH v4 8/9] vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev Yi Liu
2023-04-27  6:51   ` Tian, Kevin
2023-04-27 20:04   ` Alex Williamson
2023-04-27 20:15     ` Alex Williamson
2023-05-08 15:32       ` Liu, Yi L
2023-05-08 20:29         ` Alex Williamson
2023-04-26 14:54 ` [PATCH v4 9/9] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
2023-04-27  6:54   ` Tian, Kevin
2023-04-27  7:02     ` Liu, Yi L
2023-04-27 21:55   ` Alex Williamson
2023-05-02 12:55     ` Liu, Yi L
2023-04-26 15:07 ` [PATCH v4 0/9] Enhance vfio PCI hot reset for vfio cdev device Liu, Yi L
2023-04-28  9:28 ` Jiang, Yanting

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox