linux-kernel.vger.kernel.org archive mirror
* [RFC 0/6] Add multiple address spaces support to VDUSE
@ 2025-06-06 11:50 Eugenio Pérez
  2025-06-06 11:50 ` [RFC 1/6] vduse: add v1 API definition Eugenio Pérez
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

When used by the vhost-vDPA bus driver for a VM, the control virtqueue
should be shadowed by the userspace VMM (QEMU) instead of being assigned
directly to the guest. This is because QEMU needs to know the device state
in order to start and stop the device correctly (e.g. for Live Migration).

This requires isolating the memory mapping for the control virtqueue
presented by vhost-vDPA, to prevent the guest from accessing it directly.

This series introduces multiple address spaces support for VDUSE
devices. The idea is to identify a specific address space via a
dedicated identifier: the ASID.

The VDUSE device needs to report the number of virtqueue groups, the
association of each virtqueue with a group, and the number of address
spaces supported.  Then, the vDPA driver can modify the ASID assigned to
each VQ group to isolate the memory address spaces.  This is already done
for the vdpa_sim and NVIDIA mlx5 devices.

This helps to isolate the environments for the virtqueues that will not
be assigned directly to the guest, e.g. the control virtqueue in the
case of virtio-net.

This is an RFC that has only been tested with vhost_vdpa, so it deserves
more testing with virtio_vdpa, and it still has some TODOs pending.  It
is sent to gather early feedback.

Also, to be able to test this patch, the user needs to manually revert
56e71885b034 ("vduse: Temporarily fail if control queue feature requested").

Eugenio Pérez (6):
  vduse: add v1 API definition
  vduse: add vq group support
  vduse: add vq group asid support
  vduse: send update_iotlb_v2 message
  vduse: reset group asid in reset
  vduse: bump version number

 drivers/vdpa/vdpa_user/vduse_dev.c | 307 ++++++++++++++++++++++-------
 include/uapi/linux/vduse.h         |  64 +++++-
 2 files changed, 298 insertions(+), 73 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC 1/6] vduse: add v1 API definition
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
@ 2025-06-06 11:50 ` Eugenio Pérez
  2025-06-09  1:41   ` Jason Wang
  2025-06-06 11:50 ` [RFC 2/6] vduse: add vq group support Eugenio Pérez
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

This allows functions to be defined that check the API version set by
the userland device.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/uapi/linux/vduse.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 68a627d04afa..9a56d0416bfe 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -10,6 +10,10 @@
 
 #define VDUSE_API_VERSION	0
 
+/* VQ groups and ASID support */
+
+#define VDUSE_API_VERSION_1	1
+
 /*
  * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
  * This is used for future extension.
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 2/6] vduse: add vq group support
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
  2025-06-06 11:50 ` [RFC 1/6] vduse: add v1 API definition Eugenio Pérez
@ 2025-06-06 11:50 ` Eugenio Pérez
  2025-06-09  1:55   ` Jason Wang
  2025-06-06 11:50 ` [RFC 3/6] vduse: add vq group asid support Eugenio Pérez
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

A virtqueue group is the minimal set of virtqueues that must share an
address space, and an address space identifier (ASID) can only be
attached to a specific virtqueue group.  A virtqueue is attached to a
virtqueue group for the whole life of the device.

During vDPA device allocation, the VDUSE device needs to report the
number of virtqueue groups supported.  At this moment, only vhost_vdpa
is able to make use of them.

This helps to isolate the environments for the virtqueues that will not
be assigned directly to the guest, e.g. the control virtqueue in the
case of virtio-net.

As we need to back the vq groups with a struct device for the file
operations, let's keep this number as low as possible at the moment: 2.
We can back one VQ group with the vduse device and the other one with
the vdpa device.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 44 +++++++++++++++++++++++++++++-
 include/uapi/linux/vduse.h         | 17 +++++++++++-
 2 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 6a9a37351310..6fa687bc4912 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -46,6 +46,11 @@
 #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
 #define VDUSE_MSG_DEFAULT_TIMEOUT 30
 
+/*
+ * Let's make it 2 for simplicity.
+ */
+#define VDUSE_MAX_VQ_GROUPS 2
+
 #define IRQ_UNBOUND -1
 
 struct vduse_virtqueue {
@@ -114,6 +119,7 @@ struct vduse_dev {
 	u8 status;
 	u32 vq_num;
 	u32 vq_align;
+	u32 ngroups;
 	struct vduse_umem *umem;
 	struct mutex mem_lock;
 	unsigned int bounce_size;
@@ -592,6 +598,25 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
 	return 0;
 }
 
+static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	if (dev->api_version < VDUSE_API_VERSION_1)
+		return 0;
+
+	msg.req.type = VDUSE_GET_VQ_GROUP;
+	msg.req.vq_group.index = idx;
+
+	ret = vduse_dev_msg_sync(dev, &msg);
+	if (ret)
+		return ret;
+
+	return msg.resp.vq_group.num;
+}
+
 static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
 				struct vdpa_vq_state *state)
 {
@@ -789,6 +814,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.set_vq_cb		= vduse_vdpa_set_vq_cb,
 	.set_vq_num             = vduse_vdpa_set_vq_num,
 	.get_vq_size		= vduse_vdpa_get_vq_size,
+	.get_vq_group		= vduse_get_vq_group,
 	.set_vq_ready		= vduse_vdpa_set_vq_ready,
 	.get_vq_ready		= vduse_vdpa_get_vq_ready,
 	.set_vq_state		= vduse_vdpa_set_vq_state,
@@ -1850,6 +1876,16 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 	dev->device_features = config->features;
 	dev->device_id = config->device_id;
 	dev->vendor_id = config->vendor_id;
+	if (dev->api_version >= VDUSE_API_VERSION_1) {
+		if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
+			pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
+				config->ngroups, VDUSE_MAX_VQ_GROUPS);
+			goto err_ngroups;
+		}
+		dev->ngroups = config->ngroups ?: 1;
+	} else {
+		dev->ngroups = 1;
+	}
 	dev->name = kstrdup(config->name, GFP_KERNEL);
 	if (!dev->name)
 		goto err_str;
@@ -1885,6 +1921,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 	idr_remove(&vduse_idr, dev->minor);
 err_idr:
 	kfree(dev->name);
+err_ngroups:
 err_str:
 	vduse_dev_destroy(dev);
 err:
@@ -2003,13 +2040,18 @@ static struct vduse_mgmt_dev *vduse_mgmt;
 static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
 {
 	struct vduse_vdpa *vdev;
+	__u32 ngroups = 1;
 	int ret;
 
 	if (dev->vdev)
 		return -EEXIST;
 
+	if (dev->api_version >= VDUSE_API_VERSION_1)
+		ngroups = dev->ngroups;
+
 	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
-				 &vduse_vdpa_config_ops, 1, 1, name, true);
+				 &vduse_vdpa_config_ops, ngroups, 1, name,
+				 true);
 	if (IS_ERR(vdev))
 		return PTR_ERR(vdev);
 
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 9a56d0416bfe..a779bcddac58 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -45,7 +45,8 @@ struct vduse_dev_config {
 	__u64 features;
 	__u32 vq_num;
 	__u32 vq_align;
-	__u32 reserved[13];
+	__u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
+	__u32 reserved[12];
 	__u32 config_size;
 	__u8 config[];
 };
@@ -160,6 +161,16 @@ struct vduse_vq_state_packed {
 	__u16 last_used_idx;
 };
 
+/**
+ * struct vduse_vq_group - virtqueue group
+ * @index: Index of the virtqueue
+ * @num: Group of the virtqueue
+ */
+struct vduse_vq_group {
+	__u32 index;
+	__u32 num;
+};
+
 /**
  * struct vduse_vq_info - information of a virtqueue
  * @index: virtqueue index
@@ -182,6 +193,7 @@ struct vduse_vq_info {
 	union {
 		struct vduse_vq_state_split split;
 		struct vduse_vq_state_packed packed;
+		struct vduse_vq_group group;
 	};
 	__u8 ready;
 };
@@ -274,6 +286,7 @@ enum vduse_req_type {
 	VDUSE_GET_VQ_STATE,
 	VDUSE_SET_STATUS,
 	VDUSE_UPDATE_IOTLB,
+	VDUSE_GET_VQ_GROUP,
 };
 
 /**
@@ -328,6 +341,7 @@ struct vduse_dev_request {
 		struct vduse_vq_state vq_state;
 		struct vduse_dev_status s;
 		struct vduse_iova_range iova;
+		struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
 		__u32 padding[32];
 	};
 };
@@ -350,6 +364,7 @@ struct vduse_dev_response {
 	__u32 reserved[4];
 	union {
 		struct vduse_vq_state vq_state;
+		struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
 		__u32 padding[32];
 	};
 };
-- 
2.49.0



* [RFC 3/6] vduse: add vq group asid support
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
  2025-06-06 11:50 ` [RFC 1/6] vduse: add v1 API definition Eugenio Pérez
  2025-06-06 11:50 ` [RFC 2/6] vduse: add vq group support Eugenio Pérez
@ 2025-06-06 11:50 ` Eugenio Pérez
  2025-06-12  0:30   ` Jason Wang
  2025-06-06 11:50 ` [RFC 4/6] vduse: send update_iotlb_v2 message Eugenio Pérez
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

Make one IOTLB domain per address space, and allow the driver to assign
each ASID to a vq group.  Each address space is identified by a
dedicated identifier (ASID).

During vDPA device allocation, the VDUSE device needs to report the
number of address spaces supported.  Then the vdpa driver is able to
configure them.  At this moment only vhost_vdpa is able to do it.

This helps to isolate the environments for the virtqueues that will not
be assigned directly to the guest, e.g. the control virtqueue in the
case of virtio-net.

TODO: Ideally, umem should not be duplicated.  But it is hard or
impossible to refactor everything around one single umem.  So, should we
continue with the device specifying a umem per vq group?

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 250 +++++++++++++++++++++--------
 include/uapi/linux/vduse.h         |  38 ++++-
 2 files changed, 216 insertions(+), 72 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 6fa687bc4912..d51e4f26fe72 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -51,6 +51,11 @@
  */
 #define VDUSE_MAX_VQ_GROUPS 2
 
+/*
+ * Let's make it 2 for simplicity.
+ */
+#define VDUSE_MAX_ASID 2
+
 #define IRQ_UNBOUND -1
 
 struct vduse_virtqueue {
@@ -92,7 +97,7 @@ struct vduse_dev {
 	struct vduse_vdpa *vdev;
 	struct device *dev;
 	struct vduse_virtqueue **vqs;
-	struct vduse_iova_domain *domain;
+	struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
 	char *name;
 	struct mutex lock;
 	spinlock_t msg_lock;
@@ -120,7 +125,8 @@ struct vduse_dev {
 	u32 vq_num;
 	u32 vq_align;
 	u32 ngroups;
-	struct vduse_umem *umem;
+	u32 nas;
+	struct vduse_umem *umem[VDUSE_MAX_ASID];
 	struct mutex mem_lock;
 	unsigned int bounce_size;
 	struct mutex domain_lock;
@@ -436,11 +442,14 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
 static void vduse_dev_reset(struct vduse_dev *dev)
 {
 	int i;
-	struct vduse_iova_domain *domain = dev->domain;
 
 	/* The coherent mappings are handled in vduse_dev_free_coherent() */
-	if (domain && domain->bounce_map)
-		vduse_domain_reset_bounce_map(domain);
+	for (i = 0; i < dev->nas; i++) {
+		struct vduse_iova_domain *domain = dev->domain[i];
+
+		if (domain && domain->bounce_map)
+			vduse_domain_reset_bounce_map(domain);
+	}
 
 	down_write(&dev->rwsem);
 
@@ -617,6 +626,23 @@ static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
 	return msg.resp.vq_group.num;
 }
 
+static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
+				unsigned int asid)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_dev_msg msg = { 0 };
+
+	if (dev->api_version < VDUSE_API_VERSION_1 ||
+	    group >= dev->ngroups || asid >= dev->nas)
+		return -EINVAL;
+
+	msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
+	msg.req.vq_group_asid.group = group;
+	msg.req.vq_group_asid.asid = asid;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
 static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
 				struct vdpa_vq_state *state)
 {
@@ -788,13 +814,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
 	int ret;
 
-	ret = vduse_domain_set_map(dev->domain, iotlb);
+	ret = vduse_domain_set_map(dev->domain[asid], iotlb);
 	if (ret)
 		return ret;
 
 	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
 	if (ret) {
-		vduse_domain_clear_map(dev->domain, iotlb);
+		vduse_domain_clear_map(dev->domain[asid], iotlb);
 		return ret;
 	}
 
@@ -837,6 +863,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.get_vq_affinity	= vduse_vdpa_get_vq_affinity,
 	.reset			= vduse_vdpa_reset,
 	.set_map		= vduse_vdpa_set_map,
+	.set_group_asid		= vduse_set_group_asid,
 	.free			= vduse_vdpa_free,
 };
 
@@ -845,9 +872,12 @@ static void vduse_dev_sync_single_for_device(struct device *dev,
 					     enum dma_data_direction dir)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
 
-	vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
+	for (int i = 0; i < vdev->nas; i++) {
+		struct vduse_iova_domain *domain = vdev->domain[i];
+
+		vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
+	}
 }
 
 static void vduse_dev_sync_single_for_cpu(struct device *dev,
@@ -855,9 +885,12 @@ static void vduse_dev_sync_single_for_cpu(struct device *dev,
 					     enum dma_data_direction dir)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
 
-	vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
+	for (int i = 0; i < vdev->nas; i++) {
+		struct vduse_iova_domain *domain = vdev->domain[i];
+
+		vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
+	}
 }
 
 static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
@@ -866,7 +899,7 @@ static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
 				     unsigned long attrs)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
+	struct vduse_iova_domain *domain = vdev->domain[0];
 
 	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
 }
@@ -876,7 +909,7 @@ static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
 				unsigned long attrs)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
+	struct vduse_iova_domain *domain = vdev->domain[0];
 
 	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
 }
@@ -886,7 +919,7 @@ static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
 					unsigned long attrs)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
+	struct vduse_iova_domain *domain = vdev->domain[0];
 	unsigned long iova;
 	void *addr;
 
@@ -906,17 +939,25 @@ static void vduse_dev_free_coherent(struct device *dev, size_t size,
 					unsigned long attrs)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
+	struct vduse_iova_domain *domain = vdev->domain[0];
 
 	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
 }
 
+/* TODO check if this is correct */
 static size_t vduse_dev_max_mapping_size(struct device *dev)
 {
 	struct vduse_dev *vdev = dev_to_vduse(dev);
-	struct vduse_iova_domain *domain = vdev->domain;
+	size_t max_mapping_size = 0;
+
+	for (int i = 0; i < vdev->nas; i++) {
+		struct vduse_iova_domain *domain = vdev->domain[i];
 
-	return domain->bounce_size;
+		if (domain->bounce_size > max_mapping_size)
+			max_mapping_size = domain->bounce_size;
+	}
+
+	return max_mapping_size;
 }
 
 static const struct dma_map_ops vduse_dev_dma_ops = {
@@ -1054,31 +1095,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 	return ret;
 }
 
-static int vduse_dev_dereg_umem(struct vduse_dev *dev,
+static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
 				u64 iova, u64 size)
 {
 	int ret;
 
 	mutex_lock(&dev->mem_lock);
 	ret = -ENOENT;
-	if (!dev->umem)
+	if (!dev->umem[asid])
 		goto unlock;
 
 	ret = -EINVAL;
-	if (!dev->domain)
+	if (!dev->domain[asid])
 		goto unlock;
 
-	if (dev->umem->iova != iova || size != dev->domain->bounce_size)
+	if (dev->umem[asid]->iova != iova ||
+	    size != dev->domain[asid]->bounce_size)
 		goto unlock;
 
-	vduse_domain_remove_user_bounce_pages(dev->domain);
-	unpin_user_pages_dirty_lock(dev->umem->pages,
-				    dev->umem->npages, true);
-	atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
-	mmdrop(dev->umem->mm);
-	vfree(dev->umem->pages);
-	kfree(dev->umem);
-	dev->umem = NULL;
+	vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
+	unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
+				    dev->umem[asid]->npages, true);
+	atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
+	mmdrop(dev->umem[asid]->mm);
+	vfree(dev->umem[asid]->pages);
+	kfree(dev->umem[asid]);
+	dev->umem[asid] = NULL;
 	ret = 0;
 unlock:
 	mutex_unlock(&dev->mem_lock);
@@ -1086,7 +1128,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
 }
 
 static int vduse_dev_reg_umem(struct vduse_dev *dev,
-			      u64 iova, u64 uaddr, u64 size)
+			      u32 asid, u64 iova, u64 uaddr, u64 size)
 {
 	struct page **page_list = NULL;
 	struct vduse_umem *umem = NULL;
@@ -1094,14 +1136,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 	unsigned long npages, lock_limit;
 	int ret;
 
-	if (!dev->domain || !dev->domain->bounce_map ||
-	    size != dev->domain->bounce_size ||
+	if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
+	    size != dev->domain[asid]->bounce_size ||
 	    iova != 0 || uaddr & ~PAGE_MASK)
 		return -EINVAL;
 
 	mutex_lock(&dev->mem_lock);
 	ret = -EEXIST;
-	if (dev->umem)
+	if (dev->umem[asid])
 		goto unlock;
 
 	ret = -ENOMEM;
@@ -1125,7 +1167,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 		goto out;
 	}
 
-	ret = vduse_domain_add_user_bounce_pages(dev->domain,
+	ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
 						 page_list, pinned);
 	if (ret)
 		goto out;
@@ -1138,7 +1180,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 	umem->mm = current->mm;
 	mmgrab(current->mm);
 
-	dev->umem = umem;
+	dev->umem[asid] = umem;
 out:
 	if (ret && pinned > 0)
 		unpin_user_pages(page_list, pinned);
@@ -1181,26 +1223,42 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 
 	switch (cmd) {
 	case VDUSE_IOTLB_GET_FD: {
-		struct vduse_iotlb_entry entry;
+		struct vduse_iotlb_entry_v2 entry = {};
+		struct vduse_iotlb_entry entry_old;
 		struct vhost_iotlb_map *map;
 		struct vdpa_map_file *map_file;
 		struct file *f = NULL;
 
 		ret = -EFAULT;
-		if (copy_from_user(&entry, argp, sizeof(entry)))
-			break;
+		if (dev->api_version >= VDUSE_API_VERSION_1) {
+			if (copy_from_user(&entry, argp, sizeof(entry)))
+				break;
+		} else {
+			if (copy_from_user(&entry_old, argp,
+					   sizeof(entry_old)))
+				break;
+
+			entry.offset = entry_old.offset;
+			entry.start = entry_old.start;
+			entry.last = entry_old.last;
+			entry.perm = entry_old.perm;
+		}
 
 		ret = -EINVAL;
 		if (entry.start > entry.last)
 			break;
 
+		if (entry.asid >= dev->nas)
+			break;
+
 		mutex_lock(&dev->domain_lock);
-		if (!dev->domain) {
+		/* TODO accessing an array with idx from userspace, mitigations? */
+		if (!dev->domain[entry.asid]) {
 			mutex_unlock(&dev->domain_lock);
 			break;
 		}
-		spin_lock(&dev->domain->iotlb_lock);
-		map = vhost_iotlb_itree_first(dev->domain->iotlb,
+		spin_lock(&dev->domain[entry.asid]->iotlb_lock);
+		map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
 					      entry.start, entry.last);
 		if (map) {
 			map_file = (struct vdpa_map_file *)map->opaque;
@@ -1210,7 +1268,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 			entry.last = map->last;
 			entry.perm = map->perm;
 		}
-		spin_unlock(&dev->domain->iotlb_lock);
+		spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
 		mutex_unlock(&dev->domain_lock);
 		ret = -EINVAL;
 		if (!f)
@@ -1360,12 +1418,18 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 			break;
 
 		ret = -EINVAL;
+		/* TODO: Using asid from userspace, need to mitigate? */
 		if (!is_mem_zero((const char *)umem.reserved,
-				 sizeof(umem.reserved)))
+				 sizeof(umem.reserved)) ||
+		    !is_mem_zero((const char *)umem.reserved2,
+				 sizeof(umem.reserved2)) ||
+		    (dev->api_version < VDUSE_API_VERSION_1 &&
+		     umem.asid != 0) ||
+		     umem.asid >= dev->nas)
 			break;
 
 		mutex_lock(&dev->domain_lock);
-		ret = vduse_dev_reg_umem(dev, umem.iova,
+		ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
 					 umem.uaddr, umem.size);
 		mutex_unlock(&dev->domain_lock);
 		break;
@@ -1378,15 +1442,23 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 			break;
 
 		ret = -EINVAL;
+		/* TODO: Using asid from userspace, need to mitigate? */
 		if (!is_mem_zero((const char *)umem.reserved,
-				 sizeof(umem.reserved)))
+				 sizeof(umem.reserved)) ||
+		    !is_mem_zero((const char *)umem.reserved2,
+				 sizeof(umem.reserved2)) ||
+		    (dev->api_version < VDUSE_API_VERSION_1 &&
+		     umem.asid != 0) ||
+		     umem.asid >= dev->nas)
 			break;
+
 		mutex_lock(&dev->domain_lock);
-		ret = vduse_dev_dereg_umem(dev, umem.iova,
+		ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
 					   umem.size);
 		mutex_unlock(&dev->domain_lock);
 		break;
 	}
+	/* TODO can we merge this with GET_FD? */
 	case VDUSE_IOTLB_GET_INFO: {
 		struct vduse_iova_info info;
 		struct vhost_iotlb_map *map;
@@ -1399,27 +1471,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		if (info.start > info.last)
 			break;
 
+		if (info.asid >= dev->nas)
+			break;
+
 		if (!is_mem_zero((const char *)info.reserved,
 				 sizeof(info.reserved)))
 			break;
 
 		mutex_lock(&dev->domain_lock);
-		if (!dev->domain) {
+		/* TODO asid comes from userspace. mitigations? */
+		if (!dev->domain[info.asid]) {
 			mutex_unlock(&dev->domain_lock);
 			break;
 		}
-		spin_lock(&dev->domain->iotlb_lock);
-		map = vhost_iotlb_itree_first(dev->domain->iotlb,
+		spin_lock(&dev->domain[info.asid]->iotlb_lock);
+		map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
 					      info.start, info.last);
 		if (map) {
 			info.start = map->start;
 			info.last = map->last;
 			info.capability = 0;
-			if (dev->domain->bounce_map && map->start == 0 &&
-			    map->last == dev->domain->bounce_size - 1)
+			if (dev->domain[info.asid]->bounce_map &&
+			    map->start == 0 &&
+			    map->last == dev->domain[info.asid]->bounce_size - 1)
 				info.capability |= VDUSE_IOVA_CAP_UMEM;
 		}
-		spin_unlock(&dev->domain->iotlb_lock);
+		spin_unlock(&dev->domain[info.asid]->iotlb_lock);
 		mutex_unlock(&dev->domain_lock);
 		if (!map)
 			break;
@@ -1444,8 +1521,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
 	struct vduse_dev *dev = file->private_data;
 
 	mutex_lock(&dev->domain_lock);
-	if (dev->domain)
-		vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
+	for (int i = 0; i < dev->nas; i++) {
+		if (dev->domain[i]) {
+			vduse_dev_dereg_umem(dev, i, 0,
+					     dev->domain[i]->bounce_size);
+			dev->domain[i] = NULL;
+		}
+	}
 	mutex_unlock(&dev->domain_lock);
 	spin_lock(&dev->msg_lock);
 	/* Make sure the inflight messages can processed after reconncection */
@@ -1715,8 +1797,10 @@ static int vduse_destroy_dev(char *name)
 	idr_remove(&vduse_idr, dev->minor);
 	kvfree(dev->config);
 	vduse_dev_deinit_vqs(dev);
-	if (dev->domain)
-		vduse_domain_destroy(dev->domain);
+	for (int i = 0; i < dev->nas; i++) {
+		if (dev->domain[i])
+			vduse_domain_destroy(dev->domain[i]);
+	}
 	kfree(dev->name);
 	vduse_dev_destroy(dev);
 	module_put(THIS_MODULE);
@@ -1824,7 +1908,7 @@ static ssize_t bounce_size_store(struct device *device,
 
 	ret = -EPERM;
 	mutex_lock(&dev->domain_lock);
-	if (dev->domain)
+	if (dev->domain[0] || dev->domain[1])
 		goto unlock;
 
 	ret = kstrtouint(buf, 10, &bounce_size);
@@ -1882,9 +1966,18 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 				config->ngroups, VDUSE_MAX_VQ_GROUPS);
 			goto err_ngroups;
 		}
+
+		if (config->nas > VDUSE_MAX_ASID) {
+			pr_err("Not creating a VDUSE device with %u asid. Max: %u",
+				config->nas, VDUSE_MAX_ASID);
+			goto err_nas;
+		}
+
 		dev->ngroups = config->ngroups ?: 1;
+		dev->nas = config->nas ?: 1;
 	} else {
 		dev->ngroups = 1;
+		dev->nas = 1;
 	}
 	dev->name = kstrdup(config->name, GFP_KERNEL);
 	if (!dev->name)
@@ -1923,6 +2016,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 	kfree(dev->name);
 err_ngroups:
 err_str:
+err_nas:
 	vduse_dev_destroy(dev);
 err:
 	return ret;
@@ -2015,7 +2109,6 @@ static int vduse_open(struct inode *inode, struct file *file)
 	if (!control)
 		return -ENOMEM;
 
-	control->api_version = VDUSE_API_VERSION;
 	file->private_data = control;
 
 	return 0;
@@ -2040,17 +2133,15 @@ static struct vduse_mgmt_dev *vduse_mgmt;
 static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
 {
 	struct vduse_vdpa *vdev;
-	__u32 ngroups = 1;
+	__u32 ngroups = dev->ngroups;
 	int ret;
 
 	if (dev->vdev)
 		return -EEXIST;
 
-	if (dev->api_version >= VDUSE_API_VERSION_1)
-		ngroups = dev->ngroups;
-
+	/* TODO do we need to store ngroups and nas? vdpa device already store it for us */
 	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
-				 &vduse_vdpa_config_ops, ngroups, 1, name,
+				 &vduse_vdpa_config_ops, ngroups, dev->nas, name,
 				 true);
 	if (IS_ERR(vdev))
 		return PTR_ERR(vdev);
@@ -2088,11 +2179,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
 		return ret;
 
 	mutex_lock(&dev->domain_lock);
-	if (!dev->domain)
-		dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
-						  dev->bounce_size);
+	ret = 0;
+
+	/* TODO we could delay the creation of the domain */
+	for (int i = 0; i < dev->nas; ++i) {
+		if (!dev->domain[i])
+			dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
+							     dev->bounce_size);
+		if (!dev->domain[i]) {
+			ret = -ENOMEM;
+			for (int j = 0; j < i; ++j)
+				vduse_domain_destroy(dev->domain[j]);
+			goto err_domain;
+		}
+	}
+
 	mutex_unlock(&dev->domain_lock);
-	if (!dev->domain) {
+	if (ret == -ENOMEM) {
 		put_device(&dev->vdev->vdpa.dev);
 		return -ENOMEM;
 	}
@@ -2101,13 +2204,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
 	if (ret) {
 		put_device(&dev->vdev->vdpa.dev);
 		mutex_lock(&dev->domain_lock);
-		vduse_domain_destroy(dev->domain);
-		dev->domain = NULL;
+		for (int i = 0; i < dev->nas; i++) {
+			if (dev->domain[i]) {
+				vduse_domain_destroy(dev->domain[i]);
+				dev->domain[i] = NULL;
+			}
+		}
 		mutex_unlock(&dev->domain_lock);
 		return ret;
 	}
 
 	return 0;
+
+err_domain:
+	/* TODO do I need to call put_device? */
+	mutex_unlock(&dev->domain_lock);
+	return ret;
 }
 
 static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index a779bcddac58..3a17a0b4e938 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -46,7 +46,8 @@ struct vduse_dev_config {
 	__u32 vq_num;
 	__u32 vq_align;
 	__u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
-	__u32 reserved[12];
+	__u32 nas; /* if VDUSE_API_VERSION >= 1 */
+	__u32 reserved[11];
 	__u32 config_size;
 	__u8 config[];
 };
@@ -81,6 +82,17 @@ struct vduse_iotlb_entry {
 	__u8 perm;
 };
 
+struct vduse_iotlb_entry_v2 {
+	__u64 offset;
+	__u64 start;
+	__u64 last;
+	__u32 asid;
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+	__u8 perm;
+};
+
 /*
  * Find the first IOVA region that overlaps with the range [start, last]
  * and return the corresponding file descriptor. Return -EINVAL means the
@@ -171,6 +183,16 @@ struct vduse_vq_group {
 	__u32 num;
 };
 
+/**
+ * struct vduse_vq_group_asid - ASID of a virtqueue group
+ * @group: Index of the virtqueue group
+ * @asid: Address space ID of the group
+ */
+struct vduse_vq_group_asid {
+	__u32 group;
+	__u32 asid;
+};
+
 /**
  * struct vduse_vq_info - information of a virtqueue
  * @index: virtqueue index
@@ -231,7 +253,9 @@ struct vduse_vq_eventfd {
  * @uaddr: start address of userspace memory, it must be aligned to page size
  * @iova: start of the IOVA region
  * @size: size of the IOVA region
+ * @asid: Address space ID of the IOVA region
  * @reserved: for future use, needs to be initialized to zero
+ * @reserved2: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
  * ioctls to register/de-register userspace memory for IOVA regions
@@ -240,7 +264,9 @@ struct vduse_iova_umem {
 	__u64 uaddr;
 	__u64 iova;
 	__u64 size;
-	__u64 reserved[3];
+	__u32 asid;
+	__u32 reserved[1];
+	__u64 reserved2[2];
 };
 
 /* Register userspace memory for IOVA regions */
@@ -264,7 +290,8 @@ struct vduse_iova_info {
 	__u64 last;
 #define VDUSE_IOVA_CAP_UMEM (1 << 0)
 	__u64 capability;
-	__u64 reserved[3];
+	__u64 asid; /* Only if device API version >= 1 */
+	__u64 reserved[2];
 };
 
 /*
@@ -287,6 +314,7 @@ enum vduse_req_type {
 	VDUSE_SET_STATUS,
 	VDUSE_UPDATE_IOTLB,
 	VDUSE_GET_VQ_GROUP,
+	VDUSE_SET_VQ_GROUP_ASID,
 };
 
 /**
@@ -342,6 +370,8 @@ struct vduse_dev_request {
 		struct vduse_dev_status s;
 		struct vduse_iova_range iova;
 		struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
+		/* Only if vduse api version >= 1 */
+		struct vduse_vq_group_asid vq_group_asid;
 		__u32 padding[32];
 	};
 };
@@ -365,6 +395,8 @@ struct vduse_dev_response {
 	union {
 		struct vduse_vq_state vq_state;
 		struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
+		/* Only if vduse api version >= 1 */
+		struct vduse_vq_group_asid vq_group_asid;
 		__u32 padding[32];
 	};
 };
-- 
2.49.0



* [RFC 4/6] vduse: send update_iotlb_v2 message
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (2 preceding siblings ...)
  2025-06-06 11:50 ` [RFC 3/6] vduse: add vq group asid support Eugenio Pérez
@ 2025-06-06 11:50 ` Eugenio Pérez
  2025-06-06 11:50 ` [RFC 5/6] vduse: reset group asid in reset Eugenio Pérez
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

This new version reports the translation ASID, which the device needs
in order to support multiple address spaces.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 14 ++++++++++----
 include/uapi/linux/vduse.h         |  7 +++++++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index d51e4f26fe72..151c6d133e76 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -317,7 +317,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
 	return vduse_dev_msg_sync(dev, &msg);
 }
 
-static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
 				  u64 start, u64 last)
 {
 	struct vduse_dev_msg msg = { 0 };
@@ -326,8 +326,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
 		return -EINVAL;
 
 	msg.req.type = VDUSE_UPDATE_IOTLB;
-	msg.req.iova.start = start;
-	msg.req.iova.last = last;
+	if (dev->api_version < VDUSE_API_VERSION_1) {
+		msg.req.iova.start = start;
+		msg.req.iova.last = last;
+	} else {
+		msg.req.iova_v2.start = start;
+		msg.req.iova_v2.last = last;
+		msg.req.iova_v2.asid = asid;
+	}
 
 	return vduse_dev_msg_sync(dev, &msg);
 }
@@ -818,7 +824,7 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
 	if (ret)
 		return ret;
 
-	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+	ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
 	if (ret) {
 		vduse_domain_clear_map(dev->domain[asid], iotlb);
 		return ret;
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 3a17a0b4e938..a7c979591b2e 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -349,6 +349,12 @@ struct vduse_iova_range {
 	__u64 last;
 };
 
+struct vduse_iova_range_v2 {
+	__u64 start;
+	__u64 last;
+	__u32 asid;
+};
+
 /**
  * struct vduse_dev_request - control request
  * @type: request type
@@ -369,6 +375,7 @@ struct vduse_dev_request {
 		struct vduse_vq_state vq_state;
 		struct vduse_dev_status s;
 		struct vduse_iova_range iova;
+		struct vduse_iova_range_v2 iova_v2;
 		struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
 		/* Only if vduse api version >= 1 */
 		struct vduse_vq_group_asid vq_group_asid;
-- 
2.49.0



* [RFC 5/6] vduse: reset group asid in reset
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (3 preceding siblings ...)
  2025-06-06 11:50 ` [RFC 4/6] vduse: send update_iotlb_v2 message Eugenio Pérez
@ 2025-06-06 11:50 ` Eugenio Pérez
  2025-06-12  0:32   ` Jason Wang
  2025-06-06 11:50 ` [RFC 6/6] vduse: bump version number Eugenio Pérez
  2025-06-09  5:51 ` [RFC 0/6] Add multiple address spaces support to VDUSE Christoph Hellwig
  6 siblings, 1 reply; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

This matches the expected behavior of vdpa_sim and mlx5 devices.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 151c6d133e76..5f0032df43b8 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -445,6 +445,9 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
 	return mask;
 }
 
+static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
+				unsigned int asid);
+
 static void vduse_dev_reset(struct vduse_dev *dev)
 {
 	int i;
@@ -457,6 +460,9 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 			vduse_domain_reset_bounce_map(domain);
 	}
 
+	for (i = 0; i < dev->ngroups; i++)
+		vduse_set_group_asid(&dev->vdev->vdpa, i, 0);
+
 	down_write(&dev->rwsem);
 
 	dev->status = 0;
-- 
2.49.0



* [RFC 6/6] vduse: bump version number
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (4 preceding siblings ...)
  2025-06-06 11:50 ` [RFC 5/6] vduse: reset group asid in reset Eugenio Pérez
@ 2025-06-06 11:50 ` Eugenio Pérez
  2025-06-09  5:51 ` [RFC 0/6] Add multiple address spaces support to VDUSE Christoph Hellwig
  6 siblings, 0 replies; 30+ messages in thread
From: Eugenio Pérez @ 2025-06-06 11:50 UTC (permalink / raw)
  To: jasowang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Eugenio Pérez,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

So VDUSE devices can use the new features.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 5f0032df43b8..aa2d25caa933 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -2054,7 +2054,7 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
 			break;
 
 		ret = -EINVAL;
-		if (api_version > VDUSE_API_VERSION)
+		if (api_version > VDUSE_API_VERSION_1)
 			break;
 
 		ret = 0;
@@ -2121,6 +2121,7 @@ static int vduse_open(struct inode *inode, struct file *file)
 	if (!control)
 		return -ENOMEM;
 
+	control->api_version = VDUSE_API_VERSION_1;
 	file->private_data = control;
 
 	return 0;
-- 
2.49.0



* Re: [RFC 1/6] vduse: add v1 API definition
  2025-06-06 11:50 ` [RFC 1/6] vduse: add v1 API definition Eugenio Pérez
@ 2025-06-09  1:41   ` Jason Wang
  2025-06-09  1:50     ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-09  1:41 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This allows to define all functions checking the API version set by the
> userland device.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

It might be worth clarifying how it works.

For example,

1) would VDUSE behave differently, or is it just some new ioctls?
2) If VDUSE behaves differently, do we need an ioctl to set the API
version for backward compatibility?
3) If it means a bunch of new ioctls, could userspace just probe the
new ioctls instead?

Thanks

> ---
>  include/uapi/linux/vduse.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 68a627d04afa..9a56d0416bfe 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -10,6 +10,10 @@
>
>  #define VDUSE_API_VERSION      0
>
> +/* VQ groups and ASID support */
> +
> +#define VDUSE_API_VERSION_1    1
> +
>  /*
>   * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
>   * This is used for future extension.
> --
> 2.49.0
>



* Re: [RFC 1/6] vduse: add v1 API definition
  2025-06-09  1:41   ` Jason Wang
@ 2025-06-09  1:50     ` Jason Wang
  2025-06-09  6:10       ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-09  1:50 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This allows to define all functions checking the API version set by the
> > userland device.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> It might be worth clarifying how it works.
>
> For example,
>
> 1) would VDUSE behave differently, or is it just some new ioctls?
> 2) If VDUSE behaves differently, do we need an ioctl to set the API
> version for backward compatibility?

Spoke too fast, there's actually a VDUSE_SET_API_VERSION.

I think we need to think if it complicates the migration compatibility or not.

> 3) If it means a bunch of new ioctls, could userspace just probe the
> new ioctls instead?
>
> Thanks

Thanks

>
> > ---
> >  include/uapi/linux/vduse.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index 68a627d04afa..9a56d0416bfe 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -10,6 +10,10 @@
> >
> >  #define VDUSE_API_VERSION      0
> >
> > +/* VQ groups and ASID support */
> > +
> > +#define VDUSE_API_VERSION_1    1
> > +
> >  /*
> >   * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
> >   * This is used for future extension.
> > --
> > 2.49.0
> >



* Re: [RFC 2/6] vduse: add vq group support
  2025-06-06 11:50 ` [RFC 2/6] vduse: add vq group support Eugenio Pérez
@ 2025-06-09  1:55   ` Jason Wang
  2025-06-09  6:03     ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-09  1:55 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The virtqueue group is the minimal set of virtqueues that must share an
> address space.  And the address space identifier could only be attached
> to a specific virtqueue group.  The virtqueue is attached to a
> virtqueue group for all the life of the device.
>
> During vDPA device allocation, the VDUSE device needs to report the
> number of virtqueue groups supported. At this moment only vhost_vdpa is
> able to do it.
>
> This helps to isolate the environments for the virtqueue that will not
> be assigned directly. E.g in the case of virtio-net, the control
> virtqueue will not be assigned directly to guest.
>
> As we need to back the vq groups with a struct device for the file
> operations, let's keep this number as low as possible at the moment: 2.
> We can back one VQ group with the vduse device and the other one with
> the vdpa device.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 44 +++++++++++++++++++++++++++++-
>  include/uapi/linux/vduse.h         | 17 +++++++++++-
>  2 files changed, 59 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 6a9a37351310..6fa687bc4912 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -46,6 +46,11 @@
>  #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
>  #define VDUSE_MSG_DEFAULT_TIMEOUT 30
>
> +/*
> + * Let's make it 2 for simplicity.
> + */
> +#define VDUSE_MAX_VQ_GROUPS 2
> +
>  #define IRQ_UNBOUND -1
>
>  struct vduse_virtqueue {
> @@ -114,6 +119,7 @@ struct vduse_dev {
>         u8 status;
>         u32 vq_num;
>         u32 vq_align;
> +       u32 ngroups;
>         struct vduse_umem *umem;
>         struct mutex mem_lock;
>         unsigned int bounce_size;
> @@ -592,6 +598,25 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>         return 0;
>  }
>
> +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> +{
> +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +       struct vduse_dev_msg msg = { 0 };
> +       int ret;
> +
> +       if (dev->api_version < VDUSE_API_VERSION_1)
> +               return 0;
> +
> +       msg.req.type = VDUSE_GET_VQ_GROUP;
> +       msg.req.vq_group.index = idx;

Considering there will be a set_group_asid request, could the kernel
cache the result so we don't need to bother with requests from
userspace?

> +
> +       ret = vduse_dev_msg_sync(dev, &msg);
> +       if (ret)
> +               return ret;
> +
> +       return msg.resp.vq_group.num;
> +}
> +
>  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>                                 struct vdpa_vq_state *state)
>  {
> @@ -789,6 +814,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>         .set_vq_cb              = vduse_vdpa_set_vq_cb,
>         .set_vq_num             = vduse_vdpa_set_vq_num,
>         .get_vq_size            = vduse_vdpa_get_vq_size,
> +       .get_vq_group           = vduse_get_vq_group,
>         .set_vq_ready           = vduse_vdpa_set_vq_ready,
>         .get_vq_ready           = vduse_vdpa_get_vq_ready,
>         .set_vq_state           = vduse_vdpa_set_vq_state,
> @@ -1850,6 +1876,16 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>         dev->device_features = config->features;
>         dev->device_id = config->device_id;
>         dev->vendor_id = config->vendor_id;
> +       if (dev->api_version >= 1) {
> +               if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
> +                       pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
> +                               config->ngroups, VDUSE_MAX_VQ_GROUPS);
> +                       goto err_ngroups;
> +               }
> +               dev->ngroups = config->ngroups ?: 1;
> +       } else {
> +               dev->ngroups = 1;
> +       }
>         dev->name = kstrdup(config->name, GFP_KERNEL);
>         if (!dev->name)
>                 goto err_str;
> @@ -1885,6 +1921,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>         idr_remove(&vduse_idr, dev->minor);
>  err_idr:
>         kfree(dev->name);
> +err_ngroups:
>  err_str:
>         vduse_dev_destroy(dev);
>  err:
> @@ -2003,13 +2040,18 @@ static struct vduse_mgmt_dev *vduse_mgmt;
>  static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>  {
>         struct vduse_vdpa *vdev;
> +       __u32 ngroups = 1;
>         int ret;
>
>         if (dev->vdev)
>                 return -EEXIST;
>
> > +       if (dev->api_version >= VDUSE_API_VERSION_1)
> > +               ngroups = dev->ngroups;
> +
>         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> -                                &vduse_vdpa_config_ops, 1, 1, name, true);
> +                                &vduse_vdpa_config_ops, ngroups, 1, name,
> +                                true);
>         if (IS_ERR(vdev))
>                 return PTR_ERR(vdev);
>
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 9a56d0416bfe..a779bcddac58 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -45,7 +45,8 @@ struct vduse_dev_config {
>         __u64 features;
>         __u32 vq_num;
>         __u32 vq_align;
> -       __u32 reserved[13];
> +       __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> +       __u32 reserved[12];
>         __u32 config_size;
>         __u8 config[];
>  };
> @@ -160,6 +161,16 @@ struct vduse_vq_state_packed {
>         __u16 last_used_idx;
>  };
>
> +/**
> + * struct vduse_vq_group - virtqueue group
> > + * @index: Index of the virtqueue
> > + * @num: Virtqueue group of the virtqueue
> + */
> +struct vduse_vq_group {
> +       __u32 index;
> +       __u32 num;
> +};
> +
>  /**
>   * struct vduse_vq_info - information of a virtqueue
>   * @index: virtqueue index
> @@ -182,6 +193,7 @@ struct vduse_vq_info {
>         union {
>                 struct vduse_vq_state_split split;
>                 struct vduse_vq_state_packed packed;
> +               struct vduse_vq_group group;
>         };
>         __u8 ready;
>  };
> @@ -274,6 +286,7 @@ enum vduse_req_type {
>         VDUSE_GET_VQ_STATE,
>         VDUSE_SET_STATUS,
>         VDUSE_UPDATE_IOTLB,
> +       VDUSE_GET_VQ_GROUP,
>  };
>
>  /**
> @@ -328,6 +341,7 @@ struct vduse_dev_request {
>                 struct vduse_vq_state vq_state;
>                 struct vduse_dev_status s;
>                 struct vduse_iova_range iova;
> +               struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
>                 __u32 padding[32];

This should be [31] otherwise we break the uAPI?

>         };
>  };
> @@ -350,6 +364,7 @@ struct vduse_dev_response {
>         __u32 reserved[4];
>         union {
>                 struct vduse_vq_state vq_state;
> +               struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
>                 __u32 padding[32];

And here?

Thanks

>         };
>  };
> --
> 2.49.0
>



* Re: [RFC 0/6] Add multiple address spaces support to VDUSE
  2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (5 preceding siblings ...)
  2025-06-06 11:50 ` [RFC 6/6] vduse: bump version number Eugenio Pérez
@ 2025-06-09  5:51 ` Christoph Hellwig
  2025-06-20  6:25   ` Eugenio Perez Martin
  6 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2025-06-09  5:51 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: jasowang, Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

Before you add new features to vduse, please remove the broken abuse of
the DMA API first.  Without that no new feature work should go into
this code.



* Re: [RFC 2/6] vduse: add vq group support
  2025-06-09  1:55   ` Jason Wang
@ 2025-06-09  6:03     ` Eugenio Perez Martin
  2025-06-10  8:53       ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-06-09  6:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Jun 9, 2025 at 3:55 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The virtqueue group is the minimal set of virtqueues that must share an
> > address space.  And the address space identifier could only be attached
> > to a specific virtqueue group.  The virtqueue is attached to a
> > virtqueue group for all the life of the device.
> >
> > During vDPA device allocation, the VDUSE device needs to report the
> > number of virtqueue groups supported. At this moment only vhost_vdpa is
> > able to do it.
> >
> > This helps to isolate the environments for the virtqueue that will not
> > be assigned directly. E.g in the case of virtio-net, the control
> > virtqueue will not be assigned directly to guest.
> >
> > As we need to back the vq groups with a struct device for the file
> > operations, let's keep this number as low as possible at the moment: 2.
> > We can back one VQ group with the vduse device and the other one with
> > the vdpa device.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 44 +++++++++++++++++++++++++++++-
> >  include/uapi/linux/vduse.h         | 17 +++++++++++-
> >  2 files changed, 59 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index 6a9a37351310..6fa687bc4912 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -46,6 +46,11 @@
> >  #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> >  #define VDUSE_MSG_DEFAULT_TIMEOUT 30
> >
> > +/*
> > + * Let's make it 2 for simplicity.
> > + */
> > +#define VDUSE_MAX_VQ_GROUPS 2
> > +
> >  #define IRQ_UNBOUND -1
> >
> >  struct vduse_virtqueue {
> > @@ -114,6 +119,7 @@ struct vduse_dev {
> >         u8 status;
> >         u32 vq_num;
> >         u32 vq_align;
> > +       u32 ngroups;
> >         struct vduse_umem *umem;
> >         struct mutex mem_lock;
> >         unsigned int bounce_size;
> > @@ -592,6 +598,25 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> >         return 0;
> >  }
> >
> > +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +       struct vduse_dev_msg msg = { 0 };
> > +       int ret;
> > +
> > +       if (dev->api_version < VDUSE_API_VERSION_1)
> > +               return 0;
> > +
> > +       msg.req.type = VDUSE_GET_VQ_GROUP;
> > +       msg.req.vq_group.index = idx;
>
> Considering there will be a set_group_asid request, could the kernel
> cache the result so we don't need to bother with requests from
> userspace?
>

Yes we can, actually a previous version did it. But what's the use? It
is not used in the dataplane, so we reduce complexity if we don't
store it.

What's more, in the case of the net device, the vq number -> vq group
association can change across a reset, as the CVQ is queue #2 when MQ
is not negotiated and the last queue when it is. We would need to code
when to reset this association, so complexity grows even more. And the
vq groups are not queried by QEMU after that point anyway.

> > +
> > +       ret = vduse_dev_msg_sync(dev, &msg);
> > +       if (ret)
> > +               return ret;
> > +
> > +       return msg.resp.vq_group.num;
> > +}
> > +
> >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >                                 struct vdpa_vq_state *state)
> >  {
> > @@ -789,6 +814,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >         .set_vq_cb              = vduse_vdpa_set_vq_cb,
> >         .set_vq_num             = vduse_vdpa_set_vq_num,
> >         .get_vq_size            = vduse_vdpa_get_vq_size,
> > +       .get_vq_group           = vduse_get_vq_group,
> >         .set_vq_ready           = vduse_vdpa_set_vq_ready,
> >         .get_vq_ready           = vduse_vdpa_get_vq_ready,
> >         .set_vq_state           = vduse_vdpa_set_vq_state,
> > @@ -1850,6 +1876,16 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> >         dev->device_features = config->features;
> >         dev->device_id = config->device_id;
> >         dev->vendor_id = config->vendor_id;
> > +       if (dev->api_version >= 1) {
> > +               if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
> > +                       pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
> > +                               config->ngroups, VDUSE_MAX_VQ_GROUPS);
> > +                       goto err_ngroups;
> > +               }
> > +               dev->ngroups = config->ngroups ?: 1;
> > +       } else {
> > +               dev->ngroups = 1;
> > +       }
> >         dev->name = kstrdup(config->name, GFP_KERNEL);
> >         if (!dev->name)
> >                 goto err_str;
> > @@ -1885,6 +1921,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> >         idr_remove(&vduse_idr, dev->minor);
> >  err_idr:
> >         kfree(dev->name);
> > +err_ngroups:
> >  err_str:
> >         vduse_dev_destroy(dev);
> >  err:
> > @@ -2003,13 +2040,18 @@ static struct vduse_mgmt_dev *vduse_mgmt;
> >  static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> >  {
> >         struct vduse_vdpa *vdev;
> > +       __u32 ngroups = 1;
> >         int ret;
> >
> >         if (dev->vdev)
> >                 return -EEXIST;
> >
> > > +       if (dev->api_version >= VDUSE_API_VERSION_1)
> > > +               ngroups = dev->ngroups;
> > +
> >         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > -                                &vduse_vdpa_config_ops, 1, 1, name, true);
> > +                                &vduse_vdpa_config_ops, ngroups, 1, name,
> > +                                true);
> >         if (IS_ERR(vdev))
> >                 return PTR_ERR(vdev);
> >
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index 9a56d0416bfe..a779bcddac58 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -45,7 +45,8 @@ struct vduse_dev_config {
> >         __u64 features;
> >         __u32 vq_num;
> >         __u32 vq_align;
> > -       __u32 reserved[13];
> > +       __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > +       __u32 reserved[12];
> >         __u32 config_size;
> >         __u8 config[];
> >  };
> > @@ -160,6 +161,16 @@ struct vduse_vq_state_packed {
> >         __u16 last_used_idx;
> >  };
> >
> > +/**
> > + * struct vduse_vq_group - virtqueue group
> > > + * @index: Index of the virtqueue
> > > + * @num: Virtqueue group of the virtqueue
> > + */
> > +struct vduse_vq_group {
> > +       __u32 index;
> > +       __u32 num;
> > +};
> > +
> >  /**
> >   * struct vduse_vq_info - information of a virtqueue
> >   * @index: virtqueue index
> > @@ -182,6 +193,7 @@ struct vduse_vq_info {
> >         union {
> >                 struct vduse_vq_state_split split;
> >                 struct vduse_vq_state_packed packed;
> > +               struct vduse_vq_group group;
> >         };
> >         __u8 ready;
> >  };
> > @@ -274,6 +286,7 @@ enum vduse_req_type {
> >         VDUSE_GET_VQ_STATE,
> >         VDUSE_SET_STATUS,
> >         VDUSE_UPDATE_IOTLB,
> > +       VDUSE_GET_VQ_GROUP,
> >  };
> >
> >  /**
> > @@ -328,6 +341,7 @@ struct vduse_dev_request {
> >                 struct vduse_vq_state vq_state;
> >                 struct vduse_dev_status s;
> >                 struct vduse_iova_range iova;
> > +               struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> >                 __u32 padding[32];
>
> This should be [31] otherwise we break the uAPI?
>
> >         };
> >  };
> > @@ -350,6 +364,7 @@ struct vduse_dev_response {
> >         __u32 reserved[4];
> >         union {
> >                 struct vduse_vq_state vq_state;
> > +               struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> >                 __u32 padding[32];
>
> And here?
>

Yes, thank you for both catches!

> Thanks
>
> >         };
> >  };
> > --
> > 2.49.0
> >
>



* Re: [RFC 1/6] vduse: add v1 API definition
  2025-06-09  1:50     ` Jason Wang
@ 2025-06-09  6:10       ` Eugenio Perez Martin
  2025-06-10  8:35         ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-06-09  6:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > This allows to define all functions checking the API version set by the
> > > userland device.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > It might be worth clarifying how it works.
> >
> > For example,
> >
> > > 1) would VDUSE behave differently, or is it just some new ioctls?

I'd like to test more in depth, but a device can just bump the version
ID and then implement the replies to the VDUSE messages. There is no
need to implement new ioctls. If the VDUSE device sets 0 as either the
number of ASIDs or of vq groups, the kernel assumes 1.

But you have a very good point here: I think it is wise to evaluate
short-circuiting these messages in the VDUSE kernel module. If a VDUSE
device only has one vq group and one ASID, the kernel can always
return group 0 and asid 0 for everything, and fail every attempt to
set asid != 0. This way, the update is transparent for the VDUSE
device, and future devices do not need to implement the replies. What
do you think?

> > 2) If VDUSE behaves differently, do we need an ioctl to set the API
> > version for backward compatibility?
>
> Spoke too fast, there's actually a VDUSE_SET_API_VERSION.
>
> I think we need to think if it complicates the migration compatibility or not.
>

Do you mean migration as in "increasing the VDUSE version number", not
"VM live migration from vduse version 0 to vduse version 1"? The
second should not have any problem, but I haven't tested it.

> > 3) If it means a bunch of new ioctls, could userspace just probe the
> > new ioctls instead?
> >
> > Thanks
>
> Thanks
>
> >
> > > ---
> > >  include/uapi/linux/vduse.h | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > index 68a627d04afa..9a56d0416bfe 100644
> > > --- a/include/uapi/linux/vduse.h
> > > +++ b/include/uapi/linux/vduse.h
> > > @@ -10,6 +10,10 @@
> > >
> > >  #define VDUSE_API_VERSION      0
> > >
> > > +/* VQ groups and ASID support */
> > > +
> > > +#define VDUSE_API_VERSION_1    1
> > > +
> > >  /*
> > >   * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
> > >   * This is used for future extension.
> > > --
> > > 2.49.0
> > >
>



* Re: [RFC 1/6] vduse: add v1 API definition
  2025-06-09  6:10       ` Eugenio Perez Martin
@ 2025-06-10  8:35         ` Jason Wang
  2025-08-07 10:55           ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-10  8:35 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > This allows to define all functions checking the API version set by the
> > > > userland device.
> > > >
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >
> > > It might be worth clarifying how it works.
> > >
> > > For example,
> > >
> > > 1) would VDUSE behave differently, or is it just some new ioctls?
>
> I'd like to test more in-depth, but a device can just bump the version
> ID and then implement the replies to the vduse messages. No need to
> implement new ioctls. If the VDUSE device sets 0 in either number of
> ASID or vq groups, the kernel assumes 1.

Right, this is the way we use now and I think maybe we can document
this somewhere.

>
> But you have a very good point here, I think it is wise to evaluate
> the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> device only has one vq group and one ASID, it can always return group
> 0 and asid 0 for everything, and fail every attempt to set asid != 0.

Yes, and vhost-vDPA needs to guard against the misconfiguration.

> This
> way, the update is transparent for the VDUSE device, and future
> devices do not need to implement the reply of these. What do you
> think?

This should work.
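A minimal sketch of that shortcut, with illustrative names (the real
series would hook this into vduse_get_vq_group() and
vduse_set_group_asid() before falling through to vduse_dev_msg_sync()):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for the relevant vduse_dev fields. */
struct vduse_dev_sketch {
	uint32_t ngroups;
	uint32_t nas;
};

/* Single-group device: the answer is always group 0, no need to ask
 * userspace. Returns -1 where the real code would send a message. */
static int32_t sketch_get_vq_group(const struct vduse_dev_sketch *d,
				   uint16_t idx)
{
	if (d->ngroups == 1)
		return 0;
	return -1; /* would forward VDUSE_GET_VQ_GROUP to the device */
}

/* Single-ASID device: only asid 0 can ever succeed. */
static int sketch_set_group_asid(const struct vduse_dev_sketch *d,
				 uint32_t group, uint32_t asid)
{
	if (d->nas == 1)
		return asid == 0 ? 0 : -EINVAL;
	return -1; /* would forward VDUSE_SET_VQ_GROUP_ASID */
}
```

This keeps the update transparent for existing VDUSE devices, which
never see the new messages.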

>
> > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > version for backward compatibility?
> >
> > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> >
> > I think we need to think if it complicates the migration compatibility or not.
> >
>
> Do you mean migration as in "increasing the VDUSE version number",
> not "VM live migration from vduse version 0 to vduse version 1",
> right? The second should not have any problem, but I haven't tested
> it.

I mean if we bump the version, we can't migrate from version 1 to
version 0. Or we can offload this to the management (do we need to
extend the vdpa tool for this)?

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 2/6] vduse: add vq group support
  2025-06-09  6:03     ` Eugenio Perez Martin
@ 2025-06-10  8:53       ` Jason Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2025-06-10  8:53 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Jun 9, 2025 at 2:03 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Jun 9, 2025 at 3:55 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > The virtqueue group is the minimal set of virtqueues that must share an
> > > address space.  And the address space identifier could only be attached
> > > to a specific virtqueue group.  The virtqueue is attached to a
> > > virtqueue group for all the life of the device.
> > >
> > > During vDPA device allocation, the VDUSE device needs to report the
> > > number of virtqueue groups supported. At this moment only vhost_vdpa is
> > > able to do it.
> > >
> > > This helps to isolate the environments for the virtqueue that will not
> > > be assigned directly. E.g in the case of virtio-net, the control
> > > virtqueue will not be assigned directly to guest.
> > >
> > > As we need to back the vq groups with a struct device for the file
> > > operations, let's keep this number as low as possible at the moment: 2.
> > > We can back one VQ group with the vduse device and the other one with
> > > the vdpa device.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 44 +++++++++++++++++++++++++++++-
> > >  include/uapi/linux/vduse.h         | 17 +++++++++++-
> > >  2 files changed, 59 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index 6a9a37351310..6fa687bc4912 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -46,6 +46,11 @@
> > >  #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> > >  #define VDUSE_MSG_DEFAULT_TIMEOUT 30
> > >
> > > +/*
> > > + * Let's make it 2 for simplicity.
> > > + */
> > > +#define VDUSE_MAX_VQ_GROUPS 2
> > > +
> > >  #define IRQ_UNBOUND -1
> > >
> > >  struct vduse_virtqueue {
> > > @@ -114,6 +119,7 @@ struct vduse_dev {
> > >         u8 status;
> > >         u32 vq_num;
> > >         u32 vq_align;
> > > +       u32 ngroups;
> > >         struct vduse_umem *umem;
> > >         struct mutex mem_lock;
> > >         unsigned int bounce_size;
> > > @@ -592,6 +598,25 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > >         return 0;
> > >  }
> > >
> > > +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > > +{
> > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > +       struct vduse_dev_msg msg = { 0 };
> > > +       int ret;
> > > +
> > > +       if (dev->api_version < VDUSE_API_VERSION_1)
> > > +               return 0;
> > > +
> > > +       msg.req.type = VDUSE_GET_VQ_GROUP;
> > > +       msg.req.vq_group.index = idx;
> >
> > Considering there will be a set_group_asid request, could the kernel
> > cache the result so we don't need to bother with requests from
> > userspace?
> >
>
> Yes we can, actually a previous version did it. But what's the use? It
> is not used in the dataplane, so we reduce complexity if we don't
> store it.

It helps to reduce the chance of waiting for userspace. I think it's safer.

For example, we cache device status as well, if needed userspace can
update the status via ioctl().

>
> What's more, in the case of the net device, the vq number -> vq group
> association can change across a reset, as the CVQ is either the last
> vq or vq #2 depending on whether MQ is negotiated. We would need to
> code when to reset this association, so complexity grows even more.
> And the vq groups are not queried by QEMU after that point anyway.

Yes, we can have an array. E.g simulator has something like:

        struct vhost_iotlb *iommu;

Thanks
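For reference, the CVQ index relation Eugenio describes follows from
the virtio-net spec (data vqs come in pairs; the control vq is the vq
right after them), which is why the vq -> vq group association can only
be fixed after feature negotiation. A sketch, not code from the series:

```c
#include <assert.h>
#include <stdint.h>

/* VIRTIO_NET_F_MQ is feature bit 22 in the virtio spec. */
#define VIRTIO_NET_F_MQ (1ULL << 22)

/* With MQ the device exposes max_vq_pairs rx/tx pairs; without it,
 * a single pair. The control vq always follows the data vqs, so it
 * is vq 2 without MQ, or vq 2 * max_vq_pairs with MQ. */
static uint16_t virtio_net_cvq_index(uint64_t features,
				     uint16_t max_vq_pairs)
{
	uint16_t pairs = (features & VIRTIO_NET_F_MQ) ? max_vq_pairs : 1;

	return 2 * pairs; /* rx/tx vqs occupy [0, 2*pairs) */
}
```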


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 3/6] vduse: add vq group asid support
  2025-06-06 11:50 ` [RFC 3/6] vduse: add vq group asid support Eugenio Pérez
@ 2025-06-12  0:30   ` Jason Wang
  2025-06-12  7:24     ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-12  0:30 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Make one IOTLB domain per address space, and allow the driver to assign
> each ASID to a vq group.  Each address space is identified by a
> dedicated identifier (ASID).
>
> During vDPA device allocation, the VDUSE device needs to report the
> number of address spaces supported.  Then the vdpa driver is able to
> configure them.  At this moment only vhost_vdpa is able to do it.
>
> This helps to isolate the environments for the virtqueue that will not
> be assigned directly.  E.g in the case of virtio-net, the control
> virtqueue will not be assigned directly to guest.
>
> TODO: Ideally, umem should not be duplicated.  But it is hard or
> impossible to refactor everything around one single umem.  So should we
> continue with device specifying umem per vq group?

This is a good question.

I think umem should be bound to address space and umem needs to be isolated.

For the issue of complexity, we can simply extend the vduse_iova_umem
to have an asid field. But it looks like it needs more work as:

struct vduse_iova_umem {
        __u64 uaddr;
        __u64 iova;
        __u64 size;
        __u64 reserved[3];
};

Do we have a way to know if the reserved bytes are used or not (as we
are lacking a flag field anyhow...).

So we probably need a new uAPI like vduse_iova_umem_v2 that includes a
flag field at least.

>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 250 +++++++++++++++++++++--------
>  include/uapi/linux/vduse.h         |  38 ++++-
>  2 files changed, 216 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 6fa687bc4912..d51e4f26fe72 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -51,6 +51,11 @@
>   */
>  #define VDUSE_MAX_VQ_GROUPS 2
>
> +/*
> + * Let's make it 2 for simplicity.
> + */
> +#define VDUSE_MAX_ASID 2

Similar to the previous patch, it's better to increase this; otherwise
we need a new uAPI, or userspace has to probe the maximum value once we
decide to change it in the future.

> +
>  #define IRQ_UNBOUND -1
>
>  struct vduse_virtqueue {
> @@ -92,7 +97,7 @@ struct vduse_dev {
>         struct vduse_vdpa *vdev;
>         struct device *dev;
>         struct vduse_virtqueue **vqs;
> -       struct vduse_iova_domain *domain;
> +       struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
>         char *name;
>         struct mutex lock;
>         spinlock_t msg_lock;
> @@ -120,7 +125,8 @@ struct vduse_dev {
>         u32 vq_num;
>         u32 vq_align;
>         u32 ngroups;
> -       struct vduse_umem *umem;
> +       u32 nas;
> +       struct vduse_umem *umem[VDUSE_MAX_ASID];
>         struct mutex mem_lock;
>         unsigned int bounce_size;
>         struct mutex domain_lock;
> @@ -436,11 +442,14 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>  static void vduse_dev_reset(struct vduse_dev *dev)
>  {
>         int i;
> -       struct vduse_iova_domain *domain = dev->domain;
>
>         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> -       if (domain && domain->bounce_map)
> -               vduse_domain_reset_bounce_map(domain);
> +       for (i = 0; i < dev->nas; i++) {
> +               struct vduse_iova_domain *domain = dev->domain[i];
> +
> +               if (domain && domain->bounce_map)
> +                       vduse_domain_reset_bounce_map(domain);
> +       }
>
>         down_write(&dev->rwsem);
>
> @@ -617,6 +626,23 @@ static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
>         return msg.resp.vq_group.num;
>  }
>
> +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> +                               unsigned int asid)
> +{
> +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +       struct vduse_dev_msg msg = { 0 };
> +
> +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> +           group >= dev->ngroups || asid >= dev->nas)
> +               return -EINVAL;
> +
> +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> +       msg.req.vq_group_asid.group = group;
> +       msg.req.vq_group_asid.asid = asid;
> +
> +       return vduse_dev_msg_sync(dev, &msg);
> +}
> +
>  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>                                 struct vdpa_vq_state *state)
>  {
> @@ -788,13 +814,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>         int ret;
>
> -       ret = vduse_domain_set_map(dev->domain, iotlb);
> +       ret = vduse_domain_set_map(dev->domain[asid], iotlb);
>         if (ret)
>                 return ret;
>
>         ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>         if (ret) {
> -               vduse_domain_clear_map(dev->domain, iotlb);
> +               vduse_domain_clear_map(dev->domain[asid], iotlb);
>                 return ret;
>         }
>
> @@ -837,6 +863,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
>         .reset                  = vduse_vdpa_reset,
>         .set_map                = vduse_vdpa_set_map,
> +       .set_group_asid         = vduse_set_group_asid,
>         .free                   = vduse_vdpa_free,
>  };
>
> @@ -845,9 +872,12 @@ static void vduse_dev_sync_single_for_device(struct device *dev,
>                                              enum dma_data_direction dir)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
>
> -       vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> +       for (int i = 0; i < vdev->nas; i++) {
> +               struct vduse_iova_domain *domain = vdev->domain[i];

Interesting, I thought there could be a way to deduce the iova domain
from the dma device, since virtio supports per-virtqueue dma devices
now. For example, I don't see get_vq_dma_dev() implemented in this
patch. But anyhow, maybe we need to revisit this point: the DMA
maintainer asked us to fix the abuse of the dma device, so we won't
need a trick for the dma dev anymore if we design the new mapping API
in the virtio core correctly.

> +
> +               vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> +       }
>  }
>
>  static void vduse_dev_sync_single_for_cpu(struct device *dev,
> @@ -855,9 +885,12 @@ static void vduse_dev_sync_single_for_cpu(struct device *dev,
>                                              enum dma_data_direction dir)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
>
> -       vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> +       for (int i = 0; i < vdev->nas; i++) {
> +               struct vduse_iova_domain *domain = vdev->domain[i];
> +
> +               vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> +       }
>  }
>
>  static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> @@ -866,7 +899,7 @@ static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>                                      unsigned long attrs)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
> +       struct vduse_iova_domain *domain = vdev->domain[0];

Any reason that we need to assume asid 0 in this case?

>
>         return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>  }
> @@ -876,7 +909,7 @@ static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>                                 unsigned long attrs)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
> +       struct vduse_iova_domain *domain = vdev->domain[0];
>
>         return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>  }
> @@ -886,7 +919,7 @@ static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>                                         unsigned long attrs)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
> +       struct vduse_iova_domain *domain = vdev->domain[0];
>         unsigned long iova;
>         void *addr;
>
> @@ -906,17 +939,25 @@ static void vduse_dev_free_coherent(struct device *dev, size_t size,
>                                         unsigned long attrs)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
> +       struct vduse_iova_domain *domain = vdev->domain[0];
>
>         vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>  }
>
> +/* TODO check if this is correct */
>  static size_t vduse_dev_max_mapping_size(struct device *dev)
>  {
>         struct vduse_dev *vdev = dev_to_vduse(dev);
> -       struct vduse_iova_domain *domain = vdev->domain;
> +       size_t max_mapping_size = 0;
> +
> +       for (int i = 0; i < vdev->nas; i++) {
> +               struct vduse_iova_domain *domain = vdev->domain[i];
>
> -       return domain->bounce_size;
> +               if (domain->bounce_size > max_mapping_size)
> +                       max_mapping_size = domain->bounce_size;
> +       }
> +
> +       return max_mapping_size;
>  }
>
>  static const struct dma_map_ops vduse_dev_dma_ops = {
> @@ -1054,31 +1095,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
>         return ret;
>  }
>
> -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
>                                 u64 iova, u64 size)
>  {
>         int ret;
>
>         mutex_lock(&dev->mem_lock);
>         ret = -ENOENT;
> -       if (!dev->umem)
> +       if (!dev->umem[asid])
>                 goto unlock;
>
>         ret = -EINVAL;
> -       if (!dev->domain)
> +       if (!dev->domain[asid])
>                 goto unlock;
>
> -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> +       if (dev->umem[asid]->iova != iova ||
> +           size != dev->domain[asid]->bounce_size)
>                 goto unlock;
>
> -       vduse_domain_remove_user_bounce_pages(dev->domain);
> -       unpin_user_pages_dirty_lock(dev->umem->pages,
> -                                   dev->umem->npages, true);
> -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> -       mmdrop(dev->umem->mm);
> -       vfree(dev->umem->pages);
> -       kfree(dev->umem);
> -       dev->umem = NULL;
> +       vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> +       unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> +                                   dev->umem[asid]->npages, true);
> +       atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> +       mmdrop(dev->umem[asid]->mm);
> +       vfree(dev->umem[asid]->pages);
> +       kfree(dev->umem[asid]);
> +       dev->umem[asid] = NULL;
>         ret = 0;
>  unlock:
>         mutex_unlock(&dev->mem_lock);
> @@ -1086,7 +1128,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
>  }
>
>  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> -                             u64 iova, u64 uaddr, u64 size)
> +                             u32 asid, u64 iova, u64 uaddr, u64 size)
>  {
>         struct page **page_list = NULL;
>         struct vduse_umem *umem = NULL;
> @@ -1094,14 +1136,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>         unsigned long npages, lock_limit;
>         int ret;
>
> -       if (!dev->domain || !dev->domain->bounce_map ||
> -           size != dev->domain->bounce_size ||
> +       if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> +           size != dev->domain[asid]->bounce_size ||
>             iova != 0 || uaddr & ~PAGE_MASK)
>                 return -EINVAL;
>
>         mutex_lock(&dev->mem_lock);
>         ret = -EEXIST;
> -       if (dev->umem)
> +       if (dev->umem[asid])
>                 goto unlock;
>
>         ret = -ENOMEM;
> @@ -1125,7 +1167,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>                 goto out;
>         }
>
> -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> +       ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
>                                                  page_list, pinned);
>         if (ret)
>                 goto out;
> @@ -1138,7 +1180,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>         umem->mm = current->mm;
>         mmgrab(current->mm);
>
> -       dev->umem = umem;
> +       dev->umem[asid] = umem;
>  out:
>         if (ret && pinned > 0)
>                 unpin_user_pages(page_list, pinned);
> @@ -1181,26 +1223,42 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>
>         switch (cmd) {
>         case VDUSE_IOTLB_GET_FD: {
> -               struct vduse_iotlb_entry entry;
> +               struct vduse_iotlb_entry_v2 entry = {};
> +               struct vduse_iotlb_entry entry_old;
>                 struct vhost_iotlb_map *map;
>                 struct vdpa_map_file *map_file;
>                 struct file *f = NULL;
>
>                 ret = -EFAULT;
> -               if (copy_from_user(&entry, argp, sizeof(entry)))
> -                       break;
> +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> +                               break;
> +               } else {
> +                       if (copy_from_user(&entry_old, argp,
> +                                          sizeof(entry_old)))
> +                               break;
> +
> +                       entry.offset = entry_old.offset;
> +                       entry.start = entry_old.start;
> +                       entry.last = entry_old.last;
> +                       entry.perm = entry_old.perm;

I wonder if a new ioctl is needed.

> +               }
>
>                 ret = -EINVAL;
>                 if (entry.start > entry.last)
>                         break;
>
> +               if (entry.asid >= dev->nas)
> +                       break;
> +
>                 mutex_lock(&dev->domain_lock);
> -               if (!dev->domain) {
> +               /* TODO accessing an array with idx from userspace, mitigations? */
> +               if (!dev->domain[entry.asid]) {
>                         mutex_unlock(&dev->domain_lock);
>                         break;
>                 }
> -               spin_lock(&dev->domain->iotlb_lock);
> -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> +               spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> +               map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
>                                               entry.start, entry.last);
>                 if (map) {
>                         map_file = (struct vdpa_map_file *)map->opaque;
> @@ -1210,7 +1268,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>                         entry.last = map->last;
>                         entry.perm = map->perm;
>                 }
> -               spin_unlock(&dev->domain->iotlb_lock);
> +               spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
>                 mutex_unlock(&dev->domain_lock);
>                 ret = -EINVAL;
>                 if (!f)
> @@ -1360,12 +1418,18 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>                         break;
>
>                 ret = -EINVAL;
> +               /* TODO: Using asid from userspace, need to mitigate? */
>                 if (!is_mem_zero((const char *)umem.reserved,
> -                                sizeof(umem.reserved)))
> +                                sizeof(umem.reserved)) ||
> +                   !is_mem_zero((const char *)umem.reserved2,
> +                                sizeof(umem.reserved2)) ||
> +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> +                    umem.asid != 0) ||
> +                    umem.asid >= dev->nas)

This is probably a hint that we need a new uAPI, see my comment for changelog.

>                         break;
>
>                 mutex_lock(&dev->domain_lock);
> -               ret = vduse_dev_reg_umem(dev, umem.iova,
> +               ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
>                                          umem.uaddr, umem.size);
>                 mutex_unlock(&dev->domain_lock);
>                 break;
> @@ -1378,15 +1442,23 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>                         break;
>
>                 ret = -EINVAL;
> +               /* TODO: Using asid from userspace, need to mitigate? */
>                 if (!is_mem_zero((const char *)umem.reserved,
> -                                sizeof(umem.reserved)))
> +                                sizeof(umem.reserved)) ||
> +                   !is_mem_zero((const char *)umem.reserved2,
> +                                sizeof(umem.reserved2)) ||
> +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> +                    umem.asid != 0) ||
> +                    umem.asid >= dev->nas)
>                         break;
> +
>                 mutex_lock(&dev->domain_lock);
> -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> +               ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
>                                            umem.size);
>                 mutex_unlock(&dev->domain_lock);
>                 break;
>         }
> +       /* TODO can we merge this with GET_FD? */
>         case VDUSE_IOTLB_GET_INFO: {
>                 struct vduse_iova_info info;
>                 struct vhost_iotlb_map *map;
> @@ -1399,27 +1471,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>                 if (info.start > info.last)
>                         break;
>
> +               if (info.asid >= dev->nas)
> +                       break;
> +
>                 if (!is_mem_zero((const char *)info.reserved,
>                                  sizeof(info.reserved)))
>                         break;
>
>                 mutex_lock(&dev->domain_lock);
> -               if (!dev->domain) {
> +               /* TODO asid comes from userspace. mitigations? */
> +               if (!dev->domain[info.asid]) {
>                         mutex_unlock(&dev->domain_lock);
>                         break;
>                 }
> -               spin_lock(&dev->domain->iotlb_lock);
> -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> +               spin_lock(&dev->domain[info.asid]->iotlb_lock);
> +               map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
>                                               info.start, info.last);
>                 if (map) {
>                         info.start = map->start;
>                         info.last = map->last;
>                         info.capability = 0;
> -                       if (dev->domain->bounce_map && map->start == 0 &&
> -                           map->last == dev->domain->bounce_size - 1)
> +                       if (dev->domain[info.asid]->bounce_map &&
> +                           map->start == 0 &&
> +                           map->last == dev->domain[info.asid]->bounce_size - 1)
>                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
>                 }
> -               spin_unlock(&dev->domain->iotlb_lock);
> +               spin_unlock(&dev->domain[info.asid]->iotlb_lock);
>                 mutex_unlock(&dev->domain_lock);
>                 if (!map)
>                         break;
> @@ -1444,8 +1521,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
>         struct vduse_dev *dev = file->private_data;
>
>         mutex_lock(&dev->domain_lock);
> -       if (dev->domain)
> -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> +       for (int i = 0; i < dev->nas; i++) {
> +               if (dev->domain[i]) {
> +                       vduse_dev_dereg_umem(dev, i, 0,
> +                                            dev->domain[i]->bounce_size);
> +                       dev->domain[i] = NULL;
> +               }
> +       }
>         mutex_unlock(&dev->domain_lock);
>         spin_lock(&dev->msg_lock);
>         /* Make sure the inflight messages can processed after reconncection */
> @@ -1715,8 +1797,10 @@ static int vduse_destroy_dev(char *name)
>         idr_remove(&vduse_idr, dev->minor);
>         kvfree(dev->config);
>         vduse_dev_deinit_vqs(dev);
> -       if (dev->domain)
> -               vduse_domain_destroy(dev->domain);
> +       for (int i = 0; i < dev->nas; i++) {
> +               if (dev->domain[i])
> +                       vduse_domain_destroy(dev->domain[i]);
> +       }
>         kfree(dev->name);
>         vduse_dev_destroy(dev);
>         module_put(THIS_MODULE);
> @@ -1824,7 +1908,7 @@ static ssize_t bounce_size_store(struct device *device,
>
>         ret = -EPERM;
>         mutex_lock(&dev->domain_lock);
> -       if (dev->domain)
> +       if (dev->domain[0] && dev->domain[1])
>                 goto unlock;
>
>         ret = kstrtouint(buf, 10, &bounce_size);
> @@ -1882,9 +1966,18 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>                                 config->ngroups, VDUSE_MAX_VQ_GROUPS);
>                         goto err_ngroups;
>                 }
> +
> +               if (config->nas > VDUSE_MAX_ASID) {
> +                       pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> +                               config->nas, VDUSE_MAX_ASID);
> +                       goto err_nas;
> +               }
> +
>                 dev->ngroups = config->ngroups ?: 1;
> +               dev->nas = config->nas ?: 1;
>         } else {
>                 dev->ngroups = 1;
> +               dev->nas = 1;
>         }
>         dev->name = kstrdup(config->name, GFP_KERNEL);
>         if (!dev->name)
> @@ -1923,6 +2016,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>         kfree(dev->name);
>  err_ngroups:
>  err_str:
> +err_nas:
>         vduse_dev_destroy(dev);
>  err:
>         return ret;
> @@ -2015,7 +2109,6 @@ static int vduse_open(struct inode *inode, struct file *file)
>         if (!control)
>                 return -ENOMEM;
>
> -       control->api_version = VDUSE_API_VERSION;
>         file->private_data = control;
>
>         return 0;
> @@ -2040,17 +2133,15 @@ static struct vduse_mgmt_dev *vduse_mgmt;
>  static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>  {
>         struct vduse_vdpa *vdev;
> -       __u32 ngroups = 1;
> +       __u32 ngroups = dev->ngroups;
>         int ret;
>
>         if (dev->vdev)
>                 return -EEXIST;
>
> -       if (vdev->dev->api_version >= VDUSE_API_VERSION_1)
> -               ngroups = vdev->dev->ngroups;
> -
> +       /* TODO do we need to store ngroups and nas? vdpa device already store it for us */
>         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> -                                &vduse_vdpa_config_ops, ngroups, 1, name,
> +                                &vduse_vdpa_config_ops, ngroups, dev->nas, name,
>                                  true);
>         if (IS_ERR(vdev))
>                 return PTR_ERR(vdev);
> @@ -2088,11 +2179,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
>                 return ret;
>
>         mutex_lock(&dev->domain_lock);
> -       if (!dev->domain)
> -               dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> -                                                 dev->bounce_size);
> +       ret = 0;
> +
> +       /* TODO we could delay the creation of the domain */
> +       for (int i = 0; i < dev->nas; ++i) {
> +               if (!dev->domain[i])
> +                       dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> +                                                            dev->bounce_size);
> +               if (!dev->domain[i]) {
> +                       ret = -ENOMEM;
> +                       for (int j = 0; j < i; ++j)
> +                               vduse_domain_destroy(dev->domain[j]);
> +                       goto err_domain;
> +               }
> +       }
> +
>         mutex_unlock(&dev->domain_lock);
> -       if (!dev->domain) {
> +       if (ret == -ENOMEM) {
>                 put_device(&dev->vdev->vdpa.dev);
>                 return -ENOMEM;
>         }
> @@ -2101,13 +2204,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
>         if (ret) {
>                 put_device(&dev->vdev->vdpa.dev);
>                 mutex_lock(&dev->domain_lock);
> -               vduse_domain_destroy(dev->domain);
> -               dev->domain = NULL;
> +               for (int i = 0; i < dev->nas; i++) {
> +                       if (dev->domain[i]) {
> +                               vduse_domain_destroy(dev->domain[i]);
> +                               dev->domain[i] = NULL;
> +                       }
> +               }
>                 mutex_unlock(&dev->domain_lock);
>                 return ret;
>         }
>
>         return 0;
> +
> +err_domain:
> +       /* TODO do I need to call put_device? */
> +       mutex_unlock(&dev->domain_lock);
> +       return ret;
>  }
>
>  static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index a779bcddac58..3a17a0b4e938 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -46,7 +46,8 @@ struct vduse_dev_config {
>         __u32 vq_num;
>         __u32 vq_align;
>         __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> -       __u32 reserved[12];
> +       __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> +       __u32 reserved[11];
>         __u32 config_size;
>         __u8 config[];
>  };
> @@ -81,6 +82,17 @@ struct vduse_iotlb_entry {
>         __u8 perm;
>  };
>
> +struct vduse_iotlb_entry_v2 {
> +       __u64 offset;
> +       __u64 start;
> +       __u64 last;
> +       __u32 asid;
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +       __u8 perm;
> +};
> +
>  /*
>   * Find the first IOVA region that overlaps with the range [start, last]
>   * and return the corresponding file descriptor. Return -EINVAL means the
> @@ -171,6 +183,16 @@ struct vduse_vq_group {
>         __u32 num;
>  };
>
> +/**
> + * struct vduse_vq_group_asid - virtqueue group ASID pair
> + * @group: Index of the virtqueue group
> + * @asid: Address space ID of the group
> + */
> +struct vduse_vq_group_asid {
> +       __u32 group;
> +       __u32 asid;
> +};
> +
>  /**
>   * struct vduse_vq_info - information of a virtqueue
>   * @index: virtqueue index
> @@ -231,7 +253,9 @@ struct vduse_vq_eventfd {
>   * @uaddr: start address of userspace memory, it must be aligned to page size
>   * @iova: start of the IOVA region
>   * @size: size of the IOVA region
> + * @asid: Address space ID of the IOVA region
>   * @reserved: for future use, needs to be initialized to zero
> + * @reserved2: for future use, needs to be initialized to zero
>   *
>   * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
>   * ioctls to register/de-register userspace memory for IOVA regions
> @@ -240,7 +264,9 @@ struct vduse_iova_umem {
>         __u64 uaddr;
>         __u64 iova;
>         __u64 size;
> -       __u64 reserved[3];
> +       __u32 asid;
> +       __u32 reserved[1];

Basically, I'm not sure we can assume reserved to be zero for API_VERSION == 1.

> +       __u64 reserved2[2];

Any reasons we can't reuse reserved array?

>  };
>
>  /* Register userspace memory for IOVA regions */
> @@ -264,7 +290,8 @@ struct vduse_iova_info {
>         __u64 last;
>  #define VDUSE_IOVA_CAP_UMEM (1 << 0)
>         __u64 capability;
> -       __u64 reserved[3];
> +       __u64 asid; /* Only if device API version >= 1 */
> +       __u64 reserved[2];

Same here.

>  };
>
>  /*
> @@ -287,6 +314,7 @@ enum vduse_req_type {
>         VDUSE_SET_STATUS,
>         VDUSE_UPDATE_IOTLB,
>         VDUSE_GET_VQ_GROUP,
> +       VDUSE_SET_VQ_GROUP_ASID,
>  };
>
>  /**
> @@ -342,6 +370,8 @@ struct vduse_dev_request {
>                 struct vduse_dev_status s;
>                 struct vduse_iova_range iova;
>                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> +               /* Only if vduse api version >= 1 */
> +               struct vduse_vq_group_asid vq_group_asid;

This seems to break the uAPI as sizeof(struct vduse_dev_request) changes.

>                 __u32 padding[32];
>         };
>  };
> @@ -365,6 +395,8 @@ struct vduse_dev_response {
>         union {
>                 struct vduse_vq_state vq_state;
>                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> +               /* Only if vduse api version >= 1 */
> +               struct vduse_vq_group_asid vq_group_asid;

Same here.

>                 __u32 padding[32];
>         };
>  };
> --
> 2.49.0
>

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/6] vduse: reset group asid in reset
  2025-06-06 11:50 ` [RFC 5/6] vduse: reset group asid in reset Eugenio Pérez
@ 2025-06-12  0:32   ` Jason Wang
  2025-06-12  7:24     ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-12  0:32 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This is the expected behavior with vdpa_sim and mlx5.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>

Should we squash this into patch 3?

Thanks



* Re: [RFC 3/6] vduse: add vq group asid support
  2025-06-12  0:30   ` Jason Wang
@ 2025-06-12  7:24     ` Eugenio Perez Martin
  2025-06-13  1:21       ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-06-12  7:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Thu, Jun 12, 2025 at 2:30 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Make one IOTLB domain per address space, and allow the driver to assign
> > each ASID to a vq group.  Each address space is identified via a
> > dedicated identifier (ASID).
> >
> > During vDPA device allocation, the VDUSE device needs to report the
> > number of address spaces supported.  Then the vdpa driver is able to
> > configure them.  At this moment only vhost_vdpa is able to do it.
> >
> > This helps to isolate the environments for the virtqueue that will not
> > be assigned directly.  E.g in the case of virtio-net, the control
> > virtqueue will not be assigned directly to guest.
> >
> > TODO: Ideally, umem should not be duplicated.  But it is hard or
> > impossible to refactor everything around one single umem.  So should we
> > continue with device specifying umem per vq group?
>
> This is a good question.
>
> I think umem should be bound to address space and umem needs to be isolated.
>
> For the issue of complexity, we can simply extend the vduse_iova_umem
> to have an asid field. But it looks like it needs more work as:
>
> struct vduse_iova_umem {
>         __u64 uaddr;
>         __u64 iova;
>         __u64 size;
>         __u64 reserved[3];
> };
>
> Do we have a way to know if reserved is used or not (as we are lacking
> a flag field anyhow ....).
>

I'd say we should handle this the same way as the rest of the structs:
add the asid field, and if API v1 is negotiated, interpret it as an
ASID. If v1 is not negotiated, it stays reserved.
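As a rough userspace sketch of that convention (struct and helper names here are hypothetical, not part of the proposed uAPI): the 32-bit word that used to be the first reserved word becomes asid once API v1 is negotiated, and the validation is gated on the negotiated version.

```c
#include <stdint.h>

#define VDUSE_API_VERSION_1 1

/* Hypothetical mirror of the proposed vduse_iova_umem layout: the word
 * that used to be reserved becomes asid under API v1. */
struct umem_v1 {
	uint64_t uaddr;
	uint64_t iova;
	uint64_t size;
	uint32_t asid;       /* still "reserved" (must be 0) before v1 */
	uint32_t reserved32;
	uint64_t reserved[2];
};

/* Version-gated check: pre-v1 the asid word must be zero; post-v1 it
 * must name a valid address space. Returns 0 on success, -1 on error. */
static int umem_check(const struct umem_v1 *u, uint32_t api_version,
		      uint32_t nas)
{
	if (api_version < VDUSE_API_VERSION_1)
		return u->asid == 0 ? 0 : -1;
	return u->asid < nas ? 0 : -1;
}
```

Pre-v1 callers that already zero the reserved words keep working unchanged.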

> So we probably need a new uAPI like vduse_iova_umem_v2 that includes a
> flag field at least.
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 250 +++++++++++++++++++++--------
> >  include/uapi/linux/vduse.h         |  38 ++++-
> >  2 files changed, 216 insertions(+), 72 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index 6fa687bc4912..d51e4f26fe72 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -51,6 +51,11 @@
> >   */
> >  #define VDUSE_MAX_VQ_GROUPS 2
> >
> > +/*
> > + * Let's make it 2 for simplicity.
> > + */
> > +#define VDUSE_MAX_ASID 2
>
> Similar to previous patch, it's better to increase this otherwise we
> need new uAPI or it requires the userspace to probe the maximum value
> once we decide to change it in the future.
>

I'm ok with this, but what is a good max value? UINT32_MAX seems excessive?

This requires us to allocate arrays for both vduse_dev->domain and
vduse_dev->umem, so we need to set a reasonable value.
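If the cap were raised substantially, the fixed-size arrays could become per-device allocations sized by nas. A userspace-style sketch of the shape (names hypothetical; the kernel code would use kcalloc() and the real struct vduse_iova_domain):

```c
#include <stdlib.h>

struct iova_domain;   /* opaque stand-in for struct vduse_iova_domain */

struct dev_asids {
	unsigned int nas;
	struct iova_domain **domain;  /* nas slots, domains created lazily */
};

/* Allocate one zeroed domain-pointer slot per address space. */
static int dev_asids_init(struct dev_asids *d, unsigned int nas)
{
	d->domain = calloc(nas, sizeof(*d->domain));
	if (!d->domain)
		return -1;
	d->nas = nas;
	return 0;
}

static void dev_asids_free(struct dev_asids *d)
{
	free(d->domain);
	d->domain = NULL;
	d->nas = 0;
}
```

That trades a compile-time cap for a per-device bound checked against whatever maximum the uAPI advertises.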

> > +
> >  #define IRQ_UNBOUND -1
> >
> >  struct vduse_virtqueue {
> > @@ -92,7 +97,7 @@ struct vduse_dev {
> >         struct vduse_vdpa *vdev;
> >         struct device *dev;
> >         struct vduse_virtqueue **vqs;
> > -       struct vduse_iova_domain *domain;
> > +       struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> >         char *name;
> >         struct mutex lock;
> >         spinlock_t msg_lock;
> > @@ -120,7 +125,8 @@ struct vduse_dev {
> >         u32 vq_num;
> >         u32 vq_align;
> >         u32 ngroups;
> > -       struct vduse_umem *umem;
> > +       u32 nas;
> > +       struct vduse_umem *umem[VDUSE_MAX_ASID];
> >         struct mutex mem_lock;
> >         unsigned int bounce_size;
> >         struct mutex domain_lock;
> > @@ -436,11 +442,14 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >  static void vduse_dev_reset(struct vduse_dev *dev)
> >  {
> >         int i;
> > -       struct vduse_iova_domain *domain = dev->domain;
> >
> >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > -       if (domain && domain->bounce_map)
> > -               vduse_domain_reset_bounce_map(domain);
> > +       for (i = 0; i < dev->nas; i++) {
> > +               struct vduse_iova_domain *domain = dev->domain[i];
> > +
> > +               if (domain && domain->bounce_map)
> > +                       vduse_domain_reset_bounce_map(domain);
> > +       }
> >
> >         down_write(&dev->rwsem);
> >
> > @@ -617,6 +626,23 @@ static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> >         return msg.resp.vq_group.num;
> >  }
> >
> > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > +                               unsigned int asid)
> > +{
> > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +       struct vduse_dev_msg msg = { 0 };
> > +
> > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > +           group >= dev->ngroups || asid >= dev->nas)
> > +               return -EINVAL;
> > +
> > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > +       msg.req.vq_group_asid.group = group;
> > +       msg.req.vq_group_asid.asid = asid;
> > +
> > +       return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >                                 struct vdpa_vq_state *state)
> >  {
> > @@ -788,13 +814,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >         int ret;
> >
> > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > +       ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> >         if (ret)
> >                 return ret;
> >
> >         ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> >         if (ret) {
> > -               vduse_domain_clear_map(dev->domain, iotlb);
> > +               vduse_domain_clear_map(dev->domain[asid], iotlb);
> >                 return ret;
> >         }
> >
> > @@ -837,6 +863,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> >         .reset                  = vduse_vdpa_reset,
> >         .set_map                = vduse_vdpa_set_map,
> > +       .set_group_asid         = vduse_set_group_asid,
> >         .free                   = vduse_vdpa_free,
> >  };
> >
> > @@ -845,9 +872,12 @@ static void vduse_dev_sync_single_for_device(struct device *dev,
> >                                              enum dma_data_direction dir)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> >
> > -       vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > +       for (int i = 0; i < vdev->nas; i++) {
> > +               struct vduse_iova_domain *domain = vdev->domain[i];
>
> Interesting, I thought there could be a way to deduce the iova domain
> from the dma device since virtio supports per virtqueue dma device
> now. For example, I don't see get_vq_dma_dev() implemented in this
> patch.

vhost_vdpa does not interact with it, so it was not needed for this
RFC. For example, the simulator does not implement it either and it
works with ASID.

> But anyhow maybe we need to revisit this point as DMA
> mainatiner ask to fix the abuse of the dma device so we don't need a
> trick for dma dev anymore if we design the new mapping API in virtio
> core correctly.
>
> > +
> > +               vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > +       }
> >  }
> >
> >  static void vduse_dev_sync_single_for_cpu(struct device *dev,
> > @@ -855,9 +885,12 @@ static void vduse_dev_sync_single_for_cpu(struct device *dev,
> >                                              enum dma_data_direction dir)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> >
> > -       vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > +       for (int i = 0; i < vdev->nas; i++) {
> > +               struct vduse_iova_domain *domain = vdev->domain[i];
> > +
> > +               vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > +       }
> >  }
> >
> >  static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > @@ -866,7 +899,7 @@ static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> >                                      unsigned long attrs)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> > +       struct vduse_iova_domain *domain = vdev->domain[0];
>
> Any reason that we need to assume asid 0 in this case?
>

At this moment, virtio_vdpa is the only one using these functions and
it is not able to change ASID. So this must be 0 for the moment. But
yes, this should be better addressed by knowing which vq group this
refers to.

> >
> >         return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> >  }
> > @@ -876,7 +909,7 @@ static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> >                                 unsigned long attrs)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> > +       struct vduse_iova_domain *domain = vdev->domain[0];
> >
> >         return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> >  }
> > @@ -886,7 +919,7 @@ static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> >                                         unsigned long attrs)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> > +       struct vduse_iova_domain *domain = vdev->domain[0];
> >         unsigned long iova;
> >         void *addr;
> >
> > @@ -906,17 +939,25 @@ static void vduse_dev_free_coherent(struct device *dev, size_t size,
> >                                         unsigned long attrs)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> > +       struct vduse_iova_domain *domain = vdev->domain[0];
> >
> >         vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> >  }
> >
> > +/* TODO check if this is correct */
> >  static size_t vduse_dev_max_mapping_size(struct device *dev)
> >  {
> >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > -       struct vduse_iova_domain *domain = vdev->domain;
> > +       size_t max_mapping_size = 0;
> > +
> > +       for (int i = 0; i < vdev->nas; i++) {
> > +               struct vduse_iova_domain *domain = vdev->domain[i];
> >
> > -       return domain->bounce_size;
> > +               if (domain->bounce_size > max_mapping_size)
> > +                       max_mapping_size = domain->bounce_size;
> > +       }
> > +
> > +       return max_mapping_size;
> >  }
> >
> >  static const struct dma_map_ops vduse_dev_dma_ops = {
> > @@ -1054,31 +1095,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> >         return ret;
> >  }
> >
> > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> >                                 u64 iova, u64 size)
> >  {
> >         int ret;
> >
> >         mutex_lock(&dev->mem_lock);
> >         ret = -ENOENT;
> > -       if (!dev->umem)
> > +       if (!dev->umem[asid])
> >                 goto unlock;
> >
> >         ret = -EINVAL;
> > -       if (!dev->domain)
> > +       if (!dev->domain[asid])
> >                 goto unlock;
> >
> > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > +       if (dev->umem[asid]->iova != iova ||
> > +           size != dev->domain[asid]->bounce_size)
> >                 goto unlock;
> >
> > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > -                                   dev->umem->npages, true);
> > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > -       mmdrop(dev->umem->mm);
> > -       vfree(dev->umem->pages);
> > -       kfree(dev->umem);
> > -       dev->umem = NULL;
> > +       vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> > +       unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> > +                                   dev->umem[asid]->npages, true);
> > +       atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> > +       mmdrop(dev->umem[asid]->mm);
> > +       vfree(dev->umem[asid]->pages);
> > +       kfree(dev->umem[asid]);
> > +       dev->umem[asid] = NULL;
> >         ret = 0;
> >  unlock:
> >         mutex_unlock(&dev->mem_lock);
> > @@ -1086,7 +1128,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> >  }
> >
> >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > -                             u64 iova, u64 uaddr, u64 size)
> > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> >  {
> >         struct page **page_list = NULL;
> >         struct vduse_umem *umem = NULL;
> > @@ -1094,14 +1136,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >         unsigned long npages, lock_limit;
> >         int ret;
> >
> > -       if (!dev->domain || !dev->domain->bounce_map ||
> > -           size != dev->domain->bounce_size ||
> > +       if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> > +           size != dev->domain[asid]->bounce_size ||
> >             iova != 0 || uaddr & ~PAGE_MASK)
> >                 return -EINVAL;
> >
> >         mutex_lock(&dev->mem_lock);
> >         ret = -EEXIST;
> > -       if (dev->umem)
> > +       if (dev->umem[asid])
> >                 goto unlock;
> >
> >         ret = -ENOMEM;
> > @@ -1125,7 +1167,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >                 goto out;
> >         }
> >
> > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > +       ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> >                                                  page_list, pinned);
> >         if (ret)
> >                 goto out;
> > @@ -1138,7 +1180,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >         umem->mm = current->mm;
> >         mmgrab(current->mm);
> >
> > -       dev->umem = umem;
> > +       dev->umem[asid] = umem;
> >  out:
> >         if (ret && pinned > 0)
> >                 unpin_user_pages(page_list, pinned);
> > @@ -1181,26 +1223,42 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >
> >         switch (cmd) {
> >         case VDUSE_IOTLB_GET_FD: {
> > -               struct vduse_iotlb_entry entry;
> > +               struct vduse_iotlb_entry_v2 entry = {};
> > +               struct vduse_iotlb_entry entry_old;
> >                 struct vhost_iotlb_map *map;
> >                 struct vdpa_map_file *map_file;
> >                 struct file *f = NULL;
> >
> >                 ret = -EFAULT;
> > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > -                       break;
> > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                               break;
> > +               } else {
> > +                       if (copy_from_user(&entry_old, argp,
> > +                                          sizeof(entry_old)))
> > +                               break;
> > +
> > +                       entry.offset = entry_old.offset;
> > +                       entry.start = entry_old.start;
> > +                       entry.last = entry_old.last;
> > +                       entry.perm = entry_old.perm;
>
> I wonder if a new ioctl is needed.
>

The problem is that struct vduse_iotlb_entry is already used in full;
can we extend the argument size without introducing a new ioctl?
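One mechanical detail worth noting (a userspace sketch with stand-in struct names, not the real uAPI headers): _IOWR() folds sizeof(arg) into the request number. On LP64 ABIs where uint64_t is 8-byte aligned, the asid word of the v2 layout lands in the old struct's tail padding, so both layouts encode the same 32-byte size and hence the same ioctl number:

```c
#include <stdint.h>
#include <sys/ioctl.h>

struct iotlb_entry {        /* stand-in for struct vduse_iotlb_entry */
	uint64_t offset, start, last;
	uint8_t perm;           /* 25 bytes used, padded to 32 */
};

struct iotlb_entry_v2 {     /* stand-in for struct vduse_iotlb_entry_v2 */
	uint64_t offset, start, last;
	uint32_t asid;          /* occupies former tail padding */
	uint8_t perm;           /* 29 bytes used, padded to 32 */
};

/* _IOWR(type, nr, argtype) encodes sizeof(argtype); with equal sizes,
 * the same (type, nr) pair yields the same request number for both. */
```

So at least size-wise the existing request number can carry the v2 layout, as long as the handler dispatches on the negotiated api_version.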

> > +               }
> >
> >                 ret = -EINVAL;
> >                 if (entry.start > entry.last)
> >                         break;
> >
> > +               if (entry.asid >= dev->nas)
> > +                       break;
> > +
> >                 mutex_lock(&dev->domain_lock);
> > -               if (!dev->domain) {
> > +               /* TODO accessing an array with idx from userspace, mitigations? */
> > +               if (!dev->domain[entry.asid]) {
> >                         mutex_unlock(&dev->domain_lock);
> >                         break;
> >                 }
> > -               spin_lock(&dev->domain->iotlb_lock);
> > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > +               spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> > +               map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> >                                               entry.start, entry.last);
> >                 if (map) {
> >                         map_file = (struct vdpa_map_file *)map->opaque;
> > @@ -1210,7 +1268,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >                         entry.last = map->last;
> >                         entry.perm = map->perm;
> >                 }
> > -               spin_unlock(&dev->domain->iotlb_lock);
> > +               spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> >                 mutex_unlock(&dev->domain_lock);
> >                 ret = -EINVAL;
> >                 if (!f)
> > @@ -1360,12 +1418,18 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >                         break;
> >
> >                 ret = -EINVAL;
> > +               /* TODO: Using asid from userspace, need to mitigate? */
> >                 if (!is_mem_zero((const char *)umem.reserved,
> > -                                sizeof(umem.reserved)))
> > +                                sizeof(umem.reserved)) ||
> > +                   !is_mem_zero((const char *)umem.reserved2,
> > +                                sizeof(umem.reserved2)) ||
> > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > +                    umem.asid != 0) ||
> > +                    umem.asid >= dev->nas)
>
> This is probably a hint that we need a new uAPI, see my comment for changelog.
>
> >                         break;
> >
> >                 mutex_lock(&dev->domain_lock);
> > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > +               ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> >                                          umem.uaddr, umem.size);
> >                 mutex_unlock(&dev->domain_lock);
> >                 break;
> > @@ -1378,15 +1442,23 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >                         break;
> >
> >                 ret = -EINVAL;
> > +               /* TODO: Using asid from userspace, need to mitigate? */
> >                 if (!is_mem_zero((const char *)umem.reserved,
> > -                                sizeof(umem.reserved)))
> > +                                sizeof(umem.reserved)) ||
> > +                   !is_mem_zero((const char *)umem.reserved2,
> > +                                sizeof(umem.reserved2)) ||
> > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > +                    umem.asid != 0) ||
> > +                    umem.asid >= dev->nas)
> >                         break;
> > +
> >                 mutex_lock(&dev->domain_lock);
> > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > +               ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> >                                            umem.size);
> >                 mutex_unlock(&dev->domain_lock);
> >                 break;
> >         }
> > +       /* TODO can we merge this with GET_FD? */
> >         case VDUSE_IOTLB_GET_INFO: {
> >                 struct vduse_iova_info info;
> >                 struct vhost_iotlb_map *map;
> > @@ -1399,27 +1471,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >                 if (info.start > info.last)
> >                         break;
> >
> > +               if (info.asid >= dev->nas)
> > +                       break;
> > +
> >                 if (!is_mem_zero((const char *)info.reserved,
> >                                  sizeof(info.reserved)))
> >                         break;
> >
> >                 mutex_lock(&dev->domain_lock);
> > -               if (!dev->domain) {
> > +               /* TODO asid comes from userspace. mitigations? */
> > +               if (!dev->domain[info.asid]) {
> >                         mutex_unlock(&dev->domain_lock);
> >                         break;
> >                 }
> > -               spin_lock(&dev->domain->iotlb_lock);
> > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > +               spin_lock(&dev->domain[info.asid]->iotlb_lock);
> > +               map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> >                                               info.start, info.last);
> >                 if (map) {
> >                         info.start = map->start;
> >                         info.last = map->last;
> >                         info.capability = 0;
> > -                       if (dev->domain->bounce_map && map->start == 0 &&
> > -                           map->last == dev->domain->bounce_size - 1)
> > +                       if (dev->domain[info.asid]->bounce_map &&
> > +                           map->start == 0 &&
> > +                           map->last == dev->domain[info.asid]->bounce_size - 1)
> >                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
> >                 }
> > -               spin_unlock(&dev->domain->iotlb_lock);
> > +               spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> >                 mutex_unlock(&dev->domain_lock);
> >                 if (!map)
> >                         break;
> > @@ -1444,8 +1521,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> >         struct vduse_dev *dev = file->private_data;
> >
> >         mutex_lock(&dev->domain_lock);
> > -       if (dev->domain)
> > -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > +       for (int i = 0; i < dev->nas; i++) {
> > +               if (dev->domain[i]) {
> > +                       vduse_dev_dereg_umem(dev, i, 0,
> > +                                            dev->domain[i]->bounce_size);
> > +                       dev->domain[i] = NULL;
> > +               }
> > +       }
> >         mutex_unlock(&dev->domain_lock);
> >         spin_lock(&dev->msg_lock);
> >         /* Make sure the inflight messages can processed after reconncection */
> > @@ -1715,8 +1797,10 @@ static int vduse_destroy_dev(char *name)
> >         idr_remove(&vduse_idr, dev->minor);
> >         kvfree(dev->config);
> >         vduse_dev_deinit_vqs(dev);
> > -       if (dev->domain)
> > -               vduse_domain_destroy(dev->domain);
> > +       for (int i = 0; i < dev->nas; i++) {
> > +               if (dev->domain[i])
> > +                       vduse_domain_destroy(dev->domain[i]);
> > +       }
> >         kfree(dev->name);
> >         vduse_dev_destroy(dev);
> >         module_put(THIS_MODULE);
> > @@ -1824,7 +1908,7 @@ static ssize_t bounce_size_store(struct device *device,
> >
> >         ret = -EPERM;
> >         mutex_lock(&dev->domain_lock);
> > -       if (dev->domain)
> > +       if (dev->domain[0] && dev->domain[1])
> >                 goto unlock;
> >
> >         ret = kstrtouint(buf, 10, &bounce_size);
> > @@ -1882,9 +1966,18 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> >                                 config->ngroups, VDUSE_MAX_VQ_GROUPS);
> >                         goto err_ngroups;
> >                 }
> > +
> > +               if (config->nas > VDUSE_MAX_ASID) {
> > +                       pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> > +                               config->nas, VDUSE_MAX_ASID);
> > +                       goto err_nas;
> > +               }
> > +
> >                 dev->ngroups = config->ngroups ?: 1;
> > +               dev->nas = config->nas ?: 1;
> >         } else {
> >                 dev->ngroups = 1;
> > +               dev->nas = 1;
> >         }
> >         dev->name = kstrdup(config->name, GFP_KERNEL);
> >         if (!dev->name)
> > @@ -1923,6 +2016,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> >         kfree(dev->name);
> >  err_ngroups:
> >  err_str:
> > +err_nas:
> >         vduse_dev_destroy(dev);
> >  err:
> >         return ret;
> > @@ -2015,7 +2109,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> >         if (!control)
> >                 return -ENOMEM;
> >
> > -       control->api_version = VDUSE_API_VERSION;
> >         file->private_data = control;
> >
> >         return 0;
> > @@ -2040,17 +2133,15 @@ static struct vduse_mgmt_dev *vduse_mgmt;
> >  static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> >  {
> >         struct vduse_vdpa *vdev;
> > -       __u32 ngroups = 1;
> > +       __u32 ngroups = dev->ngroups;
> >         int ret;
> >
> >         if (dev->vdev)
> >                 return -EEXIST;
> >
> > -       if (vdev->dev->api_version >= VDUSE_API_VERSION_1)
> > -               ngroups = vdev->dev->ngroups;
> > -
> > +       /* TODO do we need to store ngroups and nas? vdpa device already store it for us */
> >         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > -                                &vduse_vdpa_config_ops, ngroups, 1, name,
> > +                                &vduse_vdpa_config_ops, ngroups, dev->nas, name,
> >                                  true);
> >         if (IS_ERR(vdev))
> >                 return PTR_ERR(vdev);
> > @@ -2088,11 +2179,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> >                 return ret;
> >
> >         mutex_lock(&dev->domain_lock);
> > -       if (!dev->domain)
> > -               dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > -                                                 dev->bounce_size);
> > +       ret = 0;
> > +
> > +       /* TODO we could delay the creation of the domain */
> > +       for (int i = 0; i < dev->nas; ++i) {
> > +               if (!dev->domain[i])
> > +                       dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > +                                                            dev->bounce_size);
> > +               if (!dev->domain[i]) {
> > +                       ret = -ENOMEM;
> > +                       for (int j = 0; j < i; ++j)
> > +                               vduse_domain_destroy(dev->domain[j]);
> > +                       goto err_domain;
> > +               }
> > +       }
> > +
> >         mutex_unlock(&dev->domain_lock);
> > -       if (!dev->domain) {
> > +       if (ret == -ENOMEM) {
> >                 put_device(&dev->vdev->vdpa.dev);
> >                 return -ENOMEM;
> >         }
> > @@ -2101,13 +2204,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> >         if (ret) {
> >                 put_device(&dev->vdev->vdpa.dev);
> >                 mutex_lock(&dev->domain_lock);
> > -               vduse_domain_destroy(dev->domain);
> > -               dev->domain = NULL;
> > +               for (int i = 0; i < dev->nas; i++) {
> > +                       if (dev->domain[i]) {
> > +                               vduse_domain_destroy(dev->domain[i]);
> > +                               dev->domain[i] = NULL;
> > +                       }
> > +               }
> >                 mutex_unlock(&dev->domain_lock);
> >                 return ret;
> >         }
> >
> >         return 0;
> > +
> > +err_domain:
> > +       /* TODO do I need to call put_device? */
> > +       mutex_unlock(&dev->domain_lock);
> > +       return ret;
> >  }
> >
> >  static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index a779bcddac58..3a17a0b4e938 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -46,7 +46,8 @@ struct vduse_dev_config {
> >         __u32 vq_num;
> >         __u32 vq_align;
> >         __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > -       __u32 reserved[12];
> > +       __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > +       __u32 reserved[11];
> >         __u32 config_size;
> >         __u8 config[];
> >  };
> > @@ -81,6 +82,17 @@ struct vduse_iotlb_entry {
> >         __u8 perm;
> >  };
> >
> > +struct vduse_iotlb_entry_v2 {
> > +       __u64 offset;
> > +       __u64 start;
> > +       __u64 last;
> > +       __u32 asid;
> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > +       __u8 perm;
> > +};
> > +
> >  /*
> >   * Find the first IOVA region that overlaps with the range [start, last]
> >   * and return the corresponding file descriptor. Return -EINVAL means the
> > @@ -171,6 +183,16 @@ struct vduse_vq_group {
> >         __u32 num;
> >  };
> >
> > +/**
> > + * struct vduse_vq_group_asid - ASID of a virtqueue group
> > + * @group: Index of the virtqueue group
> > + * @asid: Address space ID of the group
> > + */
> > +struct vduse_vq_group_asid {
> > +       __u32 group;
> > +       __u32 asid;
> > +};
> > +
> >  /**
> >   * struct vduse_vq_info - information of a virtqueue
> >   * @index: virtqueue index
> > @@ -231,7 +253,9 @@ struct vduse_vq_eventfd {
> >   * @uaddr: start address of userspace memory, it must be aligned to page size
> >   * @iova: start of the IOVA region
> >   * @size: size of the IOVA region
> > + * @asid: Address space ID of the IOVA region
> >   * @reserved: for future use, needs to be initialized to zero
> > + * @reserved2: for future use, needs to be initialized to zero
> >   *
> >   * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> >   * ioctls to register/de-register userspace memory for IOVA regions
> > @@ -240,7 +264,9 @@ struct vduse_iova_umem {
> >         __u64 uaddr;
> >         __u64 iova;
> >         __u64 size;
> > -       __u64 reserved[3];
> > +       __u32 asid;
> > +       __u32 reserved[1];
>
> Basically, I'm not sure we can assume reserved to be zero for API_VERSION == 1.
>

Why not? The ioctl returns -EINVAL if it is not 0, so either a v0 or a
v1 ioctl user must set it to zero if it wants the ioctl to succeed.

> > +       __u64 reserved2[2];
>
> Any reasons we can't reuse reserved array?
>

I don't get this comment; we are reusing it, aren't we? I'm just
splitting the u64[3] into the u32 asid plus u32[1]+u64[2]. Maybe it
would be more elegant to use a u32[5] or similar?

> >  };
> >
> >  /* Register userspace memory for IOVA regions */
> > @@ -264,7 +290,8 @@ struct vduse_iova_info {
> >         __u64 last;
> >  #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> >         __u64 capability;
> > -       __u64 reserved[3];
> > +       __u64 asid; /* Only if device API version >= 1 */
> > +       __u64 reserved[2];
>
> Same here.
>
> >  };
> >
> >  /*
> > @@ -287,6 +314,7 @@ enum vduse_req_type {
> >         VDUSE_SET_STATUS,
> >         VDUSE_UPDATE_IOTLB,
> >         VDUSE_GET_VQ_GROUP,
> > +       VDUSE_SET_VQ_GROUP_ASID,
> >  };
> >
> >  /**
> > @@ -342,6 +370,8 @@ struct vduse_dev_request {
> >                 struct vduse_dev_status s;
> >                 struct vduse_iova_range iova;
> >                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > +               /* Only if vduse api version >= 1 */
> > +               struct vduse_vq_group_asid vq_group_asid;
>
> This seems to break the uAPI as sizeof(struct vduse_dev_request) changes.
>

It should not change: vduse_vq_group_asid is 64 bits, smaller than the
padding, which is 32 bits * 32 = 1024 bits, and they are both members
of the same union.

> >                 __u32 padding[32];
> >         };
> >  };
> > @@ -365,6 +395,8 @@ struct vduse_dev_response {
> >         union {
> >                 struct vduse_vq_state vq_state;
> >                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > +               /* Only if vduse api version >= 1 */
> > +               struct vduse_vq_group_asid vq_group_asid;
>
> Same here.
>

Same reply :).

> >                 __u32 padding[32];
> >         };
> >  };
> > --
> > 2.49.0
> >
>
> Thanks
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/6] vduse: reset group asid in reset
  2025-06-12  0:32   ` Jason Wang
@ 2025-06-12  7:24     ` Eugenio Perez Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-06-12  7:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Thu, Jun 12, 2025 at 2:33 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This is the expected behavior with vdpa_sim and mlx5.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
>
> Should we squash this into patch 3?
>

I agree, I'll do it for the next series. Thanks!


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 3/6] vduse: add vq group asid support
  2025-06-12  7:24     ` Eugenio Perez Martin
@ 2025-06-13  1:21       ` Jason Wang
  2025-08-07 11:10         ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-06-13  1:21 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Thu, Jun 12, 2025 at 3:25 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Jun 12, 2025 at 2:30 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Make one IOTLB domain per address space, and allow the driver to assign
> > > each ASID to a vq group.  Each address space is identified via a
> > > dedicated identifier (ASID).
> > >
> > > During vDPA device allocation, the VDUSE device needs to report the
> > > number of address spaces supported.  Then the vdpa driver is able to
> > > configure them.  At this moment only vhost_vdpa is able to do it.
> > >
> > > This helps to isolate the environments for the virtqueue that will not
> > > be assigned directly.  E.g in the case of virtio-net, the control
> > > virtqueue will not be assigned directly to guest.
> > >
> > > TODO: Ideally, umem should not be duplicated.  But it is hard or
> > > impossible to refactor everything around one single umem.  So should we
> > > continue with device specifying umem per vq group?
> >
> > This is a good question.
> >
> > I think umem should be bound to address space and umem needs to be isolated.
> >
> > For the issue of complexity, we can simply extend the vduse_iova_umem
> > to have an asid field. But it looks like it needs more work as:
> >
> > struct vduse_iova_umem {
> >         __u64 uaddr;
> >         __u64 iova;
> >         __u64 size;
> >         __u64 reserved[3];
> > };
> >
> > Do we have a way to know if reserved is used or not (as we are lacking
> > a flag field anyhow ....).
> >
>
> I'd say that we should work the same way as the rest of the structs:
> We add the asid field, and if the API v1 is negotiated we handle it as
> ASID. If it is not negotiated, it is reserved.

Ok, that makes sense.

>
> > So we probably need a new uAPI like vduse_iova_umem_v2 that includes a
> > flag field at least.
> >
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 250 +++++++++++++++++++++--------
> > >  include/uapi/linux/vduse.h         |  38 ++++-
> > >  2 files changed, 216 insertions(+), 72 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index 6fa687bc4912..d51e4f26fe72 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -51,6 +51,11 @@
> > >   */
> > >  #define VDUSE_MAX_VQ_GROUPS 2
> > >
> > > +/*
> > > + * Let's make it 2 for simplicity.
> > > + */
> > > +#define VDUSE_MAX_ASID 2
> >
> > Similar to previous patch, it's better to increase this otherwise we
> > need new uAPI or it requires the userspace to probe the maximum value
> > once we decide to change it in the future.
> >
>
> I'm ok with this, but what is a good max value? UINT32_MAX seems excessive?

Maybe 64 or 256.

>
> This requires us to allocate arrays for both vduse_dev->domain and
> vduse_dev->umem, so we need to set a reasonable value.

Could we do the allocation based on the userspace provisioning?

>
> > > +
> > >  #define IRQ_UNBOUND -1
> > >
> > >  struct vduse_virtqueue {
> > > @@ -92,7 +97,7 @@ struct vduse_dev {
> > >         struct vduse_vdpa *vdev;
> > >         struct device *dev;
> > >         struct vduse_virtqueue **vqs;
> > > -       struct vduse_iova_domain *domain;
> > > +       struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> > >         char *name;
> > >         struct mutex lock;
> > >         spinlock_t msg_lock;
> > > @@ -120,7 +125,8 @@ struct vduse_dev {
> > >         u32 vq_num;
> > >         u32 vq_align;
> > >         u32 ngroups;
> > > -       struct vduse_umem *umem;
> > > +       u32 nas;
> > > +       struct vduse_umem *umem[VDUSE_MAX_ASID];
> > >         struct mutex mem_lock;
> > >         unsigned int bounce_size;
> > >         struct mutex domain_lock;
> > > @@ -436,11 +442,14 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > >  {
> > >         int i;
> > > -       struct vduse_iova_domain *domain = dev->domain;
> > >
> > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > -       if (domain && domain->bounce_map)
> > > -               vduse_domain_reset_bounce_map(domain);
> > > +       for (i = 0; i < dev->nas; i++) {
> > > +               struct vduse_iova_domain *domain = dev->domain[i];
> > > +
> > > +               if (domain && domain->bounce_map)
> > > +                       vduse_domain_reset_bounce_map(domain);
> > > +       }
> > >
> > >         down_write(&dev->rwsem);
> > >
> > > @@ -617,6 +626,23 @@ static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > >         return msg.resp.vq_group.num;
> > >  }
> > >
> > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > +                               unsigned int asid)
> > > +{
> > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > +       struct vduse_dev_msg msg = { 0 };
> > > +
> > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > +           group >= dev->ngroups || asid >= dev->nas)
> > > +               return -EINVAL;
> > > +
> > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > +       msg.req.vq_group_asid.group = group;
> > > +       msg.req.vq_group_asid.asid = asid;
> > > +
> > > +       return vduse_dev_msg_sync(dev, &msg);
> > > +}
> > > +
> > >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > >                                 struct vdpa_vq_state *state)
> > >  {
> > > @@ -788,13 +814,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > >         int ret;
> > >
> > > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > > +       ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> > >         if (ret)
> > >                 return ret;
> > >
> > >         ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > >         if (ret) {
> > > -               vduse_domain_clear_map(dev->domain, iotlb);
> > > +               vduse_domain_clear_map(dev->domain[asid], iotlb);
> > >                 return ret;
> > >         }
> > >
> > > @@ -837,6 +863,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> > >         .reset                  = vduse_vdpa_reset,
> > >         .set_map                = vduse_vdpa_set_map,
> > > +       .set_group_asid         = vduse_set_group_asid,
> > >         .free                   = vduse_vdpa_free,
> > >  };
> > >
> > > @@ -845,9 +872,12 @@ static void vduse_dev_sync_single_for_device(struct device *dev,
> > >                                              enum dma_data_direction dir)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > >
> > > -       vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > +       for (int i = 0; i < vdev->nas; i++) {
> > > +               struct vduse_iova_domain *domain = vdev->domain[i];
> >
> > Interesting, I thought there could be a way to deduce the iova domain
> > from the dma device since virtio supports per virtqueue dma device
> > now. For example, I don't see get_vq_dma_dev() implemented in this
> > patch.
>
> vhost_vdpa does not interact with it so it was not needed for the RFC.
> For example, the simulator does not implement it either and it works
> with ASID.

Right, the reason for it is:

1) in vdpasim_create() we assign the vDPA device as the DMA device:

vdpasim->vdpa.dma_dev = dev;

2) since the vDPA device itself can't do DMA, it tricks the DMA ops to use a PA
3) we do 1:1 mapping in vringh IOTLB by default to make the PA trick work:

               vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0,
                                      VHOST_MAP_RW);

4) This trick also covers the case where one day virtio_vdpa may use
set_group_asid() as well, as all the address spaces use PA

Note that it is 2) that the DMA maintainer objects to.

It looks different from VDUSE here:

1) VDUSE has its own dma_ops; each AS is backed by a different IOVA
domain as well as a different IOVA range
2) vduse_domain_sync_single_for_device() will trigger unnecessary
IOTLB synchronizations

>
> > But anyhow maybe we need to revisit this point as DMA
> > mainatiner ask to fix the abuse of the dma device so we don't need a
> > trick for dma dev anymore if we design the new mapping API in virtio
> > core correctly.
> >
> > > +
> > > +               vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > +       }
> > >  }
> > >
> > >  static void vduse_dev_sync_single_for_cpu(struct device *dev,
> > > @@ -855,9 +885,12 @@ static void vduse_dev_sync_single_for_cpu(struct device *dev,
> > >                                              enum dma_data_direction dir)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > >
> > > -       vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > +       for (int i = 0; i < vdev->nas; i++) {
> > > +               struct vduse_iova_domain *domain = vdev->domain[i];
> > > +
> > > +               vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > +       }
> > >  }
> > >
> > >  static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > > @@ -866,7 +899,7 @@ static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > >                                      unsigned long attrs)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> >
> > Any reason that we need to assume asid 0 in this case?
> >
>
> At this moment, virtio_vdpa is the only one using these functions and
> it is not able to change ASID. So this must be 0 for the moment. But
> yes, this should be better addressed by knowing what vq group is this
> referring to.

My plan is to

1) introduce map_ops() in the virtio_device()
2) in PCI, it will be converted to DMA ops
3) for vDPA, it will be converted to vDPA specific ops
4) get_vq_dma_dev() will be converted to get_vq_map_token(), so each
vq could provide its own token
5) when the virtio core tries to do mapping, it will pass the per-vq
token as a parameter
6) so VDUSE here can receive a per-virtqueue token that it can map to
a group and then to an ASID

>
> > >
> > >         return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > >  }
> > > @@ -876,7 +909,7 @@ static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > >                                 unsigned long attrs)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > >
> > >         return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > >  }
> > > @@ -886,7 +919,7 @@ static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > >                                         unsigned long attrs)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > >         unsigned long iova;
> > >         void *addr;
> > >
> > > @@ -906,17 +939,25 @@ static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > >                                         unsigned long attrs)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > >
> > >         vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > >  }
> > >
> > > +/* TODO check if this is correct */
> > >  static size_t vduse_dev_max_mapping_size(struct device *dev)
> > >  {
> > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > +       size_t max_mapping_size = 0;
> > > +
> > > +       for (int i = 0; i < vdev->nas; i++) {
> > > +               struct vduse_iova_domain *domain = vdev->domain[i];
> > >
> > > -       return domain->bounce_size;
> > > +               if (domain->bounce_size > max_mapping_size)
> > > +                       max_mapping_size = domain->bounce_size;
> > > +       }
> > > +
> > > +       return max_mapping_size;
> > >  }
> > >
> > >  static const struct dma_map_ops vduse_dev_dma_ops = {
> > > @@ -1054,31 +1095,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > >         return ret;
> > >  }
> > >
> > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > >                                 u64 iova, u64 size)
> > >  {
> > >         int ret;
> > >
> > >         mutex_lock(&dev->mem_lock);
> > >         ret = -ENOENT;
> > > -       if (!dev->umem)
> > > +       if (!dev->umem[asid])
> > >                 goto unlock;
> > >
> > >         ret = -EINVAL;
> > > -       if (!dev->domain)
> > > +       if (!dev->domain[asid])
> > >                 goto unlock;
> > >
> > > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > +       if (dev->umem[asid]->iova != iova ||
> > > +           size != dev->domain[asid]->bounce_size)
> > >                 goto unlock;
> > >
> > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > > -                                   dev->umem->npages, true);
> > > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > -       mmdrop(dev->umem->mm);
> > > -       vfree(dev->umem->pages);
> > > -       kfree(dev->umem);
> > > -       dev->umem = NULL;
> > > +       vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> > > +       unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> > > +                                   dev->umem[asid]->npages, true);
> > > +       atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> > > +       mmdrop(dev->umem[asid]->mm);
> > > +       vfree(dev->umem[asid]->pages);
> > > +       kfree(dev->umem[asid]);
> > > +       dev->umem[asid] = NULL;
> > >         ret = 0;
> > >  unlock:
> > >         mutex_unlock(&dev->mem_lock);
> > > @@ -1086,7 +1128,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > >  }
> > >
> > >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > -                             u64 iova, u64 uaddr, u64 size)
> > > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> > >  {
> > >         struct page **page_list = NULL;
> > >         struct vduse_umem *umem = NULL;
> > > @@ -1094,14 +1136,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >         unsigned long npages, lock_limit;
> > >         int ret;
> > >
> > > -       if (!dev->domain || !dev->domain->bounce_map ||
> > > -           size != dev->domain->bounce_size ||
> > > +       if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> > > +           size != dev->domain[asid]->bounce_size ||
> > >             iova != 0 || uaddr & ~PAGE_MASK)
> > >                 return -EINVAL;
> > >
> > >         mutex_lock(&dev->mem_lock);
> > >         ret = -EEXIST;
> > > -       if (dev->umem)
> > > +       if (dev->umem[asid])
> > >                 goto unlock;
> > >
> > >         ret = -ENOMEM;
> > > @@ -1125,7 +1167,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >                 goto out;
> > >         }
> > >
> > > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > +       ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> > >                                                  page_list, pinned);
> > >         if (ret)
> > >                 goto out;
> > > @@ -1138,7 +1180,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >         umem->mm = current->mm;
> > >         mmgrab(current->mm);
> > >
> > > -       dev->umem = umem;
> > > +       dev->umem[asid] = umem;
> > >  out:
> > >         if (ret && pinned > 0)
> > >                 unpin_user_pages(page_list, pinned);
> > > @@ -1181,26 +1223,42 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >
> > >         switch (cmd) {
> > >         case VDUSE_IOTLB_GET_FD: {
> > > -               struct vduse_iotlb_entry entry;
> > > +               struct vduse_iotlb_entry_v2 entry = {};
> > > +               struct vduse_iotlb_entry entry_old;
> > >                 struct vhost_iotlb_map *map;
> > >                 struct vdpa_map_file *map_file;
> > >                 struct file *f = NULL;
> > >
> > >                 ret = -EFAULT;
> > > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > > -                       break;
> > > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > > +                               break;
> > > +               } else {
> > > +                       if (copy_from_user(&entry_old, argp,
> > > +                                          sizeof(entry_old)))
> > > +                               break;
> > > +
> > > +                       entry.offset = entry_old.offset;
> > > +                       entry.start = entry_old.start;
> > > +                       entry.last = entry_old.last;
> > > +                       entry.perm = entry_old.perm;
> >
> > I wonder if a new ioctl is needed.
> >
>
> The problem is that vduse_iotlb_entry is already used in full; can we
> extend the argument size without introducing a new ioctl?

Yes, I think it should work if VDUSE_API_VERSION_1 makes sense.

>
> > > +               }
> > >
> > >                 ret = -EINVAL;
> > >                 if (entry.start > entry.last)
> > >                         break;
> > >
> > > +               if (entry.asid >= dev->nas)
> > > +                       break;
> > > +
> > >                 mutex_lock(&dev->domain_lock);
> > > -               if (!dev->domain) {
> > > +               /* TODO accessing an array with idx from userspace, mitigations? */
> > > +               if (!dev->domain[entry.asid]) {
> > >                         mutex_unlock(&dev->domain_lock);
> > >                         break;
> > >                 }
> > > -               spin_lock(&dev->domain->iotlb_lock);
> > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > +               spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> > > +               map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> > >                                               entry.start, entry.last);
> > >                 if (map) {
> > >                         map_file = (struct vdpa_map_file *)map->opaque;
> > > @@ -1210,7 +1268,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                         entry.last = map->last;
> > >                         entry.perm = map->perm;
> > >                 }
> > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > +               spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 ret = -EINVAL;
> > >                 if (!f)
> > > @@ -1360,12 +1418,18 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                         break;
> > >
> > >                 ret = -EINVAL;
> > > +               /* TODO: Using asid from userspace, need to mitigate? */
> > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > -                                sizeof(umem.reserved)))
> > > +                                sizeof(umem.reserved)) ||
> > > +                   !is_mem_zero((const char *)umem.reserved2,
> > > +                                sizeof(umem.reserved2)) ||
> > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > +                    umem.asid != 0) ||
> > > +                    umem.asid >= dev->nas)
> >
> > This is probably a hint that we need a new uAPI, see my comment for changelog.
> >
> > >                         break;
> > >
> > >                 mutex_lock(&dev->domain_lock);
> > > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > > +               ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> > >                                          umem.uaddr, umem.size);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 break;
> > > @@ -1378,15 +1442,23 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                         break;
> > >
> > >                 ret = -EINVAL;
> > > +               /* TODO: Using asid from userspace, need to mitigate? */
> > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > -                                sizeof(umem.reserved)))
> > > +                                sizeof(umem.reserved)) ||
> > > +                   !is_mem_zero((const char *)umem.reserved2,
> > > +                                sizeof(umem.reserved2)) ||
> > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > +                    umem.asid != 0) ||
> > > +                    umem.asid >= dev->nas)
> > >                         break;
> > > +
> > >                 mutex_lock(&dev->domain_lock);
> > > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > +               ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> > >                                            umem.size);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 break;
> > >         }
> > > +       /* TODO can we merge this with GET_FD? */
> > >         case VDUSE_IOTLB_GET_INFO: {
> > >                 struct vduse_iova_info info;
> > >                 struct vhost_iotlb_map *map;
> > > @@ -1399,27 +1471,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                 if (info.start > info.last)
> > >                         break;
> > >
> > > +               if (info.asid >= dev->nas)
> > > +                       break;
> > > +
> > >                 if (!is_mem_zero((const char *)info.reserved,
> > >                                  sizeof(info.reserved)))
> > >                         break;
> > >
> > >                 mutex_lock(&dev->domain_lock);
> > > -               if (!dev->domain) {
> > > +               /* TODO asid comes from userspace. mitigations? */
> > > +               if (!dev->domain[info.asid]) {
> > >                         mutex_unlock(&dev->domain_lock);
> > >                         break;
> > >                 }
> > > -               spin_lock(&dev->domain->iotlb_lock);
> > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > +               spin_lock(&dev->domain[info.asid]->iotlb_lock);
> > > +               map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> > >                                               info.start, info.last);
> > >                 if (map) {
> > >                         info.start = map->start;
> > >                         info.last = map->last;
> > >                         info.capability = 0;
> > > -                       if (dev->domain->bounce_map && map->start == 0 &&
> > > -                           map->last == dev->domain->bounce_size - 1)
> > > +                       if (dev->domain[info.asid]->bounce_map &&
> > > +                           map->start == 0 &&
> > > +                           map->last == dev->domain[info.asid]->bounce_size - 1)
> > >                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
> > >                 }
> > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > +               spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 if (!map)
> > >                         break;
> > > @@ -1444,8 +1521,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > >         struct vduse_dev *dev = file->private_data;
> > >
> > >         mutex_lock(&dev->domain_lock);
> > > -       if (dev->domain)
> > > -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > > +       for (int i = 0; i < dev->nas; i++) {
> > > +               if (dev->domain[i]) {
> > > +                       vduse_dev_dereg_umem(dev, i, 0,
> > > +                                            dev->domain[i]->bounce_size);
> > > +                       dev->domain[i] = NULL;
> > > +               }
> > > +       }
> > >         mutex_unlock(&dev->domain_lock);
> > >         spin_lock(&dev->msg_lock);
> > >         /* Make sure the inflight messages can processed after reconncection */
> > > @@ -1715,8 +1797,10 @@ static int vduse_destroy_dev(char *name)
> > >         idr_remove(&vduse_idr, dev->minor);
> > >         kvfree(dev->config);
> > >         vduse_dev_deinit_vqs(dev);
> > > -       if (dev->domain)
> > > -               vduse_domain_destroy(dev->domain);
> > > +       for (int i = 0; i < dev->nas; i++) {
> > > +               if (dev->domain[i])
> > > +                       vduse_domain_destroy(dev->domain[i]);
> > > +       }
> > >         kfree(dev->name);
> > >         vduse_dev_destroy(dev);
> > >         module_put(THIS_MODULE);
> > > @@ -1824,7 +1908,7 @@ static ssize_t bounce_size_store(struct device *device,
> > >
> > >         ret = -EPERM;
> > >         mutex_lock(&dev->domain_lock);
> > > -       if (dev->domain)
> > > +       if (dev->domain[0] && dev->domain[1])
> > >                 goto unlock;
> > >
> > >         ret = kstrtouint(buf, 10, &bounce_size);
> > > @@ -1882,9 +1966,18 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > >                                 config->ngroups, VDUSE_MAX_VQ_GROUPS);
> > >                         goto err_ngroups;
> > >                 }
> > > +
> > > +               if (config->nas > VDUSE_MAX_ASID) {
> > > +                       pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> > > +                               config->nas, VDUSE_MAX_ASID);
> > > +                       goto err_nas;
> > > +               }
> > > +
> > >                 dev->ngroups = config->ngroups ?: 1;
> > > +               dev->nas = config->nas ?: 1;
> > >         } else {
> > >                 dev->ngroups = 1;
> > > +               dev->nas = 1;
> > >         }
> > >         dev->name = kstrdup(config->name, GFP_KERNEL);
> > >         if (!dev->name)
> > > @@ -1923,6 +2016,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > >         kfree(dev->name);
> > >  err_ngroups:
> > >  err_str:
> > > +err_nas:
> > >         vduse_dev_destroy(dev);
> > >  err:
> > >         return ret;
> > > @@ -2015,7 +2109,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> > >         if (!control)
> > >                 return -ENOMEM;
> > >
> > > -       control->api_version = VDUSE_API_VERSION;
> > >         file->private_data = control;
> > >
> > >         return 0;
> > > @@ -2040,17 +2133,15 @@ static struct vduse_mgmt_dev *vduse_mgmt;
> > >  static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > >  {
> > >         struct vduse_vdpa *vdev;
> > > -       __u32 ngroups = 1;
> > > +       __u32 ngroups = dev->ngroups;
> > >         int ret;
> > >
> > >         if (dev->vdev)
> > >                 return -EEXIST;
> > >
> > > -       if (vdev->dev->api_version >= VDUSE_API_VERSION_1)
> > > -               ngroups = vdev->dev->ngroups;
> > > -
> > > +       /* TODO do we need to store ngroups and nas? vdpa device already store it for us */
> > >         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > > -                                &vduse_vdpa_config_ops, ngroups, 1, name,
> > > +                                &vduse_vdpa_config_ops, ngroups, dev->nas, name,
> > >                                  true);
> > >         if (IS_ERR(vdev))
> > >                 return PTR_ERR(vdev);
> > > @@ -2088,11 +2179,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > >                 return ret;
> > >
> > >         mutex_lock(&dev->domain_lock);
> > > -       if (!dev->domain)
> > > -               dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > -                                                 dev->bounce_size);
> > > +       ret = 0;
> > > +
> > > +       /* TODO we could delay the creation of the domain */
> > > +       for (int i = 0; i < dev->nas; ++i) {
> > > +               if (!dev->domain[i])
> > > +                       dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > +                                                            dev->bounce_size);
> > > +               if (!dev->domain[i]) {
> > > +                       ret = -ENOMEM;
> > > +                       for (int j = 0; j < i; ++j)
> > > +                               vduse_domain_destroy(dev->domain[j]);
> > > +                       goto err_domain;
> > > +               }
> > > +       }
> > > +
> > >         mutex_unlock(&dev->domain_lock);
> > > -       if (!dev->domain) {
> > > +       if (ret == -ENOMEM) {
> > >                 put_device(&dev->vdev->vdpa.dev);
> > >                 return -ENOMEM;
> > >         }
> > > @@ -2101,13 +2204,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > >         if (ret) {
> > >                 put_device(&dev->vdev->vdpa.dev);
> > >                 mutex_lock(&dev->domain_lock);
> > > -               vduse_domain_destroy(dev->domain);
> > > -               dev->domain = NULL;
> > > +               for (int i = 0; i < dev->nas; i++) {
> > > +                       if (dev->domain[i]) {
> > > +                               vduse_domain_destroy(dev->domain[i]);
> > > +                               dev->domain[i] = NULL;
> > > +                       }
> > > +               }
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 return ret;
> > >         }
> > >
> > >         return 0;
> > > +
> > > +err_domain:
> > > +       /* TODO do I need to call put_device? */
> > > +       mutex_unlock(&dev->domain_lock);
> > > +       return ret;
> > >  }
> > >
> > >  static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > index a779bcddac58..3a17a0b4e938 100644
> > > --- a/include/uapi/linux/vduse.h
> > > +++ b/include/uapi/linux/vduse.h
> > > @@ -46,7 +46,8 @@ struct vduse_dev_config {
> > >         __u32 vq_num;
> > >         __u32 vq_align;
> > >         __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > > -       __u32 reserved[12];
> > > +       __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > > +       __u32 reserved[11];
> > >         __u32 config_size;
> > >         __u8 config[];
> > >  };
> > > @@ -81,6 +82,17 @@ struct vduse_iotlb_entry {
> > >         __u8 perm;
> > >  };
> > >
> > > +struct vduse_iotlb_entry_v2 {
> > > +       __u64 offset;
> > > +       __u64 start;
> > > +       __u64 last;
> > > +       __u32 asid;
> > > +#define VDUSE_ACCESS_RO 0x1
> > > +#define VDUSE_ACCESS_WO 0x2
> > > +#define VDUSE_ACCESS_RW 0x3
> > > +       __u8 perm;
> > > +};
> > > +
> > >  /*
> > >   * Find the first IOVA region that overlaps with the range [start, last]
> > >   * and return the corresponding file descriptor. Return -EINVAL means the
> > > @@ -171,6 +183,16 @@ struct vduse_vq_group {
> > >         __u32 num;
> > >  };
> > >
> > > +/**
> > > + * struct vduse_vq_group_asid - ASID of a virtqueue group
> > > + * @group: Index of the virtqueue group
> > > + * @asid: Address space ID of the group
> > > + */
> > > +struct vduse_vq_group_asid {
> > > +       __u32 group;
> > > +       __u32 asid;
> > > +};
> > > +
> > >  /**
> > >   * struct vduse_vq_info - information of a virtqueue
> > >   * @index: virtqueue index
> > > @@ -231,7 +253,9 @@ struct vduse_vq_eventfd {
> > >   * @uaddr: start address of userspace memory, it must be aligned to page size
> > >   * @iova: start of the IOVA region
> > >   * @size: size of the IOVA region
> > > + * @asid: Address space ID of the IOVA region
> > >   * @reserved: for future use, needs to be initialized to zero
> > > + * @reserved2: for future use, needs to be initialized to zero
> > >   *
> > >   * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > >   * ioctls to register/de-register userspace memory for IOVA regions
> > > @@ -240,7 +264,9 @@ struct vduse_iova_umem {
> > >         __u64 uaddr;
> > >         __u64 iova;
> > >         __u64 size;
> > > -       __u64 reserved[3];
> > > +       __u32 asid;
> > > +       __u32 reserved[1];
> >
> > Basically, I'm not sure we can assume reserved to be zero for API_VERSION == 1.
> >
>
> Why not? The ioctl returns -EINVAL if it is not 0, so any user of the
> ioctl, v0 or v1, must set it to 0 if it wants the ioctl to return success.

Typo, I meant API_VERSION == 0. The field is reserved there, but the kernel
doesn't mandate it to be zero. It probably doesn't matter, though, as it will
be governed by API_VERSION as you explain above.
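
I.e., the gating could be sketched like this in user space (toy layout with
one flat u32 array instead of the patch's u32[1]+u64[2] split; the struct and
helper names are stand-ins, not the real uAPI handling):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VDUSE_API_VERSION_1 1

/* Toy layout: the 24 reserved bytes of the v0 struct, with the first
 * u32 reinterpreted as asid under v1 (flattened to a single u32 array
 * for simplicity; not the exact split used in the patch). */
struct vduse_iova_umem_v1 {
	uint64_t uaddr;
	uint64_t iova;
	uint64_t size;
	uint32_t asid;		/* still reserved bytes under v0 */
	uint32_t reserved[5];	/* must be zero in both versions */
};

static int check_umem(unsigned int api_version,
		      const struct vduse_iova_umem_v1 *u)
{
	for (int i = 0; i < 5; i++)
		if (u->reserved[i])
			return -EINVAL;
	/* under v0 the asid bytes are reserved too, so they must be zero */
	if (api_version < VDUSE_API_VERSION_1 && u->asid)
		return -EINVAL;
	return 0;
}
```

So the same on-disk bytes stay "reserved and must be zero" for a v0 user,
and only become meaningful once v1 has been negotiated.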

>
> > > +       __u64 reserved2[2];
> >
> > Any reasons we can't reuse reserved array?
> >
>
> I don't get this comment; we're reusing it, aren't we? I'm just
> splitting the u64[3] into u32[1]+u64[2]. Maybe it would be more elegant
> to use a u32[5] or similar?

Yes, I mean reuse the u32 array. (but not a must I guess unless the
ABI is stable).

>
> > >  };
> > >
> > >  /* Register userspace memory for IOVA regions */
> > > @@ -264,7 +290,8 @@ struct vduse_iova_info {
> > >         __u64 last;
> > >  #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> > >         __u64 capability;
> > > -       __u64 reserved[3];
> > > +       __u64 asid; /* Only if device API version >= 1 */
> > > +       __u64 reserved[2];
> >
> > Same here.
> >
> > >  };
> > >
> > >  /*
> > > @@ -287,6 +314,7 @@ enum vduse_req_type {
> > >         VDUSE_SET_STATUS,
> > >         VDUSE_UPDATE_IOTLB,
> > >         VDUSE_GET_VQ_GROUP,
> > > +       VDUSE_SET_VQ_GROUP_ASID,
> > >  };
> > >
> > >  /**
> > > @@ -342,6 +370,8 @@ struct vduse_dev_request {
> > >                 struct vduse_dev_status s;
> > >                 struct vduse_iova_range iova;
> > >                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > +               /* Only if vduse api version >= 1 */
> > > +               struct vduse_vq_group_asid vq_group_asid;
> >
> > This seems to break the uAPI as sizeof(struct vduse_dev_request) changes.
> >
>
> It should not change. vduse_vq_group is 64 bits, smaller than padding
> which is 32bits*32 = 1024bits. And they're both in the same union.

You're right.

>
> > >                 __u32 padding[32];
> > >         };
> > >  };
> > > @@ -365,6 +395,8 @@ struct vduse_dev_response {
> > >         union {
> > >                 struct vduse_vq_state vq_state;
> > >                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > +               /* Only if vduse api version >= 1 */
> > > +               struct vduse_vq_group_asid vq_group_asid;
> >
> > Same here.
> >
>
> Same reply :).

Yes.

>
> > >                 __u32 padding[32];
> > >         };
> > >  };
> > > --
> > > 2.49.0
> > >
> >
> > Thanks
> >
>

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/6] Add multiple address spaces support to VDUSE
  2025-06-09  5:51 ` [RFC 0/6] Add multiple address spaces support to VDUSE Christoph Hellwig
@ 2025-06-20  6:25   ` Eugenio Perez Martin
  2025-06-23  5:09     ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-06-20  6:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jasowang, Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Jun 9, 2025 at 7:51 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Before you add new features to vduse, please remove the broken abuse of
> the DMA API first.  Without that no new feature work should go into
> this code.
>

Hi Christoph,

This code does not touch the DMA API at all. Actually, we could even
remove all the DMA code and these changes would work the same. Can't
we just continue with this series and remove the abuse of the DMA API
in parallel?

If you detect broken uses of the DMA API I've missed please let me
know and I'll be happy to remove or change them.

Thanks!



* Re: [RFC 0/6] Add multiple address spaces support to VDUSE
  2025-06-20  6:25   ` Eugenio Perez Martin
@ 2025-06-23  5:09     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2025-06-23  5:09 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Christoph Hellwig, jasowang, Yongji Xie, Cindy Lu, linux-kernel,
	Stefano Garzarella, Stefan Hajnoczi, Maxime Coquelin,
	Michael S. Tsirkin, virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Jun 20, 2025 at 08:25:50AM +0200, Eugenio Perez Martin wrote:
> This code does not touch the DMA API at all. Actually, we could even
> remove all the DMA code and these changes would work the same. Can't
> we just continue with this series and remove the abuse of the DMA API
> in parallel?

But it touches vduse.  And vduse is marked broken because it still has
this long-standing issue.  You don't add "enhancements" to code marked
BROKEN before fixing that.



* Re: [RFC 1/6] vduse: add v1 API definition
  2025-06-10  8:35         ` Jason Wang
@ 2025-08-07 10:55           ` Eugenio Perez Martin
  2025-08-08  0:50             ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-08-07 10:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > This allows to define all functions checking the API version set by the
> > > > > userland device.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > >
> > > > It might be worth clarifying how it works.
> > > >
> > > > For example,
> > > >
> > > > 1) would VDUSE behave differently or if it's just some new ioctls
> >
> > I'd like to test more in-depth, but a device can just bump the version
> > ID and then implement the replies to the vduse messages. No need to
> > implement new ioctls. If the VDUSE device sets 0 in either number of
> > ASID or vq groups, the kernel assumes 1.
>
> Right, this is the way we use now and I think maybe we can document
> this somewhere.
>
> >
> > But you have a very good point here; I think it is wise to evaluate
> > short-circuiting these messages in the VDUSE kernel module. If a VDUSE
> > device only has one vq group and one ASID, the kernel can always return
> > group 0 and asid 0 for everything, and fail every attempt to set asid != 0.
>
> Yes, and vhost-vDPA needs to guard against the misconfiguration.
>
> > This
> > way, the update is transparent for the VDUSE device, and future
> > devices do not need to implement the reply of these. What do you
> > think?
>
> This should work.
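
For the record, a mocked user-space sketch of that shortcut (only the
ngroups/nas fields mirror the patch; the forwarding helper is a stand-in
for the real vduse_dev_msg_sync() path):

```c
#include <assert.h>
#include <errno.h>

struct vduse_dev {
	unsigned int ngroups;
	unsigned int nas;
};

static int forward_to_userspace_called;

/* Stand-in for vduse_dev_msg_sync(): a multi-group device would get the
 * request forwarded to the userspace device here. */
static int forward_to_userspace(struct vduse_dev *dev)
{
	(void)dev;
	forward_to_userspace_called++;
	return 0;
}

static unsigned int vduse_get_vq_group(struct vduse_dev *dev, unsigned int idx)
{
	(void)idx;
	if (dev->ngroups == 1)
		return 0;	/* only one possible answer: group 0 */
	forward_to_userspace(dev);
	return 0;		/* placeholder for the device's actual reply */
}

static int vduse_set_group_asid(struct vduse_dev *dev, unsigned int group,
				unsigned int asid)
{
	if (group >= dev->ngroups || asid >= dev->nas)
		return -EINVAL;
	if (dev->nas == 1)
		return 0;	/* asid 0 -> asid 0 is a no-op */
	return forward_to_userspace(dev);
}
```

A device that provisions ngroups == 1 and nas == 1 then never sees these
messages at all, which keeps the update transparent for existing devices.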
>
> >
> > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > version for backward compatibility?
> > >
> > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > >
> > > I think we need to think if it complicates the migration compatibility or not.
> > >
> >
> > Do you mean migration as "increase the VDUSE version number", not "VM
> > live migration from vduse version 0 to vduse version 1", isn't it? The
> > second should not have any problem but I haven't tested it.
>
> I mean if we bump the version, we can't migrate from version 1 to
> version 0. Or we can offload this to the management (do we need to
> extend the vdpa tool for this)?
>

I just noticed I left this unanswered. But I still do not get what
"migrate" means here :).

If "migrate" means running current VDUSE devices on a kernel with this
series applied: these devices don't set the V1 API, so they get one vq
group and one asid. I'm actually testing this with my libfuse+VDUSE
modifications, which don't use V1 at all. I'm adding this explanation to
the patch, as it is a very good point indeed.

If it means to migrate a guest from using a V1 VDUSE device to a V0
device "it should work", as it is just a backend implementation
detail. If we migrate from or to a vdpa device backed by hardware, for
example, one of the devices does not even have the concept of VDUSE
API version.

In the case of net, it does not work at the moment because the only
way to set features like mq is through the shadow CVQ. If a VDUSE net
device implements, let's say, an admin vq (something similar to PF and
VF) and a dirty bitmap, I guess it should be possible. Maybe it is
easier to play with this with block devices.



* Re: [RFC 3/6] vduse: add vq group asid support
  2025-06-13  1:21       ` Jason Wang
@ 2025-08-07 11:10         ` Eugenio Perez Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-08-07 11:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Jun 13, 2025 at 3:21 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Jun 12, 2025 at 3:25 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Thu, Jun 12, 2025 at 2:30 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > Make one IOTLB domain per address space, and allow the driver to assign
> > > > each ASID to a vq group.  Each address space via an dedicated identifier
> > > > (ASID).
> > > >
> > > > During vDPA device allocation, the VDUSE device needs to report the
> > > > number of address spaces supported.  Then the vdpa driver is able to
> > > > configure them.  At this moment only vhost_vdpa is able to do it.
> > > >
> > > > This helps to isolate the environments for the virtqueue that will not
> > > > be assigned directly.  E.g in the case of virtio-net, the control
> > > > virtqueue will not be assigned directly to guest.
> > > >
> > > > TODO: Ideally, umem should not be duplicated.  But it is hard or
> > > > impossible to refactor everything around one single umem.  So should we
> > > > continue with device specifying umem per vq group?
> > >
> > > This is a good question.
> > >
> > > I think umem should be bound to address space and umem needs to be isolated.
> > >
> > > For the issue of complexity, we can simply extend the vduse_iova_umem
> > > to have an asid field. But it looks like it needs more work as:
> > >
> > > struct vduse_iova_umem {
> > >         __u64 uaddr;
> > >         __u64 iova;
> > >         __u64 size;
> > >         __u64 reserved[3];
> > > };
> > >
> > > Do we have a way to know if reserved is used or not (as we are lacking
> > > a flag field anyhow ....).
> > >
> >
> > I'd say that we should work the same way as the rest of the structs:
> > We add the asid field, and if the API v1 is negotiated we handle it as
> > ASID. If it is not negotiated, it is reserved.
>
> Ok, that makes sense.
>
> >
> > > So we probably need a new uAPI like vduse_iova_umem_v2 that includes a
> > > flag field at least.
> > >
> > > >
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 250 +++++++++++++++++++++--------
> > > >  include/uapi/linux/vduse.h         |  38 ++++-
> > > >  2 files changed, 216 insertions(+), 72 deletions(-)
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > index 6fa687bc4912..d51e4f26fe72 100644
> > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > @@ -51,6 +51,11 @@
> > > >   */
> > > >  #define VDUSE_MAX_VQ_GROUPS 2
> > > >
> > > > +/*
> > > > + * Let's make it 2 for simplicity.
> > > > + */
> > > > +#define VDUSE_MAX_ASID 2
> > >
> > > Similar to previous patch, it's better to increase this otherwise we
> > > need new uAPI or it requires the userspace to probe the maximum value
> > > once we decide to change it in the future.
> > >
> >
> > I'm ok with this, but what is a good max value? UINT32_MAX seems excessive?
>
> Maybe 64 or 256.
>
> >
> > This requires us to allocate arrays for both vduse_dev->domain and
> > vduse_dev->umem, so we need to set a reasonable value.
>
> Could we do the allocation based on the userspace privionsing?
>

Oh yes, my point is that we need to set a hard maximum so userland
cannot make the kernel allocate an array of uint32_t[UINT32_MAX], for
example. I guess 64 is reasonable. Is it worth dynamically allocating a
whole array up to this maximum, or is it better to make it just part of
the struct? I'm happy with both solutions :).
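
To make the tradeoff concrete, a user-space sketch of the dynamic variant
(the cap of 64 and the helper name are assumptions; calloc() stands in for
the kernel's kvcalloc()):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define VDUSE_MAX_ASID 64	/* hard cap under discussion; 64 is assumed */

struct vduse_iova_domain {
	unsigned long bounce_size;
};

struct vduse_dev {
	unsigned int nas;			/* provisioned by userspace */
	struct vduse_iova_domain **domain;	/* nas entries, not VDUSE_MAX_ASID */
};

/* Allocate only what the device asked for, after checking the hard cap.
 * calloc() stands in for the kernel's kvcalloc(). */
static int vduse_alloc_domains(struct vduse_dev *dev, unsigned int nas)
{
	if (nas == 0 || nas > VDUSE_MAX_ASID)
		return -EINVAL;
	dev->nas = nas;
	dev->domain = calloc(nas, sizeof(*dev->domain));
	return dev->domain ? 0 : -ENOMEM;
}
```

The fixed-array alternative wastes VDUSE_MAX_ASID pointers per device but
avoids the extra allocation and error path.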

> >
> > > > +
> > > >  #define IRQ_UNBOUND -1
> > > >
> > > >  struct vduse_virtqueue {
> > > > @@ -92,7 +97,7 @@ struct vduse_dev {
> > > >         struct vduse_vdpa *vdev;
> > > >         struct device *dev;
> > > >         struct vduse_virtqueue **vqs;
> > > > -       struct vduse_iova_domain *domain;
> > > > +       struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> > > >         char *name;
> > > >         struct mutex lock;
> > > >         spinlock_t msg_lock;
> > > > @@ -120,7 +125,8 @@ struct vduse_dev {
> > > >         u32 vq_num;
> > > >         u32 vq_align;
> > > >         u32 ngroups;
> > > > -       struct vduse_umem *umem;
> > > > +       u32 nas;
> > > > +       struct vduse_umem *umem[VDUSE_MAX_ASID];
> > > >         struct mutex mem_lock;
> > > >         unsigned int bounce_size;
> > > >         struct mutex domain_lock;
> > > > @@ -436,11 +442,14 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > >  {
> > > >         int i;
> > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > >
> > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > -       if (domain && domain->bounce_map)
> > > > -               vduse_domain_reset_bounce_map(domain);
> > > > +       for (i = 0; i < dev->nas; i++) {
> > > > +               struct vduse_iova_domain *domain = dev->domain[i];
> > > > +
> > > > +               if (domain && domain->bounce_map)
> > > > +                       vduse_domain_reset_bounce_map(domain);
> > > > +       }
> > > >
> > > >         down_write(&dev->rwsem);
> > > >
> > > > @@ -617,6 +626,23 @@ static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > > >         return msg.resp.vq_group.num;
> > > >  }
> > > >
> > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > +                               unsigned int asid)
> > > > +{
> > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > +       struct vduse_dev_msg msg = { 0 };
> > > > +
> > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > +           group >= dev->ngroups || asid >= dev->nas)
> > > > +               return -EINVAL;
> > > > +
> > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > +       msg.req.vq_group_asid.group = group;
> > > > +       msg.req.vq_group_asid.asid = asid;
> > > > +
> > > > +       return vduse_dev_msg_sync(dev, &msg);
> > > > +}
> > > > +
> > > >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > >                                 struct vdpa_vq_state *state)
> > > >  {
> > > > @@ -788,13 +814,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > >         int ret;
> > > >
> > > > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > > > +       ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> > > >         if (ret)
> > > >                 return ret;
> > > >
> > > >         ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > >         if (ret) {
> > > > -               vduse_domain_clear_map(dev->domain, iotlb);
> > > > +               vduse_domain_clear_map(dev->domain[asid], iotlb);
> > > >                 return ret;
> > > >         }
> > > >
> > > > @@ -837,6 +863,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> > > >         .reset                  = vduse_vdpa_reset,
> > > >         .set_map                = vduse_vdpa_set_map,
> > > > +       .set_group_asid         = vduse_set_group_asid,
> > > >         .free                   = vduse_vdpa_free,
> > > >  };
> > > >
> > > > @@ -845,9 +872,12 @@ static void vduse_dev_sync_single_for_device(struct device *dev,
> > > >                                              enum dma_data_direction dir)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > >
> > > > -       vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > > +       for (int i = 0; i < vdev->nas; i++) {
> > > > +               struct vduse_iova_domain *domain = vdev->domain[i];
> > >
> > > Interesting, I thought there could be a way to deduce the iova domain
> > > from the dma device since virtio supports per virtqueue dma device
> > > now. For example, I don't see get_vq_dma_dev() implemented in this
> > > patch.
> >
> > vhost_vdpa does not interact with it so it was not needed for the RFC.
> > For example, the simulator does not implement it either and it works
> > with ASID.
>
> Right, the reason for it is:
>
> 1) in vdpasim_create() we assign vdpa device as the device device:
>
> vdpasim->vdpa.dma_dev = dev;
>
> 2) since the vDPA device itself can't do DMA, it tricks the DMA ops to use a PA
> 3) we do 1:1 mapping in vringh IOTLB by default to make the PA trick work:
>
>                vhost_iotlb_add_range(&vdpasim->iommu[i], 0, ULONG_MAX, 0,
>                                       VHOST_MAP_RW);
>
> 4) This trick works the case oneday virtio_vdpa may use
> set_group_asid() as well, as all the AS are using PA
>
> Note that it's 2) that is blamed by the DMA maintainer.
>
> It looks different from VDUSE here:
>
> 1) VDUSE has its own dma_ops, each as is backed by different IOVA
> domain as well as IOVA address
> 2) vduse_domain_sync_single_for_device() will trigger unnecessary
> IOTLB synchronizations
>
> >
> > > But anyhow maybe we need to revisit this point as DMA
> > > mainatiner ask to fix the abuse of the dma device so we don't need a
> > > trick for dma dev anymore if we design the new mapping API in virtio
> > > core correctly.
> > >
> > > > +
> > > > +               vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > > +       }
> > > >  }
> > > >
> > > >  static void vduse_dev_sync_single_for_cpu(struct device *dev,
> > > > @@ -855,9 +885,12 @@ static void vduse_dev_sync_single_for_cpu(struct device *dev,
> > > >                                              enum dma_data_direction dir)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > >
> > > > -       vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > > +       for (int i = 0; i < vdev->nas; i++) {
> > > > +               struct vduse_iova_domain *domain = vdev->domain[i];
> > > > +
> > > > +               vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > > +       }
> > > >  }
> > > >
> > > >  static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > > > @@ -866,7 +899,7 @@ static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > > >                                      unsigned long attrs)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > >
> > > Any reason that we need to assume asid 0 in this case?
> > >
> >
> > At this moment, virtio_vdpa is the only one using these functions and
> > it is not able to change ASID. So this must be 0 for the moment. But
> > yes, this should be better addressed by knowing what vq group is this
> > referring to.
>
> My plan is to
>
> 1) introduce map_ops() in the virtio_device()
> 2) in PCI, it will be converted to DMA ops
> 3) for vDPA, it will be converted to vDPA specific ops
> 4) get_vq_dma_dev() will be converted to get_vq_map_token(), so each
> vq could provide it's own token
> 5) when the virtio core tries to do mapping, it will pass per vq dma
> token as parameters
> 6) so vduse here can receive a per virtqueue token where it can map it
> to group then ASID
>
> >
> > > >
> > > >         return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > >  }
> > > > @@ -876,7 +909,7 @@ static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > > >                                 unsigned long attrs)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > > >
> > > >         return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > >  }
> > > > @@ -886,7 +919,7 @@ static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > > >                                         unsigned long attrs)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > > >         unsigned long iova;
> > > >         void *addr;
> > > >
> > > > @@ -906,17 +939,25 @@ static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > > >                                         unsigned long attrs)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > > +       struct vduse_iova_domain *domain = vdev->domain[0];
> > > >
> > > >         vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > > >  }
> > > >
> > > > +/* TODO check if this is correct */
> > > >  static size_t vduse_dev_max_mapping_size(struct device *dev)
> > > >  {
> > > >         struct vduse_dev *vdev = dev_to_vduse(dev);
> > > > -       struct vduse_iova_domain *domain = vdev->domain;
> > > > +       size_t max_mapping_size = 0;
> > > > +
> > > > +       for (int i = 0; i < vdev->nas; i++) {
> > > > +               struct vduse_iova_domain *domain = vdev->domain[i];
> > > >
> > > > -       return domain->bounce_size;
> > > > +               if (domain->bounce_size > max_mapping_size)
> > > > +                       max_mapping_size = domain->bounce_size;
> > > > +       }
> > > > +
> > > > +       return max_mapping_size;
> > > >  }
> > > >
> > > >  static const struct dma_map_ops vduse_dev_dma_ops = {
> > > > @@ -1054,31 +1095,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > >         return ret;
> > > >  }
> > > >
> > > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > > >                                 u64 iova, u64 size)
> > > >  {
> > > >         int ret;
> > > >
> > > >         mutex_lock(&dev->mem_lock);
> > > >         ret = -ENOENT;
> > > > -       if (!dev->umem)
> > > > +       if (!dev->umem[asid])
> > > >                 goto unlock;
> > > >
> > > >         ret = -EINVAL;
> > > > -       if (!dev->domain)
> > > > +       if (!dev->domain[asid])
> > > >                 goto unlock;
> > > >
> > > > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > +       if (dev->umem[asid]->iova != iova ||
> > > > +           size != dev->domain[asid]->bounce_size)
> > > >                 goto unlock;
> > > >
> > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > -                                   dev->umem->npages, true);
> > > > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > > -       mmdrop(dev->umem->mm);
> > > > -       vfree(dev->umem->pages);
> > > > -       kfree(dev->umem);
> > > > -       dev->umem = NULL;
> > > > +       vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> > > > +       unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> > > > +                                   dev->umem[asid]->npages, true);
> > > > +       atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> > > > +       mmdrop(dev->umem[asid]->mm);
> > > > +       vfree(dev->umem[asid]->pages);
> > > > +       kfree(dev->umem[asid]);
> > > > +       dev->umem[asid] = NULL;
> > > >         ret = 0;
> > > >  unlock:
> > > >         mutex_unlock(&dev->mem_lock);
> > > > @@ -1086,7 +1128,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > >  }
> > > >
> > > >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > -                             u64 iova, u64 uaddr, u64 size)
> > > > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> > > >  {
> > > >         struct page **page_list = NULL;
> > > >         struct vduse_umem *umem = NULL;
> > > > @@ -1094,14 +1136,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >         unsigned long npages, lock_limit;
> > > >         int ret;
> > > >
> > > > -       if (!dev->domain || !dev->domain->bounce_map ||
> > > > -           size != dev->domain->bounce_size ||
> > > > +       if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> > > > +           size != dev->domain[asid]->bounce_size ||
> > > >             iova != 0 || uaddr & ~PAGE_MASK)
> > > >                 return -EINVAL;
> > > >
> > > >         mutex_lock(&dev->mem_lock);
> > > >         ret = -EEXIST;
> > > > -       if (dev->umem)
> > > > +       if (dev->umem[asid])
> > > >                 goto unlock;
> > > >
> > > >         ret = -ENOMEM;
> > > > @@ -1125,7 +1167,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >                 goto out;
> > > >         }
> > > >
> > > > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > > +       ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> > > >                                                  page_list, pinned);
> > > >         if (ret)
> > > >                 goto out;
> > > > @@ -1138,7 +1180,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >         umem->mm = current->mm;
> > > >         mmgrab(current->mm);
> > > >
> > > > -       dev->umem = umem;
> > > > +       dev->umem[asid] = umem;
> > > >  out:
> > > >         if (ret && pinned > 0)
> > > >                 unpin_user_pages(page_list, pinned);
> > > > @@ -1181,26 +1223,42 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >
> > > >         switch (cmd) {
> > > >         case VDUSE_IOTLB_GET_FD: {
> > > > -               struct vduse_iotlb_entry entry;
> > > > +               struct vduse_iotlb_entry_v2 entry = {};
> > > > +               struct vduse_iotlb_entry entry_old;
> > > >                 struct vhost_iotlb_map *map;
> > > >                 struct vdpa_map_file *map_file;
> > > >                 struct file *f = NULL;
> > > >
> > > >                 ret = -EFAULT;
> > > > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > -                       break;
> > > > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > +                               break;
> > > > +               } else {
> > > > +                       if (copy_from_user(&entry_old, argp,
> > > > +                                          sizeof(entry_old)))
> > > > +                               break;
> > > > +
> > > > +                       entry.offset = entry_old.offset;
> > > > +                       entry.start = entry_old.start;
> > > > +                       entry.last = entry_old.last;
> > > > +                       entry.perm = entry_old.perm;
> > >
> > > I wonder if a new ioctl is needed.
> > >
> >
> > The problem is that vduse_iotlb_entry is already used in full, can we
> > extend the argument size without introducing the new ioctl?
>
> Yes, I think it should work if VDUSE_API_VERSION_1 makes sense.
>
> >
> > > > +               }
> > > >
> > > >                 ret = -EINVAL;
> > > >                 if (entry.start > entry.last)
> > > >                         break;
> > > >
> > > > +               if (entry.asid >= dev->nas)
> > > > +                       break;
> > > > +
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               if (!dev->domain) {
> > > > +               /* TODO accessing an array with idx from userspace, mitigations? */
> > > > +               if (!dev->domain[entry.asid]) {
> > > >                         mutex_unlock(&dev->domain_lock);
> > > >                         break;
> > > >                 }
> > > > -               spin_lock(&dev->domain->iotlb_lock);
> > > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > > +               spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> > > > +               map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> > > >                                               entry.start, entry.last);
> > > >                 if (map) {
> > > >                         map_file = (struct vdpa_map_file *)map->opaque;
> > > > @@ -1210,7 +1268,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >                         entry.last = map->last;
> > > >                         entry.perm = map->perm;
> > > >                 }
> > > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > > +               spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 ret = -EINVAL;
> > > >                 if (!f)
> > > > @@ -1360,12 +1418,18 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >                         break;
> > > >
> > > >                 ret = -EINVAL;
> > > > +               /* TODO: Using asid from userspace, need to mitigate? */
> > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > -                                sizeof(umem.reserved)))
> > > > +                                sizeof(umem.reserved)) ||
> > > > +                   !is_mem_zero((const char *)umem.reserved2,
> > > > +                                sizeof(umem.reserved2)) ||
> > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > +                    umem.asid != 0) ||
> > > > +                    umem.asid >= dev->nas)
> > >
> > > This is probably a hint that we need a new uAPI, see my comment for changelog.
> > >
> > > >                         break;
> > > >
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > > > +               ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> > > >                                          umem.uaddr, umem.size);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 break;
> > > > @@ -1378,15 +1442,23 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >                         break;
> > > >
> > > >                 ret = -EINVAL;
> > > > +               /* TODO: Using asid from userspace, need to mitigate? */
> > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > -                                sizeof(umem.reserved)))
> > > > +                                sizeof(umem.reserved)) ||
> > > > +                   !is_mem_zero((const char *)umem.reserved2,
> > > > +                                sizeof(umem.reserved2)) ||
> > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > +                    umem.asid != 0) ||
> > > > +                    umem.asid >= dev->nas)
> > > >                         break;
> > > > +
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > > +               ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> > > >                                            umem.size);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 break;
> > > >         }
> > > > +       /* TODO can we merge this with GET_FD? */
> > > >         case VDUSE_IOTLB_GET_INFO: {
> > > >                 struct vduse_iova_info info;
> > > >                 struct vhost_iotlb_map *map;
> > > > @@ -1399,27 +1471,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >                 if (info.start > info.last)
> > > >                         break;
> > > >
> > > > +               if (info.asid >= dev->nas)
> > > > +                       break;
> > > > +
> > > >                 if (!is_mem_zero((const char *)info.reserved,
> > > >                                  sizeof(info.reserved)))
> > > >                         break;
> > > >
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               if (!dev->domain) {
> > > > +               /* TODO asid comes from userspace. mitigations? */
> > > > +               if (!dev->domain[info.asid]) {
> > > >                         mutex_unlock(&dev->domain_lock);
> > > >                         break;
> > > >                 }
> > > > -               spin_lock(&dev->domain->iotlb_lock);
> > > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > > +               spin_lock(&dev->domain[info.asid]->iotlb_lock);
> > > > +               map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> > > >                                               info.start, info.last);
> > > >                 if (map) {
> > > >                         info.start = map->start;
> > > >                         info.last = map->last;
> > > >                         info.capability = 0;
> > > > -                       if (dev->domain->bounce_map && map->start == 0 &&
> > > > -                           map->last == dev->domain->bounce_size - 1)
> > > > +                       if (dev->domain[info.asid]->bounce_map &&
> > > > +                           map->start == 0 &&
> > > > +                           map->last == dev->domain[info.asid]->bounce_size - 1)
> > > >                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
> > > >                 }
> > > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > > +               spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 if (!map)
> > > >                         break;
> > > > @@ -1444,8 +1521,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > > >         struct vduse_dev *dev = file->private_data;
> > > >
> > > >         mutex_lock(&dev->domain_lock);
> > > > -       if (dev->domain)
> > > > -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > > > +       for (int i = 0; i < dev->nas; i++) {
> > > > +               if (dev->domain[i]) {
> > > > +                       vduse_dev_dereg_umem(dev, i, 0,
> > > > +                                            dev->domain[i]->bounce_size);
> > > > +                       dev->domain[i] = NULL;
> > > > +               }
> > > > +       }
> > > >         mutex_unlock(&dev->domain_lock);
> > > >         spin_lock(&dev->msg_lock);
> > > >         /* Make sure the inflight messages can processed after reconncection */
> > > > @@ -1715,8 +1797,10 @@ static int vduse_destroy_dev(char *name)
> > > >         idr_remove(&vduse_idr, dev->minor);
> > > >         kvfree(dev->config);
> > > >         vduse_dev_deinit_vqs(dev);
> > > > -       if (dev->domain)
> > > > -               vduse_domain_destroy(dev->domain);
> > > > +       for (int i = 0; i < dev->nas; i++) {
> > > > +               if (dev->domain[i])
> > > > +                       vduse_domain_destroy(dev->domain[i]);
> > > > +       }
> > > >         kfree(dev->name);
> > > >         vduse_dev_destroy(dev);
> > > >         module_put(THIS_MODULE);
> > > > @@ -1824,7 +1908,7 @@ static ssize_t bounce_size_store(struct device *device,
> > > >
> > > >         ret = -EPERM;
> > > >         mutex_lock(&dev->domain_lock);
> > > > -       if (dev->domain)
> > > > +       if (dev->domain[0] && dev->domain[1])
> > > >                 goto unlock;
> > > >
> > > >         ret = kstrtouint(buf, 10, &bounce_size);
> > > > @@ -1882,9 +1966,18 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > > >                                 config->ngroups, VDUSE_MAX_VQ_GROUPS);
> > > >                         goto err_ngroups;
> > > >                 }
> > > > +
> > > > +               if (config->nas > VDUSE_MAX_ASID) {
> > > > +                       pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> > > > +                               config->nas, VDUSE_MAX_ASID);
> > > > +                       goto err_nas;
> > > > +               }
> > > > +
> > > >                 dev->ngroups = config->ngroups ?: 1;
> > > > +               dev->nas = config->nas ?: 1;
> > > >         } else {
> > > >                 dev->ngroups = 1;
> > > > +               dev->nas = 1;
> > > >         }
> > > >         dev->name = kstrdup(config->name, GFP_KERNEL);
> > > >         if (!dev->name)
> > > > @@ -1923,6 +2016,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > > >         kfree(dev->name);
> > > >  err_ngroups:
> > > >  err_str:
> > > > +err_nas:
> > > >         vduse_dev_destroy(dev);
> > > >  err:
> > > >         return ret;
> > > > @@ -2015,7 +2109,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> > > >         if (!control)
> > > >                 return -ENOMEM;
> > > >
> > > > -       control->api_version = VDUSE_API_VERSION;
> > > >         file->private_data = control;
> > > >
> > > >         return 0;
> > > > @@ -2040,17 +2133,15 @@ static struct vduse_mgmt_dev *vduse_mgmt;
> > > >  static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > > >  {
> > > >         struct vduse_vdpa *vdev;
> > > > -       __u32 ngroups = 1;
> > > > +       __u32 ngroups = dev->ngroups;
> > > >         int ret;
> > > >
> > > >         if (dev->vdev)
> > > >                 return -EEXIST;
> > > >
> > > > -       if (vdev->dev->api_version >= VDUSE_API_VERSION_1)
> > > > -               ngroups = vdev->dev->ngroups;
> > > > -
> > > > +       /* TODO do we need to store ngroups and nas? vdpa device already store it for us */
> > > >         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > > > -                                &vduse_vdpa_config_ops, ngroups, 1, name,
> > > > +                                &vduse_vdpa_config_ops, ngroups, dev->nas, name,
> > > >                                  true);
> > > >         if (IS_ERR(vdev))
> > > >                 return PTR_ERR(vdev);
> > > > @@ -2088,11 +2179,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > > >                 return ret;
> > > >
> > > >         mutex_lock(&dev->domain_lock);
> > > > -       if (!dev->domain)
> > > > -               dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > > -                                                 dev->bounce_size);
> > > > +       ret = 0;
> > > > +
> > > > +       /* TODO we could delay the creation of the domain */
> > > > +       for (int i = 0; i < dev->nas; ++i) {
> > > > +               if (!dev->domain[i])
> > > > +                       dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > > +                                                            dev->bounce_size);
> > > > +               if (!dev->domain[i]) {
> > > > +                       ret = -ENOMEM;
> > > > +                       for (int j = 0; j < i; ++j)
> > > > +                               vduse_domain_destroy(dev->domain[j]);
> > > > +                       goto err_domain;
> > > > +               }
> > > > +       }
> > > > +
> > > >         mutex_unlock(&dev->domain_lock);
> > > > -       if (!dev->domain) {
> > > > +       if (ret == -ENOMEM) {
> > > >                 put_device(&dev->vdev->vdpa.dev);
> > > >                 return -ENOMEM;
> > > >         }
> > > > @@ -2101,13 +2204,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > > >         if (ret) {
> > > >                 put_device(&dev->vdev->vdpa.dev);
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               vduse_domain_destroy(dev->domain);
> > > > -               dev->domain = NULL;
> > > > +               for (int i = 0; i < dev->nas; i++) {
> > > > +                       if (dev->domain[i]) {
> > > > +                               vduse_domain_destroy(dev->domain[i]);
> > > > +                               dev->domain[i] = NULL;
> > > > +                       }
> > > > +               }
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 return ret;
> > > >         }
> > > >
> > > >         return 0;
> > > > +
> > > > +err_domain:
> > > > +       /* TODO do I need to call put_device? */
> > > > +       mutex_unlock(&dev->domain_lock);
> > > > +       return ret;
> > > >  }
> > > >
> > > >  static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > > index a779bcddac58..3a17a0b4e938 100644
> > > > --- a/include/uapi/linux/vduse.h
> > > > +++ b/include/uapi/linux/vduse.h
> > > > @@ -46,7 +46,8 @@ struct vduse_dev_config {
> > > >         __u32 vq_num;
> > > >         __u32 vq_align;
> > > >         __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > > > -       __u32 reserved[12];
> > > > +       __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > > > +       __u32 reserved[11];
> > > >         __u32 config_size;
> > > >         __u8 config[];
> > > >  };
> > > > @@ -81,6 +82,17 @@ struct vduse_iotlb_entry {
> > > >         __u8 perm;
> > > >  };
> > > >
> > > > +struct vduse_iotlb_entry_v2 {
> > > > +       __u64 offset;
> > > > +       __u64 start;
> > > > +       __u64 last;
> > > > +       __u32 asid;
> > > > +#define VDUSE_ACCESS_RO 0x1
> > > > +#define VDUSE_ACCESS_WO 0x2
> > > > +#define VDUSE_ACCESS_RW 0x3
> > > > +       __u8 perm;
> > > > +};
> > > > +
> > > >  /*
> > > >   * Find the first IOVA region that overlaps with the range [start, last]
> > > >   * and return the corresponding file descriptor. Return -EINVAL means the
> > > > @@ -171,6 +183,16 @@ struct vduse_vq_group {
> > > >         __u32 num;
> > > >  };
> > > >
> > > > +/**
> > > > + * struct vduse_vq_group_asid - ASID of a virtqueue group
> > > > + * @group: Index of the virtqueue group
> > > > + * @asid: Address space ID of the group
> > > > + */
> > > > +struct vduse_vq_group_asid {
> > > > +       __u32 group;
> > > > +       __u32 asid;
> > > > +};
> > > > +
> > > >  /**
> > > >   * struct vduse_vq_info - information of a virtqueue
> > > >   * @index: virtqueue index
> > > > @@ -231,7 +253,9 @@ struct vduse_vq_eventfd {
> > > >   * @uaddr: start address of userspace memory, it must be aligned to page size
> > > >   * @iova: start of the IOVA region
> > > >   * @size: size of the IOVA region
> > > > + * @asid: Address space ID of the IOVA region
> > > >   * @reserved: for future use, needs to be initialized to zero
> > > > + * @reserved2: for future use, needs to be initialized to zero
> > > >   *
> > > >   * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > > >   * ioctls to register/de-register userspace memory for IOVA regions
> > > > @@ -240,7 +264,9 @@ struct vduse_iova_umem {
> > > >         __u64 uaddr;
> > > >         __u64 iova;
> > > >         __u64 size;
> > > > -       __u64 reserved[3];
> > > > +       __u32 asid;
> > > > +       __u32 reserved[1];
> > >
> > > Basically, I'm not sure we can assume reserved to be zero for API_VERSION == 1.
> > >
> >
> > Why not? The ioctl returns -EINVAL if it is not 0, so either a v0 or
> > v1 ioctl user must set it if it wants the ioctl to return success.
>
> Typo, I mean API_VERSION == 0. It is reserved but the kernel doesn't
> mandate it to be zero. But it doesn't matter probably as it will be
> ruled by API_VERSION as you explain above.
>

The kernel ioctl machinery does not mandate that they be 0, but
vduse_dev_ioctl checks for it and returns -EINVAL in case they're not.

This patch also checks that asid, previously part of "reserved", stays
0 if VDUSE_API < 1.

> >
> > > > +       __u64 reserved2[2];
> > >
> > > Any reasons we can't reuse reserved array?
> > >
> >
> > I don't get this comment, we're reusing it, isn't it? I'm just
> > splitting the u64[3] into u32[1]+u64[2]. Maybe it is more elegant to
> > use an u32[5] or similar?
>
> Yes, I mean reuse the u32 array. (but not a must I guess unless the
> ABI is stable).
>

I'm happy with the u32 array too!

> >
> > > >  };
> > > >
> > > >  /* Register userspace memory for IOVA regions */
> > > > @@ -264,7 +290,8 @@ struct vduse_iova_info {
> > > >         __u64 last;
> > > >  #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> > > >         __u64 capability;
> > > > -       __u64 reserved[3];
> > > > +       __u64 asid; /* Only if device API version >= 1 */
> > > > +       __u64 reserved[2];
> > >
> > > Same here.
> > >
> > > >  };
> > > >
> > > >  /*
> > > > @@ -287,6 +314,7 @@ enum vduse_req_type {
> > > >         VDUSE_SET_STATUS,
> > > >         VDUSE_UPDATE_IOTLB,
> > > >         VDUSE_GET_VQ_GROUP,
> > > > +       VDUSE_SET_VQ_GROUP_ASID,
> > > >  };
> > > >
> > > >  /**
> > > > @@ -342,6 +370,8 @@ struct vduse_dev_request {
> > > >                 struct vduse_dev_status s;
> > > >                 struct vduse_iova_range iova;
> > > >                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > > +               /* Only if vduse api version >= 1 */
> > > > +               struct vduse_vq_group_asid vq_group_asid;
> > >
> > > This seems to break the uAPI as sizeof(struct vduse_dev_request) changes.
> > >
> >
> > It should not change. vduse_vq_group is 64 bits, smaller than padding
> > which is 32bits*32 = 1024bits. And they're both in the same union.
>
> You're right.
>
> >
> > > >                 __u32 padding[32];
> > > >         };
> > > >  };
> > > > @@ -365,6 +395,8 @@ struct vduse_dev_response {
> > > >         union {
> > > >                 struct vduse_vq_state vq_state;
> > > >                 struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > > +               /* Only if vduse api version >= 1 */
> > > > +               struct vduse_vq_group_asid vq_group_asid;
> > >
> > > Same here.
> > >
> >
> > Same reply :).
>
> Yes.
>
> >
> > > >                 __u32 padding[32];
> > > >         };
> > > >  };
> > > > --
> > > > 2.49.0
> > > >
> > >
> > > Thanks
> > >
> >
>
> Thanks
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/6] vduse: add v1 API definition
  2025-08-07 10:55           ` Eugenio Perez Martin
@ 2025-08-08  0:50             ` Jason Wang
  2025-08-10 10:18               ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-08-08  0:50 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Thu, Aug 7, 2025 at 6:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > This allows to define all functions checking the API version set by the
> > > > > > userland device.
> > > > > >
> > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > >
> > > > > It might be worth clarifying how it works.
> > > > >
> > > > > For example,
> > > > >
> > > > > 1) would VDUSE behave differently or if it's just some new ioctls
> > >
> > > I'd like to test more in-depth, but a device can just bump the version
> > > ID and then implement the replies to the vduse messages. No need to
> > > implement new ioctls. If the VDUSE device sets 0 in either number of
> > > ASID or vq groups, the kernel assumes 1.
> >
> > Right, this is the way we use now and I think maybe we can document
> > this somewhere.
> >
> > >
> > > But you have a very good point here, I think it is wise to evaluate
> > > the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> > > device only has one vq group and one ASID, it can always return group
> > > 0 and asid 0 for everything, and fail every attempt to set asid != 0.
> >
> > Yes, and vhost-vDPA needs to guard against the misconfiguration.
> >
> > > This
> > > way, the update is transparent for the VDUSE device, and future
> > > devices do not need to implement the reply of these. What do you
> > > think?
> >
> > This should work.
> >
> > >
> > > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > > version for backward compatibility?
> > > >
> > > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > > >
> > > > I think we need to think if it complicates the migration compatibility or not.
> > > >
> > >
> > > Do you mean migration as "increase the VDUSE version number", not "VM
> > > live migration from vduse version 0 to vduse version 1", isn't it? The
> > > second should not have any problem but I haven't tested it.
> >
> > I mean if we bump the version, we can't migrate from version 1 to
> > version 0. Or we can offload this to the management (do we need to
> > extend the vdpa tool for this)?
> >
>
> I just noticed I left this unreplied. But I still do not get what
> migrate means here :).
>
> If migrate means running current VDUSE devices on a kernel with this
> series applied, those devices don't set the V1 API, so they have one
> vq group and one asid. I'm actually testing this with my libfuse+VDUSE
> modifications, which don't use V1 at all. Adding this explanation to
> the patch as it is a very good point indeed.

Right.

>
> If it means to migrate a guest from using a V1 VDUSE device to a V0
> device "it should work", as it is just a backend implementation
> detail.

For example, the src is a VDUSE device with multiqueue support (v1) but
the dest doesn't have this support (v0). I think QEMU should fail to
launch on the dest.

> If we migrate from or to a vdpa device backed by hardware, for
> example, one of the devices does not even have the concept of VDUSE
> API version.
>
> In the case of net, it does not work at the moment because the only
> way to set features like mq are through the shadow CVQ.

I think you mean QEMU should fail; I'm not sure this is friendly to libvirt.

> If a VDUSE net
> device implements, let's say, admin vq, something similar to PF and
> VF, and dirty bitmap, I guess it should be possible. Maybe it is
> easier to play with this with block devices.
>

It looks like two separate topics:

1) harden CVQ for VDUSE
2) support hardware dirty page tracking

Thanks



* Re: [RFC 1/6] vduse: add v1 API definition
  2025-08-08  0:50             ` Jason Wang
@ 2025-08-10 10:18               ` Eugenio Perez Martin
  2025-08-11  2:58                 ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-08-10 10:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Fri, Aug 8, 2025 at 2:50 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Aug 7, 2025 at 6:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > This allows to define all functions checking the API version set by the
> > > > > > > userland device.
> > > > > > >
> > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > >
> > > > > > It might be worth clarifying how it works.
> > > > > >
> > > > > > For example,
> > > > > >
> > > > > > 1) would VDUSE behave differently or if it's just some new ioctls
> > > >
> > > > I'd like to test more in-depth, but a device can just bump the version
> > > > ID and then implement the replies to the vduse messages. No need to
> > > > implement new ioctls. If the VDUSE device sets 0 in either number of
> > > > ASID or vq groups, the kernel assumes 1.
> > >
> > > Right, this is the way we use now and I think maybe we can document
> > > this somewhere.
> > >
> > > >
> > > > But you have a very good point here, I think it is wise to evaluate
> > > > the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> > > > device only has one vq group and one ASID, it can always return group
> > > > 0 and asid 0 for everything, and fail every attempt to set asid != 0.
> > >
> > > Yes, and vhost-vDPA needs to guard against the misconfiguration.
> > >
> > > > This
> > > > way, the update is transparent for the VDUSE device, and future
> > > > devices do not need to implement the reply of these. What do you
> > > > think?
> > >
> > > This should work.
> > >
> > > >
> > > > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > > > version for backward compatibility?
> > > > >
> > > > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > > > >
> > > > > I think we need to think if it complicates the migration compatibility or not.
> > > > >
> > > >
> > > > Do you mean migration as "increase the VDUSE version number", not "VM
> > > > live migration from vduse version 0 to vduse version 1", isn't it? The
> > > > second should not have any problem but I haven't tested it.
> > >
> > > I mean if we bump the version, we can't migrate from version 1 to
> > > version 0. Or we can offload this to the management (do we need to
> > > extend the vdpa tool for this)?
> > >
> >
> > I just noticed I left this unreplied. But I still do not get what
> > migrate means here :).
> >
> > If migrate means running current VDUSE devices on a kernel with this
> > series applied, those devices don't set the V1 API, so they have one
> > vq group and one asid. I'm actually testing this with my libfuse+VDUSE
> > modifications, which don't use V1 at all. Adding this explanation to
> > the patch as it is a very good point indeed.
>
> Right.
>
> >
> > If it means to migrate a guest from using a V1 VDUSE device to a V0
> > device "it should work", as it is just a backend implementation
> > detail.
>
> For example, the src is a VDUSE device with multiqueue support (v1) but
> the dest doesn't have this support (v0). I think QEMU should fail to
> launch on the dest.
>
> > If we migrate from or to a vdpa device backed by hardware, for
> > example, one of the devices does not even have the concept of VDUSE
> > API version.
> >
> > In the case of net, it does not work at the moment because the only
> > way to set features like mq are through the shadow CVQ.
>
> I think you mean QEMU should fail; I'm not sure this is friendly to libvirt.
>

No, I think QEMU should not transmit vdpa backend properties not
visible to the guest, so we don't get an explosion of properties that
are hard to get. Expanding on this, QEMU is not even able to know if
it is talking with VDUSE, vdpa_sim, or a vdpa device backed by
hardware. And I think we don't want QEMU to have this information. So
sending the VDUSE version implies removing a lot of useful
abstractions.

In the case of net, the destination QEMU should fail if it is not able
to restore the device state. At this moment this implies having at
least two ASIDs if the net device has a CVQ, and that the CVQ is in its
own ASID, but that may not be true in the future.

But QEMU does not check whether that is the case when migrating
between two vdpa_net_sim devices where one supports ASID and the other
doesn't. And this is only a requirement for net, not for other vdpa
devices.

So if we want to develop it, I think it is independent of this series.

[...]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/6] vduse: add v1 API definition
  2025-08-10 10:18               ` Eugenio Perez Martin
@ 2025-08-11  2:58                 ` Jason Wang
  2025-08-11  9:01                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-08-11  2:58 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Sun, Aug 10, 2025 at 6:18 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Aug 8, 2025 at 2:50 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Aug 7, 2025 at 6:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > This allows to define all functions checking the API version set by the
> > > > > > > > userland device.
> > > > > > > >
> > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > >
> > > > > > > It might be worth clarifying how it works.
> > > > > > >
> > > > > > > For example,
> > > > > > >
> > > > > > > 1) would VDUSE behave differently or if it's just some new ioctls
> > > > >
> > > > > I'd like to test more in-depth, but a device can just bump the version
> > > > > ID and then implement the replies to the vduse messages. No need to
> > > > > implement new ioctls. If the VDUSE device sets 0 in either number of
> > > > > ASID or vq groups, the kernel assumes 1.
> > > >
> > > > Right, this is the way we use now and I think maybe we can document
> > > > this somewhere.
> > > >
> > > > >
> > > > > But you have a very good point here, I think it is wise to evaluate
> > > > > the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> > > > > device only has one vq group and one ASID, it can always return group
> > > > > 0 and asid 0 for everything, and fail every try to set asid != 0.
> > > >
> > > > Yes, and vhost-vDPA needs to guard against the misconfiguration.
> > > >
> > > > > This
> > > > > way, the update is transparent for the VDUSE device, and future
> > > > > devices do not need to implement the reply of these. What do you
> > > > > think?
> > > >
> > > > This should work.
> > > >
> > > > >
> > > > > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > > > > version for backward compatibility?
> > > > > >
> > > > > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > > > > >
> > > > > > I think we need to think if it complicates the migration compatibility or not.
> > > > > >
> > > > >
> > > > > Do you mean migration as "increase the VDUSE version number", not "VM
> > > > > live migration from vduse version 0 to vduse version 1", isn't it? The
> > > > > second should not have any problem but I haven't tested it.
> > > >
> > > > I mean if we bump the version, we can't migrate from version 1 to
> > > > version 0. Or we can offload this to the management (do we need to
> > > > extend the vdpa tool for this)?
> > > >
> > >
> > > I just noticed I left this unreplied. But I still do not get what
> > > migrate means here :).
> > >
> > > If migrate means to run current VDUSE devices on kernel with this
> > > series applied these devices don't set V1 API so they have one vq
> > > group, and one asid. I'm actually testing this with my libfuse+VDUSE
> > > modifications that don't use V1 at all. Adding this explanation to the
> > > patch as it is a very good point indeed.
> >
> > Right.
> >
> > >
> > > If it means to migrate a guest from using a V1 VDUSE device to a V0
> > > device "it should work", as it is just a backend implementation
> > > detail.
> >
> > For example src is the VDUSE with multiqueue support (v1) but dest
> > doesn't have this support (v0). I think the qemu should fail to launch
> > in dest.
> >
> > > If we migrate from or to a vdpa device backed by hardware, for
> > > example, one of the devices does not even have the concept of VDUSE
> > > API version.
> > >
> > > In the case of net, it does not work at the moment because the only
> > > way to set features like mq are through the shadow CVQ.
> >
> > I think you mean qemu should fail, I'm not sure this is friendly to libvirt.
> >
>
> No, I think QEMU should not transmit vdpa backend properties not
> visible to the guest, so we don't get an explosion of properties that
> are hard to get. Expanding on this, QEMU is not even able to know if
> it is talking with VDUSE, vdpa_sim, or a vdpa device backed by
> hardware. And I think we don't want QEMU to have this information. So
> sending the VDUSE version implies removing a lot of useful
> abstractions.
>
> In the case of net, the destination QEMU should fail if it is not able
> to restore the device state. At this moment this implies to have at
> least two ASID if the net device has CVQ, and that CVQ is in its own
> ASID, but that may not be true in the future.
>
> But QEMU does not check if that is the case migrating between two
> vdpa_net_sim if one supports ASID but the other doesn't.

Ok, I think I must be missing something; I need some context here. For
example, is the shadow cvq option used by libvirt or not? (Or is it
enabled by default if QEMU detects that the cvq has its own group?)

If shadow cvq is neither used by libvirt nor automatically enabled, we
need to make it work for libvirt first.

If it relies on libvirt to enable it explicitly, is libvirt expected
to detect the vDPA capability (I guess that is not what libvirt wants)?

If shadow cvq is enabled automatically, migrating from V2 to V1 is
fine but not the reverse.

> And this is
> only a requirement for net, not for other vdpa devices.
>
> So if we want to develop it, I think it is independent of this series.

Thanks

>
> [...]
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/6] vduse: add v1 API definition
  2025-08-11  2:58                 ` Jason Wang
@ 2025-08-11  9:01                   ` Eugenio Perez Martin
  2025-08-12  2:55                     ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-08-11  9:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Aug 11, 2025 at 4:58 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sun, Aug 10, 2025 at 6:18 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, Aug 8, 2025 at 2:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Aug 7, 2025 at 6:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > This allows to define all functions checking the API version set by the
> > > > > > > > > userland device.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > >
> > > > > > > > It might be worth clarifying how it works.
> > > > > > > >
> > > > > > > > For example,
> > > > > > > >
> > > > > > > > 1) would VDUSE behave differently or if it's just some new ioctls
> > > > > >
> > > > > > I'd like to test more in-depth, but a device can just bump the version
> > > > > > ID and then implement the replies to the vduse messages. No need to
> > > > > > implement new ioctls. If the VDUSE device sets 0 in either number of
> > > > > > ASID or vq groups, the kernel assumes 1.
> > > > >
> > > > > Right, this is the way we use now and I think maybe we can document
> > > > > this somewhere.
> > > > >
> > > > > >
> > > > > > But you have a very good point here, I think it is wise to evaluate
> > > > > > the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> > > > > > device only has one vq group and one ASID, it can always return group
> > > > > > 0 and asid 0 for everything, and fail every try to set asid != 0.
> > > > >
> > > > > Yes, and vhost-vDPA needs to guard against the misconfiguration.
> > > > >
> > > > > > This
> > > > > > way, the update is transparent for the VDUSE device, and future
> > > > > > devices do not need to implement the reply of these. What do you
> > > > > > think?
> > > > >
> > > > > This should work.
> > > > >
> > > > > >
> > > > > > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > > > > > version for backward compatibility?
> > > > > > >
> > > > > > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > > > > > >
> > > > > > > I think we need to think if it complicates the migration compatibility or not.
> > > > > > >
> > > > > >
> > > > > > Do you mean migration as "increase the VDUSE version number", not "VM
> > > > > > live migration from vduse version 0 to vduse version 1", isn't it? The
> > > > > > second should not have any problem but I haven't tested it.
> > > > >
> > > > > I mean if we bump the version, we can't migrate from version 1 to
> > > > > version 0. Or we can offload this to the management (do we need to
> > > > > extend the vdpa tool for this)?
> > > > >
> > > >
> > > > I just noticed I left this unreplied. But I still do not get what
> > > > migrate means here :).
> > > >
> > > > If migrate means to run current VDUSE devices on kernel with this
> > > > series applied these devices don't set V1 API so they have one vq
> > > > group, and one asid. I'm actually testing this with my libfuse+VDUSE
> > > > modifications that don't use V1 at all. Adding this explanation to the
> > > > patch as it is a very good point indeed.
> > >
> > > Right.
> > >
> > > >
> > > > If it means to migrate a guest from using a V1 VDUSE device to a V0
> > > > device "it should work", as it is just a backend implementation
> > > > detail.
> > >
> > > For example src is the VDUSE with multiqueue support (v1) but dest
> > > doesn't have this support (v0). I think the qemu should fail to launch
> > > in dest.
> > >
> > > > If we migrate from or to a vdpa device backed by hardware, for
> > > > example, one of the devices does not even have the concept of VDUSE
> > > > API version.
> > > >
> > > > In the case of net, it does not work at the moment because the only
> > > > way to set features like mq are through the shadow CVQ.
> > >
> > > I think you mean qemu should fail, I'm not sure this is friendly to libvirt.
> > >
> >
> > No, I think QEMU should not transmit vdpa backend properties not
> > visible to the guest, so we don't get an explosion of properties that
> > are hard to get. Expanding on this, QEMU is not even able to know if
> > it is talking with VDUSE, vdpa_sim, or a vdpa device backed by
> > hardware. And I think we don't want QEMU to have this information. So
> > sending the VDUSE version implies removing a lot of useful
> > abstractions.
> >
> > In the case of net, the destination QEMU should fail if it is not able
> > to restore the device state. At this moment this implies to have at
> > least two ASID if the net device has CVQ, and that CVQ is in its own
> > ASID, but that may not be true in the future.
> >
> > But QEMU does not check if that is the case migrating between two
> > vdpa_net_sim if one supports ASID but the other doesn't.
>
> Ok I think I must miss something. I need some context here. For
> example, Is the shadow cvq option used by libvirt or not? (Or it has
> been enabled by default if Qemu detect cvq has its own group?)
>
> If shadow cvq is neither used by libvirt nor automatically enabled, we
> need to make it work for libvirt first.
>
> If it relies on libvirt to enable it explicitly, is libvirt expected
> to detect the vDPA ability (I guess it is not what libvirt wants)?
> If shadow cvq is enabled automatically, migrating from V2 to V1 is
> fine but not the reverse.
>

QEMU enables the shadow CVQ automatically if all the conditions (ASID
support, the proper set of features, etc.) are met.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/6] vduse: add v1 API definition
  2025-08-11  9:01                   ` Eugenio Perez Martin
@ 2025-08-12  2:55                     ` Jason Wang
  2025-08-12  6:03                       ` Eugenio Perez Martin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-08-12  2:55 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Mon, Aug 11, 2025 at 5:02 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 4:58 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Sun, Aug 10, 2025 at 6:18 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Fri, Aug 8, 2025 at 2:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, Aug 7, 2025 at 6:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > This allows to define all functions checking the API version set by the
> > > > > > > > > > userland device.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > > >
> > > > > > > > > It might be worth clarifying how it works.
> > > > > > > > >
> > > > > > > > > For example,
> > > > > > > > >
> > > > > > > > > 1) would VDUSE behave differently or if it's just some new ioctls
> > > > > > >
> > > > > > > I'd like to test more in-depth, but a device can just bump the version
> > > > > > > ID and then implement the replies to the vduse messages. No need to
> > > > > > > implement new ioctls. If the VDUSE device sets 0 in either number of
> > > > > > > ASID or vq groups, the kernel assumes 1.
> > > > > >
> > > > > > Right, this is the way we use now and I think maybe we can document
> > > > > > this somewhere.
> > > > > >
> > > > > > >
> > > > > > > But you have a very good point here, I think it is wise to evaluate
> > > > > > > the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> > > > > > > device only has one vq group and one ASID, it can always return group
> > > > > > > 0 and asid 0 for everything, and fail every try to set asid != 0.
> > > > > >
> > > > > > Yes, and vhost-vDPA needs to guard against the misconfiguration.
> > > > > >
> > > > > > > This
> > > > > > > way, the update is transparent for the VDUSE device, and future
> > > > > > > devices do not need to implement the reply of these. What do you
> > > > > > > think?
> > > > > >
> > > > > > This should work.
> > > > > >
> > > > > > >
> > > > > > > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > > > > > > version for backward compatibility?
> > > > > > > >
> > > > > > > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > > > > > > >
> > > > > > > > I think we need to think if it complicates the migration compatibility or not.
> > > > > > > >
> > > > > > >
> > > > > > > Do you mean migration as "increase the VDUSE version number", not "VM
> > > > > > > live migration from vduse version 0 to vduse version 1", isn't it? The
> > > > > > > second should not have any problem but I haven't tested it.
> > > > > >
> > > > > > I mean if we bump the version, we can't migrate from version 1 to
> > > > > > version 0. Or we can offload this to the management (do we need to
> > > > > > extend the vdpa tool for this)?
> > > > > >
> > > > >
> > > > > I just noticed I left this unreplied. But I still do not get what
> > > > > migrate means here :).
> > > > >
> > > > > If migrate means to run current VDUSE devices on kernel with this
> > > > > series applied these devices don't set V1 API so they have one vq
> > > > > group, and one asid. I'm actually testing this with my libfuse+VDUSE
> > > > > modifications that don't use V1 at all. Adding this explanation to the
> > > > > patch as it is a very good point indeed.
> > > >
> > > > Right.
> > > >
> > > > >
> > > > > If it means to migrate a guest from using a V1 VDUSE device to a V0
> > > > > device "it should work", as it is just a backend implementation
> > > > > detail.
> > > >
> > > > For example src is the VDUSE with multiqueue support (v1) but dest
> > > > doesn't have this support (v0). I think the qemu should fail to launch
> > > > in dest.
> > > >
> > > > > If we migrate from or to a vdpa device backed by hardware, for
> > > > > example, one of the devices does not even have the concept of VDUSE
> > > > > API version.
> > > > >
> > > > > In the case of net, it does not work at the moment because the only
> > > > > way to set features like mq are through the shadow CVQ.
> > > >
> > > > I think you mean qemu should fail, I'm not sure this is friendly to libvirt.
> > > >
> > >
> > > No, I think QEMU should not transmit vdpa backend properties not
> > > visible to the guest, so we don't get an explosion of properties that
> > > are hard to get. Expanding on this, QEMU is not even able to know if
> > > it is talking with VDUSE, vdpa_sim, or a vdpa device backed by
> > > hardware. And I think we don't want QEMU to have this information. So
> > > sending the VDUSE version implies removing a lot of useful
> > > abstractions.
> > >
> > > In the case of net, the destination QEMU should fail if it is not able
> > > to restore the device state. At this moment this implies to have at
> > > least two ASID if the net device has CVQ, and that CVQ is in its own
> > > ASID, but that may not be true in the future.
> > >
> > > But QEMU does not check if that is the case migrating between two
> > > vdpa_net_sim if one supports ASID but the other doesn't.
> >
> > Ok I think I must miss something. I need some context here. For
> > example, Is the shadow cvq option used by libvirt or not? (Or it has
> > been enabled by default if Qemu detect cvq has its own group?)
> >
> > If shadow cvq is neither used by libvirt nor automatically enabled, we
> > need to make it work for libvirt first.
> >
> > If it relies on libvirt to enable it explicitly, is libvirt expected
> > to detect the vDPA ability (I guess it is not what libvirt wants)?
> > If shadow cvq is enabled automatically, migrating from V2 to V1 is
> > fine but not the reverse.
> >
>
> QEMU uses shadow CVQ automatically if all the conditions (ASID, proper
> set of features, etc) are supported.
>

Ok, so V1 implies non-migratable.  So I'm still confused about how to
deal with migration compatibility.

For example, ping-pong migration between V1 and V2.

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/6] vduse: add v1 API definition
  2025-08-12  2:55                     ` Jason Wang
@ 2025-08-12  6:03                       ` Eugenio Perez Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Eugenio Perez Martin @ 2025-08-12  6:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Cindy Lu, linux-kernel, Stefano Garzarella,
	Stefan Hajnoczi, Maxime Coquelin, Michael S. Tsirkin,
	virtualization, Xuan Zhuo, Laurent Vivier

On Tue, Aug 12, 2025 at 4:55 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 5:02 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 4:58 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Sun, Aug 10, 2025 at 6:18 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Fri, Aug 8, 2025 at 2:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, Aug 7, 2025 at 6:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Jun 10, 2025 at 10:36 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 9, 2025 at 2:11 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jun 9, 2025 at 3:50 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Jun 9, 2025 at 9:41 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 6, 2025 at 7:50 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > This allows to define all functions checking the API version set by the
> > > > > > > > > > > userland device.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > > > >
> > > > > > > > > > It might be worth clarifying how it works.
> > > > > > > > > >
> > > > > > > > > > For example,
> > > > > > > > > >
> > > > > > > > > > 1) would VDUSE behave differently or if it's just some new ioctls
> > > > > > > >
> > > > > > > > I'd like to test more in-depth, but a device can just bump the version
> > > > > > > > ID and then implement the replies to the vduse messages. No need to
> > > > > > > > implement new ioctls. If the VDUSE device sets 0 in either number of
> > > > > > > > ASID or vq groups, the kernel assumes 1.
> > > > > > >
> > > > > > > Right, this is the way we use now and I think maybe we can document
> > > > > > > this somewhere.
> > > > > > >
> > > > > > > >
> > > > > > > > But you have a very good point here, I think it is wise to evaluate
> > > > > > > > the shortcut of these messages in the VDUSE kernel module. If a VDUSE
> > > > > > > > device only has one vq group and one ASID, it can always return group
> > > > > > > > 0 and asid 0 for everything, and fail every try to set asid != 0.
> > > > > > >
> > > > > > > Yes, and vhost-vDPA needs to guard against the misconfiguration.
> > > > > > >
> > > > > > > > This
> > > > > > > > way, the update is transparent for the VDUSE device, and future
> > > > > > > > devices do not need to implement the reply of these. What do you
> > > > > > > > think?
> > > > > > >
> > > > > > > This should work.
> > > > > > >
> > > > > > > >
> > > > > > > > > > 2) If VDUSE behave differently, do we need a ioctl to set the API
> > > > > > > > > > version for backward compatibility?
> > > > > > > > >
> > > > > > > > > Speak too fast, there's a VDUSE_SET_API_VERSION actually.
> > > > > > > > >
> > > > > > > > > I think we need to think if it complicates the migration compatibility or not.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Do you mean migration as "increase the VDUSE version number", not "VM
> > > > > > > > live migration from vduse version 0 to vduse version 1", isn't it? The
> > > > > > > > second should not have any problem but I haven't tested it.
> > > > > > >
> > > > > > > I mean if we bump the version, we can't migrate from version 1 to
> > > > > > > version 0. Or we can offload this to the management (do we need to
> > > > > > > extend the vdpa tool for this)?
> > > > > > >
> > > > > >
> > > > > > I just noticed I left this unreplied. But I still do not get what
> > > > > > migrate means here :).
> > > > > >
> > > > > > If migrate means to run current VDUSE devices on kernel with this
> > > > > > series applied these devices don't set V1 API so they have one vq
> > > > > > group, and one asid. I'm actually testing this with my libfuse+VDUSE
> > > > > > modifications that don't use V1 at all. Adding this explanation to the
> > > > > > patch as it is a very good point indeed.
> > > > >
> > > > > Right.
> > > > >
> > > > > >
> > > > > > If it means to migrate a guest from using a V1 VDUSE device to a V0
> > > > > > device "it should work", as it is just a backend implementation
> > > > > > detail.
> > > > >
> > > > > For example src is the VDUSE with multiqueue support (v1) but dest
> > > > > doesn't have this support (v0). I think the qemu should fail to launch
> > > > > in dest.
> > > > >
> > > > > > If we migrate from or to a vdpa device backed by hardware, for
> > > > > > example, one of the devices does not even have the concept of VDUSE
> > > > > > API version.
> > > > > >
> > > > > > In the case of net, it does not work at the moment because the only
> > > > > > way to set features like mq are through the shadow CVQ.
> > > > >
> > > > > I think you mean qemu should fail, I'm not sure this is friendly to libvirt.
> > > > >
> > > >
> > > > No, I think QEMU should not transmit vdpa backend properties not
> > > > visible to the guest, so we don't get an explosion of properties that
> > > > are hard to get. Expanding on this, QEMU is not even able to know if
> > > > it is talking with VDUSE, vdpa_sim, or a vdpa device backed by
> > > > hardware. And I think we don't want QEMU to have this information. So
> > > > sending the VDUSE version implies removing a lot of useful
> > > > abstractions.
> > > >
> > > > In the case of net, the destination QEMU should fail if it is not able
> > > > to restore the device state. At this moment this implies to have at
> > > > least two ASID if the net device has CVQ, and that CVQ is in its own
> > > > ASID, but that may not be true in the future.
> > > >
> > > > But QEMU does not check if that is the case migrating between two
> > > > vdpa_net_sim if one supports ASID but the other doesn't.
> > >
> > > Ok I think I must miss something. I need some context here. For
> > > example, Is the shadow cvq option used by libvirt or not? (Or it has
> > > been enabled by default if Qemu detect cvq has its own group?)
> > >
> > > If shadow cvq is neither used by libvirt nor automatically enabled, we
> > > need to make it work for libvirt first.
> > >
> > > If it relies on libvirt to enable it explicitly, is libvirt expected
> > > to detect the vDPA ability (I guess it is not what libvirt wants)?
> > > If shadow cvq is enabled automatically, migrating from V2 to V1 is
> > > fine but not the reverse.
> > >
> >
> > QEMU uses shadow CVQ automatically if all the conditions (ASID, proper
> > set of features, etc) are supported.
> >
>
> Ok, so V1 implies non-migratable.  So I'm still confused about how to
> deal with migration compatibility.
>
> For example, ping-pong migration between V1 and V2.
>

If the device does not have CVQ, everything works.

If the device has a CVQ, it is not possible at the moment. Everything
breaks the same way as if the destination vhost_vdpa does not have the
ASID capability, for example if the destination device is an Intel,
Pensando, or SolidRun vdpa device.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-08-12  6:04 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-06 11:50 [RFC 0/6] Add multiple address spaces support to VDUSE Eugenio Pérez
2025-06-06 11:50 ` [RFC 1/6] vduse: add v1 API definition Eugenio Pérez
2025-06-09  1:41   ` Jason Wang
2025-06-09  1:50     ` Jason Wang
2025-06-09  6:10       ` Eugenio Perez Martin
2025-06-10  8:35         ` Jason Wang
2025-08-07 10:55           ` Eugenio Perez Martin
2025-08-08  0:50             ` Jason Wang
2025-08-10 10:18               ` Eugenio Perez Martin
2025-08-11  2:58                 ` Jason Wang
2025-08-11  9:01                   ` Eugenio Perez Martin
2025-08-12  2:55                     ` Jason Wang
2025-08-12  6:03                       ` Eugenio Perez Martin
2025-06-06 11:50 ` [RFC 2/6] vduse: add vq group support Eugenio Pérez
2025-06-09  1:55   ` Jason Wang
2025-06-09  6:03     ` Eugenio Perez Martin
2025-06-10  8:53       ` Jason Wang
2025-06-06 11:50 ` [RFC 3/6] vduse: add vq group asid support Eugenio Pérez
2025-06-12  0:30   ` Jason Wang
2025-06-12  7:24     ` Eugenio Perez Martin
2025-06-13  1:21       ` Jason Wang
2025-08-07 11:10         ` Eugenio Perez Martin
2025-06-06 11:50 ` [RFC 4/6] vduse: send update_iotlb_v2 message Eugenio Pérez
2025-06-06 11:50 ` [RFC 5/6] vduse: reset group asid in reset Eugenio Pérez
2025-06-12  0:32   ` Jason Wang
2025-06-12  7:24     ` Eugenio Perez Martin
2025-06-06 11:50 ` [RFC 6/6] vduse: bump version number Eugenio Pérez
2025-06-09  5:51 ` [RFC 0/6] Add multiple address spaces support to VDUSE Christoph Hellwig
2025-06-20  6:25   ` Eugenio Perez Martin
2025-06-23  5:09     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).