virtualization.lists.linux-foundation.org archive mirror
* [PATCH v10 0/8] Add multiple address spaces support to VDUSE
@ 2025-12-17 11:24 Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 1/8] vduse: add v1 API definition Eugenio Pérez
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

When used by the vhost-vDPA bus driver for a VM, the control virtqueue
should be shadowed by the userspace VMM (QEMU) instead of being assigned
directly to the guest. This is because QEMU needs to know the device state
in order to start and stop the device correctly (e.g. for Live Migration).

This requires isolating the memory mapping for the control virtqueue
presented by vhost-vDPA, to prevent the guest from accessing it directly.

This series adds support for multiple address spaces to the VDUSE device,
allowing selective virtqueue isolation through address space IDs (ASIDs).

The VDUSE device needs to report:
* Number of virtqueue groups
* The vq group that each virtqueue belongs to
* Number of address spaces supported.

Then, the vDPA driver can modify the ASID assigned to each VQ group to
isolate the memory address spaces.  This aligns VDUSE with vdpa_sim and
NVIDIA mlx5 devices, which already support ASIDs.

This helps to isolate the environments for the virtqueues that will not
be assigned directly to the guest, e.g. the control virtqueue in the case
of virtio-net.
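
As a rough, hypothetical sketch (not part of the series itself), a v1
userspace device could declare its groups and address spaces at creation
time like this; the name, queue counts and feature bits are made up, and
the v1 API is assumed to be already negotiated on the control fd:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vduse.h>
#include <linux/virtio_ids.h>
#include <linux/virtio_config.h>

/* Hypothetical sketch: create a net device declaring 3 vq groups and
 * 2 address spaces (the ngroups and nas fields come from this series).
 */
static int create_dev(int control_fd)
{
	char buf[sizeof(struct vduse_dev_config) + 256] = {};
	struct vduse_dev_config *config = (void *)buf;

	strncpy(config->name, "vduse-net0", VDUSE_NAME_MAX - 1);
	config->device_id = VIRTIO_ID_NET;
	config->features = (1ULL << VIRTIO_F_VERSION_1) |
			   (1ULL << VIRTIO_F_ACCESS_PLATFORM);
	config->vq_num = 3;		/* rx, tx, cvq */
	config->vq_align = 4096;
	config->ngroups = 3;		/* dataplane, cvq, shadowed vrings */
	config->nas = 2;		/* guest AS + isolated CVQ AS */
	config->config_size = 256;

	return ioctl(control_fd, VDUSE_CREATE_DEV, config);
}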

Also, to be able to test this series, the user needs to manually revert
56e71885b034 ("vduse: Temporarily fail if control queue feature requested").

Tested by creating a VDUSE device attached to OVS, with and without MQ,
and live-migrating between two hosts back and forth while keeping a ping
alive through all the stages.  All tested with and without lockdep.

PATCH v9:
* Change to RCU.

PATCH v8:
* Revert the change from mutex to rwlock (MST).

PATCH v7:
* Fix not taking the write lock in the registering vdpa device error path
  (Jason).

PATCH v6:
* Make vdpa_dev_add use gotos for error handling (MST).
* s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
  in group and nas handling at device creation (MST).
* Fix struct name not matching in the doc.
* s/sepparate/separate (MST).

PATCH v5:
* Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
  ioctl (Jason).
* Properly set domain bounce size to divide equally between nas (Jason).
* Revert core vdpa changes (Jason).
* Fix group == ngroup case in checking VQ_SETUP argument (Jason).
* Exclude "padding" member from the only >V1 members in
  vduse_dev_request.

PATCH v4:
* Consider config->nas == 0 and config->ngroups == 0 as a fail (Jason).
* Revert the "invalid vq group" concept and assume 0 if not set.
* Divide the device bounce size equally between each domain (Jason).
* Revert unneeded addr = NULL assignment (Jason)
* Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
  return; } (Jason)
* Change a bad multiline comment, using @ character instead of * (Jason).

PATCH v3:
* Make the default group an invalid group as long as VDUSE device does
  not set it to some valid u32 value.  Modify the vdpa core to take that
  into account (Jason).  Adapt all the virtio_map_ops callbacks to it.
* Make setting status DRIVER_OK fail if vq group is not valid.
* Create the VDUSE_DEV_MAX_GROUPS and VDUSE_DEV_MAX_AS instead of using a magic
  number
* Remove the _int name suffix from struct vduse_vq_group.
* Get the vduse domain through the vduse_as in the map functions (Jason).
* Squash the patch implementing the AS logic with the patch creating the
  vduse_as struct (Jason).

PATCH v2:
* Now the vq group is in vduse_vq_config struct instead of issuing one
  VDUSE message per vq.
* Convert the use of mutex to rwlock (Xie Yongji).

PATCH v1:
* Fix: Remove BIT_ULL(VIRTIO_S_*), as _S_ is already the bit (Maxime)
* Using vduse_vq_group_int directly instead of an empty struct in union
  virtio_map.

RFC v3:
* Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
  value to reduce memory consumption, but vqs are already limited to
  that value and userspace VDUSE is able to allocate that many vqs.  Also, it's
  a dynamic array now.  Same with ASID.
* Move the valid vq groups range check to vduse_validate_config.
* Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
* Use of array_index_nospec in VDUSE device ioctls.
* Move the umem mutex to asid struct so there is no contention between
  ASIDs.
* Remove the descs vq group capability as it will not be used and we can
  add it on top.
* Do not ask for vq groups in number of vq groups < 2.
* Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
  VDUSE_IOTLB_GET_INFO.

RFC v2:
* Cache group information in kernel, as we need to provide the vq map
  tokens properly.
* Add descs vq group to optimize SVQ forwarding and support indirect
  descriptors out of the box.
* Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
  part of the struct is the same.
* Fixes detected testing with OVS+VDUSE.

Eugenio Pérez (8):
  vduse: add v1 API definition
  vduse: add vq group support
  vduse: return internal vq group struct as map token
  vduse: refactor vdpa_dev_add for goto err handling
  vduse: remove unused vaddr parameter of vduse_domain_free_coherent
  vduse: take out allocations from vduse_dev_alloc_coherent
  vduse: add vq group asid support
  vduse: bump version number

 drivers/vdpa/vdpa_user/iova_domain.c |  10 +-
 drivers/vdpa/vdpa_user/iova_domain.h |   5 +-
 drivers/vdpa/vdpa_user/vduse_dev.c   | 469 +++++++++++++++++++++------
 include/linux/virtio.h               |   6 +-
 include/uapi/linux/vduse.h           |  67 +++-
 5 files changed, 434 insertions(+), 123 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v10 1/8] vduse: add v1 API definition
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 2/8] vduse: add vq group support Eugenio Pérez
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

This allows the kernel to detect whether the userspace VDUSE device
supports the VQ group and ASID features.  VDUSE devices that don't set
the V1 API will not receive the new messages, and the vdpa device will be
created with only one vq group and ASID.

The next patches implement the new features incrementally, only enabling
the VDUSE device to set the V1 API version at the end of the series.
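
As an aside, here is a minimal, hedged sketch of the negotiation this
enables, using the existing VDUSE_SET_API_VERSION ioctl on the control
fd (an older kernel simply rejects the new version):

#include <sys/ioctl.h>
#include <linux/vduse.h>

/* Sketch: opt in to the v1 API; returns the version actually in use. */
static __u64 negotiate_api(int control_fd)
{
	__u64 version = VDUSE_API_VERSION_1;

	if (ioctl(control_fd, VDUSE_SET_API_VERSION, &version) == 0)
		return VDUSE_API_VERSION_1;

	return VDUSE_API_VERSION;	/* stay on the original API */
}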

Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/uapi/linux/vduse.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 10ad71aa00d6..ccb92a1efce0 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -10,6 +10,10 @@
 
 #define VDUSE_API_VERSION	0
 
+/* VQ groups and ASID support */
+
+#define VDUSE_API_VERSION_1	1
+
 /*
  * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
  * This is used for future extension.
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 2/8] vduse: add vq group support
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 1/8] vduse: add v1 API definition Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-18  6:46   ` Jason Wang
  2025-12-17 11:24 ` [PATCH v10 3/8] vduse: return internal vq group struct as map token Eugenio Pérez
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

This allows separating the different virtqueues into groups that share
the same address space.  The VDUSE device is asked for each virtqueue's
group at initialization time, as the groups are needed for the DMA API.

Allocate 3 vq groups, as net is the device that needs the most groups:
* Dataplane (guest passthrough)
* CVQ
* Shadowed vrings.

Future versions of the series can include dynamic allocation of the
groups array so VDUSE can declare more groups.
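
For illustration, a hedged sketch of how a v1 device could place its
control virtqueue in its own group via the extended VDUSE_VQ_SETUP
ioctl (the index, size and group values are made up):

#include <sys/ioctl.h>
#include <linux/vduse.h>

/* Sketch: assign vq 2 (e.g. a net control vq) to vq group 1.  The
 * kernel rejects this once the driver has set DRIVER_OK.
 */
static int setup_cvq(int dev_fd)
{
	struct vduse_vq_config vq_config = {
		.index = 2,
		.max_size = 256,
		.group = 1,	/* new field, valid with API >= 1 */
	};

	return ioctl(dev_fd, VDUSE_VQ_SETUP, &vq_config);
}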

Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
v6:
* s/sepparate/separate (MST).
* s/dev->api_version < 1/dev->api_version < VDUSE_API_VERSION_1

v5:
* Revert core vdpa changes (Jason).
* Fix group == ngroup case in checking VQ_SETUP argument (Jason).

v4:
* Revert the "invalid vq group" concept and assume 0 if not set (Jason).
* Make config->ngroups == 0 invalid (Jason).

v3:
* Make the default group an invalid group as long as VDUSE device does
  not set it to some valid u32 value.  Modify the vdpa core to take that
  into account (Jason).
* Create the VDUSE_DEV_MAX_GROUPS instead of using a magic number

v2:
* Now the vq group is in vduse_vq_config struct instead of issuing one
  VDUSE message per vq.

v1:
* Fix: Remove BIT_ULL(VIRTIO_S_*), as _S_ is already the bit (Maxime)

RFC v3:
* Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason).  It was set to a lower
  value to reduce memory consumption, but vqs are already limited to
  that value and userspace VDUSE is able to allocate that many vqs.
* Remove the descs vq group capability as it will not be used and we can
  add it on top.
* Do not ask for vq groups in number of vq groups < 2.
* Move the valid vq groups range check to vduse_validate_config.

RFC v2:
* Cache group information in kernel, as we need to provide the vq map
  tokens properly.
* Add descs vq group to optimize SVQ forwarding and support indirect
  descriptors out of the box.
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 48 ++++++++++++++++++++++++++----
 include/uapi/linux/vduse.h         | 12 ++++++--
 2 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index ae357d014564..b012dc3557b9 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -39,6 +39,7 @@
 #define DRV_LICENSE  "GPL v2"
 
 #define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_DEV_MAX_GROUPS 0xffff
 #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
 #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
 #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
@@ -58,6 +59,7 @@ struct vduse_virtqueue {
 	struct vdpa_vq_state state;
 	bool ready;
 	bool kicked;
+	u32 vq_group;
 	spinlock_t kick_lock;
 	spinlock_t irq_lock;
 	struct eventfd_ctx *kickfd;
@@ -114,6 +116,7 @@ struct vduse_dev {
 	u8 status;
 	u32 vq_num;
 	u32 vq_align;
+	u32 ngroups;
 	struct vduse_umem *umem;
 	struct mutex mem_lock;
 	unsigned int bounce_size;
@@ -455,6 +458,7 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 		vq->driver_addr = 0;
 		vq->device_addr = 0;
 		vq->num = 0;
+		vq->vq_group = 0;
 		memset(&vq->state, 0, sizeof(vq->state));
 
 		spin_lock(&vq->kick_lock);
@@ -592,6 +596,16 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
 	return 0;
 }
 
+static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	if (dev->api_version < VDUSE_API_VERSION_1)
+		return 0;
+
+	return dev->vqs[idx]->vq_group;
+}
+
 static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
 				struct vdpa_vq_state *state)
 {
@@ -789,6 +803,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.set_vq_cb		= vduse_vdpa_set_vq_cb,
 	.set_vq_num             = vduse_vdpa_set_vq_num,
 	.get_vq_size		= vduse_vdpa_get_vq_size,
+	.get_vq_group		= vduse_get_vq_group,
 	.set_vq_ready		= vduse_vdpa_set_vq_ready,
 	.get_vq_ready		= vduse_vdpa_get_vq_ready,
 	.set_vq_state		= vduse_vdpa_set_vq_state,
@@ -1252,12 +1267,24 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		if (config.index >= dev->vq_num)
 			break;
 
-		if (!is_mem_zero((const char *)config.reserved,
-				 sizeof(config.reserved)))
+		if (dev->api_version < VDUSE_API_VERSION_1 && config.group)
+			break;
+
+		if (dev->api_version >= VDUSE_API_VERSION_1) {
+			if (config.group >= dev->ngroups)
+				break;
+			if (dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
+				break;
+		}
+
+		if (config.reserved1 ||
+		    !is_mem_zero((const char *)config.reserved2,
+				 sizeof(config.reserved2)))
 			break;
 
 		index = array_index_nospec(config.index, dev->vq_num);
 		dev->vqs[index]->num_max = config.max_size;
+		dev->vqs[index]->vq_group = config.group;
 		ret = 0;
 		break;
 	}
@@ -1737,12 +1764,20 @@ static bool features_is_valid(struct vduse_dev_config *config)
 	return true;
 }
 
-static bool vduse_validate_config(struct vduse_dev_config *config)
+static bool vduse_validate_config(struct vduse_dev_config *config,
+				  u64 api_version)
 {
 	if (!is_mem_zero((const char *)config->reserved,
 			 sizeof(config->reserved)))
 		return false;
 
+	if (api_version < VDUSE_API_VERSION_1 && config->ngroups)
+		return false;
+
+	if (api_version >= VDUSE_API_VERSION_1 &&
+	    (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS))
+		return false;
+
 	if (config->vq_align > PAGE_SIZE)
 		return false;
 
@@ -1858,6 +1893,9 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 	dev->device_features = config->features;
 	dev->device_id = config->device_id;
 	dev->vendor_id = config->vendor_id;
+	dev->ngroups = (dev->api_version < VDUSE_API_VERSION_1)
+		       ? 1
+		       : config->ngroups;
 	dev->name = kstrdup(config->name, GFP_KERNEL);
 	if (!dev->name)
 		goto err_str;
@@ -1936,7 +1974,7 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
 			break;
 
 		ret = -EINVAL;
-		if (vduse_validate_config(&config) == false)
+		if (!vduse_validate_config(&config, control->api_version))
 			break;
 
 		buf = vmemdup_user(argp + size, config.config_size);
@@ -2017,7 +2055,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
 
 	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
 				 &vduse_vdpa_config_ops, &vduse_map_ops,
-				 1, 1, name, true);
+				 dev->ngroups, 1, name, true);
 	if (IS_ERR(vdev))
 		return PTR_ERR(vdev);
 
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index ccb92a1efce0..a3d51cf6df3a 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -31,6 +31,7 @@
  * @features: virtio features
  * @vq_num: the number of virtqueues
  * @vq_align: the allocation alignment of virtqueue's metadata
+ * @ngroups: number of vq groups that VDUSE device declares
  * @reserved: for future use, needs to be initialized to zero
  * @config_size: the size of the configuration space
  * @config: the buffer of the configuration space
@@ -45,7 +46,8 @@ struct vduse_dev_config {
 	__u64 features;
 	__u32 vq_num;
 	__u32 vq_align;
-	__u32 reserved[13];
+	__u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
+	__u32 reserved[12];
 	__u32 config_size;
 	__u8 config[];
 };
@@ -122,14 +124,18 @@ struct vduse_config_data {
  * struct vduse_vq_config - basic configuration of a virtqueue
  * @index: virtqueue index
  * @max_size: the max size of virtqueue
- * @reserved: for future use, needs to be initialized to zero
+ * @reserved1: for future use, needs to be initialized to zero
+ * @group: virtqueue group
+ * @reserved2: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_VQ_SETUP ioctl to setup a virtqueue.
  */
 struct vduse_vq_config {
 	__u32 index;
 	__u16 max_size;
-	__u16 reserved[13];
+	__u16 reserved1;
+	__u32 group;
+	__u16 reserved2[10];
 };
 
 /*
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 3/8] vduse: return internal vq group struct as map token
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 1/8] vduse: add v1 API definition Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 2/8] vduse: add vq group support Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 4/8] vduse: refactor vdpa_dev_add for goto err handling Eugenio Pérez
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

Return the internal struct that represents the vq group as the virtqueue
map token, instead of the device.  This allows the map functions to access
the information of each group.

At this moment all the virtqueues share the same vq group, which can only
point to ASID 0.  This change prepares the infrastructure for actual
per-group address space handling.
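
Schematically, the map callbacks now resolve the domain as token ->
group -> device -> domain instead of reading it from the token itself;
a condensed sketch of the lookup pattern the hunks below repeat:

/* Sketch of the lookup done by every map op after this patch. */
static struct vduse_iova_domain *token_to_domain(union virtio_map token)
{
	if (!token.group)	/* no vq group: bail out */
		return NULL;

	return token.group->dev->domain;
}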

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
v4:
* Revert the "invalid vq group" concept, and assume 0 by default.
* Revert unnecessary blank line addition (Jason)

v3:
* Adapt all virtio_map_ops callbacks to handle empty tokens in case of
  invalid groups.
* Make setting status DRIVER_OK fail if vq group is not valid.
* Remove the _int name suffix from struct vduse_vq_group.

RFC v3:
* Make the vq groups a dynamic array to support an arbitrary number of
  them.
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 100 ++++++++++++++++++++++++++---
 include/linux/virtio.h             |   6 +-
 2 files changed, 94 insertions(+), 12 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index b012dc3557b9..d64bcd9ce76e 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -22,6 +22,7 @@
 #include <linux/uio.h>
 #include <linux/vdpa.h>
 #include <linux/nospec.h>
+#include <linux/virtio.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/mm.h>
 #include <uapi/linux/vduse.h>
@@ -85,6 +86,10 @@ struct vduse_umem {
 	struct mm_struct *mm;
 };
 
+struct vduse_vq_group {
+	struct vduse_dev *dev;
+};
+
 struct vduse_dev {
 	struct vduse_vdpa *vdev;
 	struct device *dev;
@@ -118,6 +123,7 @@ struct vduse_dev {
 	u32 vq_align;
 	u32 ngroups;
 	struct vduse_umem *umem;
+	struct vduse_vq_group *groups;
 	struct mutex mem_lock;
 	unsigned int bounce_size;
 	struct mutex domain_lock;
@@ -606,6 +612,17 @@ static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
 	return dev->vqs[idx]->vq_group;
 }
 
+static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	u32 vq_group = vduse_get_vq_group(vdpa, idx);
+	union virtio_map ret = {
+		.group = &dev->groups[vq_group],
+	};
+
+	return ret;
+}
+
 static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
 				struct vdpa_vq_state *state)
 {
@@ -826,6 +843,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.get_vq_affinity	= vduse_vdpa_get_vq_affinity,
 	.reset			= vduse_vdpa_reset,
 	.set_map		= vduse_vdpa_set_map,
+	.get_vq_map		= vduse_get_vq_map,
 	.free			= vduse_vdpa_free,
 };
 
@@ -833,7 +851,14 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
 					     dma_addr_t dma_addr, size_t size,
 					     enum dma_data_direction dir)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
 }
@@ -842,7 +867,14 @@ static void vduse_dev_sync_single_for_cpu(union virtio_map token,
 					     dma_addr_t dma_addr, size_t size,
 					     enum dma_data_direction dir)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
 }
@@ -852,7 +884,14 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
 				     enum dma_data_direction dir,
 				     unsigned long attrs)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return DMA_MAPPING_ERROR;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
 }
@@ -861,7 +900,14 @@ static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
 				 size_t size, enum dma_data_direction dir,
 				 unsigned long attrs)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
 }
@@ -869,11 +915,17 @@ static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
 static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
 				      dma_addr_t *dma_addr, gfp_t flag)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
 	unsigned long iova;
 	void *addr;
 
 	*dma_addr = DMA_MAPPING_ERROR;
+	if (!token.group)
+		return NULL;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 	addr = vduse_domain_alloc_coherent(domain, size,
 					   (dma_addr_t *)&iova, flag);
 	if (!addr)
@@ -888,14 +940,28 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
 				    void *vaddr, dma_addr_t dma_addr,
 				    unsigned long attrs)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
 }
 
 static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return false;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	return dma_addr < domain->bounce_size;
 }
@@ -909,7 +975,14 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
 
 static size_t vduse_dev_max_mapping_size(union virtio_map token)
 {
-	struct vduse_iova_domain *domain = token.iova_domain;
+	struct vduse_dev *vdev;
+	struct vduse_iova_domain *domain;
+
+	if (!token.group)
+		return 0;
+
+	vdev = token.group->dev;
+	domain = vdev->domain;
 
 	return domain->bounce_size;
 }
@@ -1727,6 +1800,7 @@ static int vduse_destroy_dev(char *name)
 	if (dev->domain)
 		vduse_domain_destroy(dev->domain);
 	kfree(dev->name);
+	kfree(dev->groups);
 	vduse_dev_destroy(dev);
 	module_put(THIS_MODULE);
 
@@ -1896,6 +1970,13 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 	dev->ngroups = (dev->api_version < VDUSE_API_VERSION_1)
 		       ? 1
 		       : config->ngroups;
+	dev->groups = kcalloc(dev->ngroups, sizeof(dev->groups[0]),
+			      GFP_KERNEL);
+	if (!dev->groups)
+		goto err_vq_groups;
+	for (u32 i = 0; i < dev->ngroups; ++i)
+		dev->groups[i].dev = dev;
+
 	dev->name = kstrdup(config->name, GFP_KERNEL);
 	if (!dev->name)
 		goto err_str;
@@ -1932,6 +2013,8 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 err_idr:
 	kfree(dev->name);
 err_str:
+	kfree(dev->groups);
+err_vq_groups:
 	vduse_dev_destroy(dev);
 err:
 	return ret;
@@ -2093,7 +2176,6 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
 		return -ENOMEM;
 	}
 
-	dev->vdev->vdpa.vmap.iova_domain = dev->domain;
 	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
 	if (ret) {
 		put_device(&dev->vdev->vdpa.dev);
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 3626eb694728..818363eb1b1d 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -43,13 +43,13 @@ struct virtqueue {
 	void *priv;
 };
 
-struct vduse_iova_domain;
+struct vduse_vq_group;
 
 union virtio_map {
 	/* Device that performs DMA */
 	struct device *dma_dev;
-	/* VDUSE specific mapping data */
-	struct vduse_iova_domain *iova_domain;
+	/* VDUSE specific virtqueue group for doing map */
+	struct vduse_vq_group *group;
 };
 
 int virtqueue_add_outbuf(struct virtqueue *vq,
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 4/8] vduse: refactor vdpa_dev_add for goto err handling
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (2 preceding siblings ...)
  2025-12-17 11:24 ` [PATCH v10 3/8] vduse: return internal vq group struct as map token Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 5/8] vduse: remove unused vaddr parameter of vduse_domain_free_coherent Eugenio Pérez
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

The next patches introduce more error paths in this function.  Refactor
it so they can be accommodated through gotos.

Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
v6: New in v6.
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index d64bcd9ce76e..0390e012c584 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -2172,21 +2172,27 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
 						  dev->bounce_size);
 	mutex_unlock(&dev->domain_lock);
 	if (!dev->domain) {
-		put_device(&dev->vdev->vdpa.dev);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto domain_err;
 	}
 
 	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
 	if (ret) {
-		put_device(&dev->vdev->vdpa.dev);
-		mutex_lock(&dev->domain_lock);
-		vduse_domain_destroy(dev->domain);
-		dev->domain = NULL;
-		mutex_unlock(&dev->domain_lock);
-		return ret;
+		goto register_err;
 	}
 
 	return 0;
+
+register_err:
+	mutex_lock(&dev->domain_lock);
+	vduse_domain_destroy(dev->domain);
+	dev->domain = NULL;
+	mutex_unlock(&dev->domain_lock);
+
+domain_err:
+	put_device(&dev->vdev->vdpa.dev);
+
+	return ret;
 }
 
 static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 5/8] vduse: remove unused vaddr parameter of vduse_domain_free_coherent
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (3 preceding siblings ...)
  2025-12-17 11:24 ` [PATCH v10 4/8] vduse: refactor vdpa_dev_add for goto err handling Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent Eugenio Pérez
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

We will modify this function in the next patches, so let's clean it up first.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 3 +--
 drivers/vdpa/vdpa_user/iova_domain.h | 3 +--
 drivers/vdpa/vdpa_user/vduse_dev.c   | 2 +-
 3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 4352b5cf74f0..309cd5a039d1 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -528,8 +528,7 @@ void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
 }
 
 void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
-				void *vaddr, dma_addr_t dma_addr,
-				unsigned long attrs)
+				dma_addr_t dma_addr, unsigned long attrs)
 {
 	struct iova_domain *iovad = &domain->consistent_iovad;
 	struct vhost_iotlb_map *map;
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index 775cad5238f3..42090cd1a622 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -72,8 +72,7 @@ void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
 				  gfp_t flag);
 
 void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
-				void *vaddr, dma_addr_t dma_addr,
-				unsigned long attrs);
+				dma_addr_t dma_addr, unsigned long attrs);
 
 void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
 
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0390e012c584..a278cff7a4fa 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -949,7 +949,7 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
 	vdev = token.group->dev;
 	domain = vdev->domain;
 
-	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
+	vduse_domain_free_coherent(domain, size, dma_addr, attrs);
 }
 
 static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (4 preceding siblings ...)
  2025-12-17 11:24 ` [PATCH v10 5/8] vduse: remove unused vaddr parameter of vduse_domain_free_coherent Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-18  5:45   ` Jason Wang
  2025-12-17 11:24 ` [PATCH v10 7/8] vduse: add vq group asid support Eugenio Pérez
  2025-12-17 11:24 ` [PATCH v10 8/8] vduse: bump version number Eugenio Pérez
  7 siblings, 1 reply; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

The function vduse_dev_alloc_coherent will be called under a rwlock in
the next patches.  Move the allocation out of the lock to avoid increasing
its failure rate.
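
The resulting pattern is sketched below, shown with the as_lock rwlock
that a later patch introduces (names follow the next patches, so treat
this as an outline rather than the final code):

/* Allocate before taking the rwlock and free on failure after
 * releasing it, so no allocation happens with the lock held.
 */
static void *alloc_outside_lock(struct vduse_vq_group *group, size_t size,
				dma_addr_t *iova, gfp_t flag)
{
	void *orig, *addr;

	orig = alloc_pages_exact(size, flag);	/* may sleep */
	if (!orig)
		return NULL;

	read_lock(&group->as_lock);		/* atomic context */
	addr = vduse_domain_alloc_coherent(group->as->domain, size,
					   iova, orig);
	read_unlock(&group->as_lock);

	if (!addr)
		free_pages_exact(orig, size);	/* error path, lock-free */

	return addr;
}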

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c |  7 ++-----
 drivers/vdpa/vdpa_user/iova_domain.h |  2 +-
 drivers/vdpa/vdpa_user/vduse_dev.c   | 13 +++++++++++--
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 309cd5a039d1..0ae52890518c 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -495,14 +495,13 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
 
 void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
 				  size_t size, dma_addr_t *dma_addr,
-				  gfp_t flag)
+				  void *orig)
 {
 	struct iova_domain *iovad = &domain->consistent_iovad;
 	unsigned long limit = domain->iova_limit;
 	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
-	void *orig = alloc_pages_exact(size, flag);
 
-	if (!iova || !orig)
+	if (!iova)
 		goto err;
 
 	spin_lock(&domain->iotlb_lock);
@@ -519,8 +518,6 @@ void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
 	return orig;
 err:
 	*dma_addr = DMA_MAPPING_ERROR;
-	if (orig)
-		free_pages_exact(orig, size);
 	if (iova)
 		vduse_domain_free_iova(iovad, iova, size);
 
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index 42090cd1a622..86b7c7eadfd0 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -69,7 +69,7 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
 
 void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
 				  size_t size, dma_addr_t *dma_addr,
-				  gfp_t flag);
+				  void *orig);
 
 void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
 				dma_addr_t dma_addr, unsigned long attrs);
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index a278cff7a4fa..767abcb7e375 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -924,16 +924,24 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
 	if (!token.group)
 		return NULL;
 
+	addr = alloc_pages_exact(size, flag);
+	if (!addr)
+		return NULL;
+
 	vdev = token.group->dev;
 	domain = vdev->domain;
 	addr = vduse_domain_alloc_coherent(domain, size,
-					   (dma_addr_t *)&iova, flag);
+					   (dma_addr_t *)&iova, addr);
 	if (!addr)
-		return NULL;
+		goto err;
 
 	*dma_addr = (dma_addr_t)iova;
 
 	return addr;
+
+err:
+	free_pages_exact(addr, size);
+	return NULL;
 }
 
 static void vduse_dev_free_coherent(union virtio_map token, size_t size,
@@ -950,6 +958,7 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
 	domain = vdev->domain;
 
 	vduse_domain_free_coherent(domain, size, dma_addr, attrs);
+	free_pages_exact(vaddr, size);
 }
 
 static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (5 preceding siblings ...)
  2025-12-17 11:24 ` [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  2025-12-18  6:44   ` Jason Wang
  2025-12-17 11:24 ` [PATCH v10 8/8] vduse: bump version number Eugenio Pérez
  7 siblings, 1 reply; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

Add support for assigning Address Space Identifiers (ASIDs) to each VQ
group.  This enables mapping each group into a distinct memory space.

The vq group to ASID association is now protected by a rwlock.  But the
mutex domain_lock keeps protecting the domains of all ASIDs, as some
operations, like the ones related to the bounce buffer size, still
require locking all the ASIDs.
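
On the userspace side, the device request loop has to handle the new
VDUSE_SET_VQ_GROUP_ASID message; a hedged sketch, with field names
following the uapi additions in this patch:

#include <unistd.h>
#include <linux/vduse.h>

/* Sketch: acknowledge a vq group -> ASID change from the kernel. */
static void handle_request(int dev_fd, struct vduse_dev_request *req)
{
	struct vduse_dev_response resp = {
		.request_id = req->request_id,
		.result = VDUSE_REQ_RESULT_OK,
	};

	switch (req->type) {
	case VDUSE_SET_VQ_GROUP_ASID:
		/* From now on, translate the vqs of group
		 * req->vq_group_asid.group through address space
		 * req->vq_group_asid.asid.
		 */
		break;
	default:
		resp.result = VDUSE_REQ_RESULT_FAILED;
		break;
	}

	write(dev_fd, &resp, sizeof(resp));
}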

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

---
Future improvements on top could include performance optimizations, such
as moving to per-ASID locks, or hardening by tracking the ASID or an ASID
hash in unused bits of the DMA address.

Tested virtio_vdpa by manually adding two threads in vduse_set_status:
one of them modifies the ASID of vq group 0 and the other one maps and
unmaps memory continuously.  After a while, the two threads stop and the
usual work continues.

Tested with vhost_vdpa by migrating a VM while pinging over OVS+VDUSE.  A
few workarounds were needed in some parts:
* Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
  the enable message to the userland device.  This will be solved in the
  future.
* Share the suspended state between all vhost devices in QEMU:
  https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
* Implement a fake VDUSE suspend vdpa operation callback that always
  returns true in the kernel.  DPDK suspends the device at the first
  GET_VRING_BASE.
* Remove the CVQ blocker in ASID.

---
v10:
* Back to rwlock version so stronger locks are used.
* Take out allocations from rwlock.
* Forbid changing ASID of a vq group after DRIVER_OK (Jason)
* Remove bad fetching again of domain variable in
  vduse_dev_max_mapping_size (Yongji).
* Remove unused vdev definition in vdpa map_ops callbacks (kernel test
  robot).

v9:
* Replace mutex with rwlock, as the vdpa map_ops can run from atomic
  context.

v8:
* Revert the mutex to rwlock change; it needs proper profiling to
  justify it.

v7:
* Take write lock in the error path (Jason).

v6:
* Make vdpa_dev_add use gotos for error handling (MST).
* s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
  (MST).
* Fix struct name not matching in the doc.

v5:
* Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
  ioctl (Jason).
* Properly set domain bounce size to divide equally between nas (Jason).
* Exclude "padding" member from the only >V1 members in
  vduse_dev_request.

v4:
* Divide the device bounce size equally between each domain (Jason).
* revert unneeded addr = NULL assignment (Jason)
* Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
  return; } (Jason)
* Change a bad multiline comment, using @ character instead of * (Jason).
* Consider config->nas == 0 as a fail (Jason).

v3:
* Get the vduse domain through the vduse_as in the map functions
  (Jason).
* Squash with the patch creating the vduse_as struct (Jason).
* Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
  (Jason)

v2:
* Convert the use of mutex to rwlock.

RFC v3:
* Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
  value to reduce memory consumption, but vqs are already limited to
  that value and userspace VDUSE is able to allocate that many vqs.
* Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
  VDUSE_IOTLB_GET_INFO.
* Use of array_index_nospec in VDUSE device ioctls.
* Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
* Move the umem mutex to asid struct so there is no contention between
  ASIDs.

RFC v2:
* Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
  part of the struct is the same.
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
 include/uapi/linux/vduse.h         |  53 ++++-
 2 files changed, 295 insertions(+), 124 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 767abcb7e375..786ab2378825 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -41,6 +41,7 @@
 
 #define VDUSE_DEV_MAX (1U << MINORBITS)
 #define VDUSE_DEV_MAX_GROUPS 0xffff
+#define VDUSE_DEV_MAX_AS 0xffff
 #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
 #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
 #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
@@ -86,7 +87,15 @@ struct vduse_umem {
 	struct mm_struct *mm;
 };
 
+struct vduse_as {
+	struct vduse_iova_domain *domain;
+	struct vduse_umem *umem;
+	struct mutex mem_lock;
+};
+
 struct vduse_vq_group {
+	rwlock_t as_lock;
+	struct vduse_as *as; /* Protected by as_lock */
 	struct vduse_dev *dev;
 };
 
@@ -94,7 +103,7 @@ struct vduse_dev {
 	struct vduse_vdpa *vdev;
 	struct device *dev;
 	struct vduse_virtqueue **vqs;
-	struct vduse_iova_domain *domain;
+	struct vduse_as *as;
 	char *name;
 	struct mutex lock;
 	spinlock_t msg_lock;
@@ -122,9 +131,8 @@ struct vduse_dev {
 	u32 vq_num;
 	u32 vq_align;
 	u32 ngroups;
-	struct vduse_umem *umem;
+	u32 nas;
 	struct vduse_vq_group *groups;
-	struct mutex mem_lock;
 	unsigned int bounce_size;
 	struct mutex domain_lock;
 };
@@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
 	return vduse_dev_msg_sync(dev, &msg);
 }
 
-static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
 				  u64 start, u64 last)
 {
 	struct vduse_dev_msg msg = { 0 };
@@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
 		return -EINVAL;
 
 	msg.req.type = VDUSE_UPDATE_IOTLB;
-	msg.req.iova.start = start;
-	msg.req.iova.last = last;
+	if (dev->api_version < VDUSE_API_VERSION_1) {
+		msg.req.iova.start = start;
+		msg.req.iova.last = last;
+	} else {
+		msg.req.iova_v2.start = start;
+		msg.req.iova_v2.last = last;
+		msg.req.iova_v2.asid = asid;
+	}
 
 	return vduse_dev_msg_sync(dev, &msg);
 }
@@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
 	return mask;
 }
 
+/* Force set the asid to a vq group without a message to the VDUSE device */
+static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
+				       unsigned int group, unsigned int asid)
+{
+	write_lock(&dev->groups[group].as_lock);
+	dev->groups[group].as = &dev->as[asid];
+	write_unlock(&dev->groups[group].as_lock);
+}
+
 static void vduse_dev_reset(struct vduse_dev *dev)
 {
 	int i;
-	struct vduse_iova_domain *domain = dev->domain;
 
 	/* The coherent mappings are handled in vduse_dev_free_coherent() */
-	if (domain && domain->bounce_map)
-		vduse_domain_reset_bounce_map(domain);
+	for (i = 0; i < dev->nas; i++) {
+		struct vduse_iova_domain *domain = dev->as[i].domain;
+
+		if (domain && domain->bounce_map)
+			vduse_domain_reset_bounce_map(domain);
+	}
+
+	for (i = 0; i < dev->ngroups; i++)
+		vduse_set_group_asid_nomsg(dev, i, 0);
 
 	down_write(&dev->rwsem);
 
@@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
 	return ret;
 }
 
+static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
+				unsigned int asid)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_dev_msg msg = { 0 };
+	int r;
+
+	if (dev->api_version < VDUSE_API_VERSION_1 ||
+	    group >= dev->ngroups || asid >= dev->nas ||
+	    dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
+		return -EINVAL;
+
+	msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
+	msg.req.vq_group_asid.group = group;
+	msg.req.vq_group_asid.asid = asid;
+
+	r = vduse_dev_msg_sync(dev, &msg);
+	if (r < 0)
+		return r;
+
+	vduse_set_group_asid_nomsg(dev, group, asid);
+	return 0;
+}
+
 static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
 				struct vdpa_vq_state *state)
 {
@@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
 	int ret;
 
-	ret = vduse_domain_set_map(dev->domain, iotlb);
+	ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
 	if (ret)
 		return ret;
 
-	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+	ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
 	if (ret) {
-		vduse_domain_clear_map(dev->domain, iotlb);
+		vduse_domain_clear_map(dev->as[asid].domain, iotlb);
 		return ret;
 	}
 
@@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.get_vq_affinity	= vduse_vdpa_get_vq_affinity,
 	.reset			= vduse_vdpa_reset,
 	.set_map		= vduse_vdpa_set_map,
+	.set_group_asid		= vduse_set_group_asid,
 	.get_vq_map		= vduse_get_vq_map,
 	.free			= vduse_vdpa_free,
 };
@@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
 					     dma_addr_t dma_addr, size_t size,
 					     enum dma_data_direction dir)
 {
-	struct vduse_dev *vdev;
 	struct vduse_iova_domain *domain;
 
 	if (!token.group)
 		return;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
-
+	read_lock(&token.group->as_lock);
+	domain = token.group->as->domain;
 	vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
+	read_unlock(&token.group->as_lock);
 }
 
 static void vduse_dev_sync_single_for_cpu(union virtio_map token,
 					     dma_addr_t dma_addr, size_t size,
 					     enum dma_data_direction dir)
 {
-	struct vduse_dev *vdev;
 	struct vduse_iova_domain *domain;
 
 	if (!token.group)
 		return;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
-
+	read_lock(&token.group->as_lock);
+	domain = token.group->as->domain;
 	vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
+	read_unlock(&token.group->as_lock);
 }
 
 static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
@@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
 				     enum dma_data_direction dir,
 				     unsigned long attrs)
 {
-	struct vduse_dev *vdev;
 	struct vduse_iova_domain *domain;
+	dma_addr_t r;
 
 	if (!token.group)
 		return DMA_MAPPING_ERROR;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
+	read_lock(&token.group->as_lock);
+	domain = token.group->as->domain;
+	r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+	read_unlock(&token.group->as_lock);
 
-	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+	return r;
 }
 
 static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
 				 size_t size, enum dma_data_direction dir,
 				 unsigned long attrs)
 {
-	struct vduse_dev *vdev;
 	struct vduse_iova_domain *domain;
 
 	if (!token.group)
 		return;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
-
-	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+	read_lock(&token.group->as_lock);
+	domain = token.group->as->domain;
+	vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+	read_unlock(&token.group->as_lock);
 }
 
 static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
 				      dma_addr_t *dma_addr, gfp_t flag)
 {
-	struct vduse_dev *vdev;
 	struct vduse_iova_domain *domain;
 	unsigned long iova;
 	void *addr;
@@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
 	if (!addr)
 		return NULL;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
+	*dma_addr = (dma_addr_t)iova;
+	read_lock(&token.group->as_lock);
+	domain = token.group->as->domain;
 	addr = vduse_domain_alloc_coherent(domain, size,
 					   (dma_addr_t *)&iova, addr);
 	if (!addr)
 		goto err;
 
 	*dma_addr = (dma_addr_t)iova;
+	read_unlock(&token.group->as_lock);
 
 	return addr;
 
 err:
+	read_unlock(&token.group->as_lock);
 	free_pages_exact(addr, size);
 	return NULL;
 }
@@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
 				    void *vaddr, dma_addr_t dma_addr,
 				    unsigned long attrs)
 {
-	struct vduse_dev *vdev;
 	struct vduse_iova_domain *domain;
 
 	if (!token.group)
 		return;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
-
+	read_lock(&token.group->as_lock);
+	domain = token.group->as->domain;
 	vduse_domain_free_coherent(domain, size, dma_addr, attrs);
+	read_unlock(&token.group->as_lock);
 	free_pages_exact(vaddr, size);
 }
 
 static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
 {
-	struct vduse_dev *vdev;
-	struct vduse_iova_domain *domain;
+	size_t bounce_size;
 
 	if (!token.group)
 		return false;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
+	read_lock(&token.group->as_lock);
+	bounce_size = token.group->as->domain->bounce_size;
+	read_unlock(&token.group->as_lock);
 
-	return dma_addr < domain->bounce_size;
+	return dma_addr < bounce_size;
 }
 
 static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
@@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
 
 static size_t vduse_dev_max_mapping_size(union virtio_map token)
 {
-	struct vduse_dev *vdev;
-	struct vduse_iova_domain *domain;
+	size_t bounce_size;
 
 	if (!token.group)
 		return 0;
 
-	vdev = token.group->dev;
-	domain = vdev->domain;
+	read_lock(&token.group->as_lock);
+	bounce_size = token.group->as->domain->bounce_size;
+	read_unlock(&token.group->as_lock);
 
-	return domain->bounce_size;
+	return bounce_size;
 }
 
 static const struct virtio_map_ops vduse_map_ops = {
@@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 	return ret;
 }
 
-static int vduse_dev_dereg_umem(struct vduse_dev *dev,
+static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
 				u64 iova, u64 size)
 {
 	int ret;
 
-	mutex_lock(&dev->mem_lock);
+	mutex_lock(&dev->as[asid].mem_lock);
 	ret = -ENOENT;
-	if (!dev->umem)
+	if (!dev->as[asid].umem)
 		goto unlock;
 
 	ret = -EINVAL;
-	if (!dev->domain)
+	if (!dev->as[asid].domain)
 		goto unlock;
 
-	if (dev->umem->iova != iova || size != dev->domain->bounce_size)
+	if (dev->as[asid].umem->iova != iova ||
+	    size != dev->as[asid].domain->bounce_size)
 		goto unlock;
 
-	vduse_domain_remove_user_bounce_pages(dev->domain);
-	unpin_user_pages_dirty_lock(dev->umem->pages,
-				    dev->umem->npages, true);
-	atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
-	mmdrop(dev->umem->mm);
-	vfree(dev->umem->pages);
-	kfree(dev->umem);
-	dev->umem = NULL;
+	vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
+	unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
+				    dev->as[asid].umem->npages, true);
+	atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
+	mmdrop(dev->as[asid].umem->mm);
+	vfree(dev->as[asid].umem->pages);
+	kfree(dev->as[asid].umem);
+	dev->as[asid].umem = NULL;
 	ret = 0;
 unlock:
-	mutex_unlock(&dev->mem_lock);
+	mutex_unlock(&dev->as[asid].mem_lock);
 	return ret;
 }
 
 static int vduse_dev_reg_umem(struct vduse_dev *dev,
-			      u64 iova, u64 uaddr, u64 size)
+			      u32 asid, u64 iova, u64 uaddr, u64 size)
 {
 	struct page **page_list = NULL;
 	struct vduse_umem *umem = NULL;
@@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 	unsigned long npages, lock_limit;
 	int ret;
 
-	if (!dev->domain || !dev->domain->bounce_map ||
-	    size != dev->domain->bounce_size ||
+	if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
+	    size != dev->as[asid].domain->bounce_size ||
 	    iova != 0 || uaddr & ~PAGE_MASK)
 		return -EINVAL;
 
-	mutex_lock(&dev->mem_lock);
+	mutex_lock(&dev->as[asid].mem_lock);
 	ret = -EEXIST;
-	if (dev->umem)
+	if (dev->as[asid].umem)
 		goto unlock;
 
 	ret = -ENOMEM;
@@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 		goto out;
 	}
 
-	ret = vduse_domain_add_user_bounce_pages(dev->domain,
+	ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
 						 page_list, pinned);
 	if (ret)
 		goto out;
@@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 	umem->mm = current->mm;
 	mmgrab(current->mm);
 
-	dev->umem = umem;
+	dev->as[asid].umem = umem;
 out:
 	if (ret && pinned > 0)
 		unpin_user_pages(page_list, pinned);
@@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 		vfree(page_list);
 		kfree(umem);
 	}
-	mutex_unlock(&dev->mem_lock);
+	mutex_unlock(&dev->as[asid].mem_lock);
 	return ret;
 }
 
@@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 
 	switch (cmd) {
 	case VDUSE_IOTLB_GET_FD: {
-		struct vduse_iotlb_entry entry;
+		struct vduse_iotlb_entry_v2 entry;
 		struct vhost_iotlb_map *map;
 		struct vdpa_map_file *map_file;
 		struct file *f = NULL;
+		u32 asid;
 
 		ret = -EFAULT;
-		if (copy_from_user(&entry, argp, sizeof(entry)))
-			break;
+		if (dev->api_version >= VDUSE_API_VERSION_1) {
+			if (copy_from_user(&entry, argp, sizeof(entry)))
+				break;
+		} else {
+			entry.asid = 0;
+			if (copy_from_user(&entry.v1, argp,
+					   sizeof(entry.v1)))
+				break;
+		}
 
 		ret = -EINVAL;
-		if (entry.start > entry.last)
+		if (entry.v1.start > entry.v1.last)
+			break;
+
+		if (entry.asid >= dev->nas)
 			break;
 
 		mutex_lock(&dev->domain_lock);
-		if (!dev->domain) {
+		asid = array_index_nospec(entry.asid, dev->nas);
+		if (!dev->as[asid].domain) {
 			mutex_unlock(&dev->domain_lock);
 			break;
 		}
-		spin_lock(&dev->domain->iotlb_lock);
-		map = vhost_iotlb_itree_first(dev->domain->iotlb,
-					      entry.start, entry.last);
+		spin_lock(&dev->as[asid].domain->iotlb_lock);
+		map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
+					      entry.v1.start, entry.v1.last);
 		if (map) {
 			map_file = (struct vdpa_map_file *)map->opaque;
 			f = get_file(map_file->file);
-			entry.offset = map_file->offset;
-			entry.start = map->start;
-			entry.last = map->last;
-			entry.perm = map->perm;
+			entry.v1.offset = map_file->offset;
+			entry.v1.start = map->start;
+			entry.v1.last = map->last;
+			entry.v1.perm = map->perm;
 		}
-		spin_unlock(&dev->domain->iotlb_lock);
+		spin_unlock(&dev->as[asid].domain->iotlb_lock);
 		mutex_unlock(&dev->domain_lock);
 		ret = -EINVAL;
 		if (!f)
 			break;
 
-		ret = -EFAULT;
-		if (copy_to_user(argp, &entry, sizeof(entry))) {
+		if (dev->api_version >= VDUSE_API_VERSION_1)
+			ret = copy_to_user(argp, &entry,
+					   sizeof(entry));
+		else
+			ret = copy_to_user(argp, &entry.v1,
+					   sizeof(entry.v1));
+
+		if (ret) {
+			ret = -EFAULT;
 			fput(f);
 			break;
 		}
-		ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
+		ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
 		fput(f);
 		break;
 	}
@@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 	}
 	case VDUSE_IOTLB_REG_UMEM: {
 		struct vduse_iova_umem umem;
+		u32 asid;
 
 		ret = -EFAULT;
 		if (copy_from_user(&umem, argp, sizeof(umem)))
@@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 
 		ret = -EINVAL;
 		if (!is_mem_zero((const char *)umem.reserved,
-				 sizeof(umem.reserved)))
+				 sizeof(umem.reserved)) ||
+		    (dev->api_version < VDUSE_API_VERSION_1 &&
+		     umem.asid != 0) || umem.asid >= dev->nas)
 			break;
 
 		mutex_lock(&dev->domain_lock);
-		ret = vduse_dev_reg_umem(dev, umem.iova,
+		asid = array_index_nospec(umem.asid, dev->nas);
+		ret = vduse_dev_reg_umem(dev, asid, umem.iova,
 					 umem.uaddr, umem.size);
 		mutex_unlock(&dev->domain_lock);
 		break;
 	}
 	case VDUSE_IOTLB_DEREG_UMEM: {
 		struct vduse_iova_umem umem;
+		u32 asid;
 
 		ret = -EFAULT;
 		if (copy_from_user(&umem, argp, sizeof(umem)))
@@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 
 		ret = -EINVAL;
 		if (!is_mem_zero((const char *)umem.reserved,
-				 sizeof(umem.reserved)))
+				 sizeof(umem.reserved)) ||
+		    (dev->api_version < VDUSE_API_VERSION_1 &&
+		     umem.asid != 0) ||
+		     umem.asid >= dev->nas)
 			break;
+
 		mutex_lock(&dev->domain_lock);
-		ret = vduse_dev_dereg_umem(dev, umem.iova,
+		asid = array_index_nospec(umem.asid, dev->nas);
+		ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
 					   umem.size);
 		mutex_unlock(&dev->domain_lock);
 		break;
@@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 	case VDUSE_IOTLB_GET_INFO: {
 		struct vduse_iova_info info;
 		struct vhost_iotlb_map *map;
+		u32 asid;
 
 		ret = -EFAULT;
 		if (copy_from_user(&info, argp, sizeof(info)))
@@ -1494,23 +1579,31 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 				 sizeof(info.reserved)))
 			break;
 
+		if (dev->api_version < VDUSE_API_VERSION_1) {
+			if (info.asid)
+				break;
+		} else if (info.asid >= dev->nas)
+			break;
+
 		mutex_lock(&dev->domain_lock);
-		if (!dev->domain) {
+		asid = array_index_nospec(info.asid, dev->nas);
+		if (!dev->as[asid].domain) {
 			mutex_unlock(&dev->domain_lock);
 			break;
 		}
-		spin_lock(&dev->domain->iotlb_lock);
-		map = vhost_iotlb_itree_first(dev->domain->iotlb,
+		spin_lock(&dev->as[asid].domain->iotlb_lock);
+		map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
 					      info.start, info.last);
 		if (map) {
 			info.start = map->start;
 			info.last = map->last;
 			info.capability = 0;
-			if (dev->domain->bounce_map && map->start == 0 &&
-			    map->last == dev->domain->bounce_size - 1)
+			if (dev->as[asid].domain->bounce_map &&
+			    map->start == 0 &&
+			    map->last == dev->as[asid].domain->bounce_size - 1)
 				info.capability |= VDUSE_IOVA_CAP_UMEM;
 		}
-		spin_unlock(&dev->domain->iotlb_lock);
+		spin_unlock(&dev->as[asid].domain->iotlb_lock);
 		mutex_unlock(&dev->domain_lock);
 		if (!map)
 			break;
@@ -1535,8 +1628,10 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
 	struct vduse_dev *dev = file->private_data;
 
 	mutex_lock(&dev->domain_lock);
-	if (dev->domain)
-		vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
+	for (int i = 0; i < dev->nas; i++)
+		if (dev->as[i].domain)
+			vduse_dev_dereg_umem(dev, i, 0,
+					     dev->as[i].domain->bounce_size);
 	mutex_unlock(&dev->domain_lock);
 	spin_lock(&dev->msg_lock);
 	/* Make sure the inflight messages can be processed after reconnection */
@@ -1755,7 +1850,6 @@ static struct vduse_dev *vduse_dev_create(void)
 		return NULL;
 
 	mutex_init(&dev->lock);
-	mutex_init(&dev->mem_lock);
 	mutex_init(&dev->domain_lock);
 	spin_lock_init(&dev->msg_lock);
 	INIT_LIST_HEAD(&dev->send_list);
@@ -1806,8 +1900,11 @@ static int vduse_destroy_dev(char *name)
 	idr_remove(&vduse_idr, dev->minor);
 	kvfree(dev->config);
 	vduse_dev_deinit_vqs(dev);
-	if (dev->domain)
-		vduse_domain_destroy(dev->domain);
+	for (int i = 0; i < dev->nas; i++) {
+		if (dev->as[i].domain)
+			vduse_domain_destroy(dev->as[i].domain);
+	}
+	kfree(dev->as);
 	kfree(dev->name);
 	kfree(dev->groups);
 	vduse_dev_destroy(dev);
@@ -1854,12 +1951,17 @@ static bool vduse_validate_config(struct vduse_dev_config *config,
 			 sizeof(config->reserved)))
 		return false;
 
-	if (api_version < VDUSE_API_VERSION_1 && config->ngroups)
+	if (api_version < VDUSE_API_VERSION_1 &&
+	    (config->ngroups || config->nas))
 		return false;
 
-	if (api_version >= VDUSE_API_VERSION_1 &&
-	    (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS))
-		return false;
+	if (api_version >= VDUSE_API_VERSION_1) {
+		if (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS)
+			return false;
+
+		if (!config->nas || config->nas > VDUSE_DEV_MAX_AS)
+			return false;
+	}
 
 	if (config->vq_align > PAGE_SIZE)
 		return false;
@@ -1924,7 +2026,8 @@ static ssize_t bounce_size_store(struct device *device,
 
 	ret = -EPERM;
 	mutex_lock(&dev->domain_lock);
-	if (dev->domain)
+	/* Assuming that if the first domain is allocated, all are allocated */
+	if (dev->as[0].domain)
 		goto unlock;
 
 	ret = kstrtouint(buf, 10, &bounce_size);
@@ -1983,8 +2086,17 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 			      GFP_KERNEL);
 	if (!dev->groups)
 		goto err_vq_groups;
-	for (u32 i = 0; i < dev->ngroups; ++i)
+	for (u32 i = 0; i < dev->ngroups; ++i) {
 		dev->groups[i].dev = dev;
+		rwlock_init(&dev->groups[i].as_lock);
+	}
+
+	dev->nas = (dev->api_version < VDUSE_API_VERSION_1) ? 1 : config->nas;
+	dev->as = kcalloc(dev->nas, sizeof(dev->as[0]), GFP_KERNEL);
+	if (!dev->as)
+		goto err_as;
+	for (int i = 0; i < dev->nas; i++)
+		mutex_init(&dev->as[i].mem_lock);
 
 	dev->name = kstrdup(config->name, GFP_KERNEL);
 	if (!dev->name)
@@ -2022,6 +2134,8 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 err_idr:
 	kfree(dev->name);
 err_str:
+	kfree(dev->as);
+err_as:
 	kfree(dev->groups);
 err_vq_groups:
 	vduse_dev_destroy(dev);
@@ -2147,7 +2261,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
 
 	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
 				 &vduse_vdpa_config_ops, &vduse_map_ops,
-				 dev->ngroups, 1, name, true);
+				 dev->ngroups, dev->nas, name, true);
 	if (IS_ERR(vdev))
 		return PTR_ERR(vdev);
 
@@ -2162,7 +2276,8 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
 			const struct vdpa_dev_set_config *config)
 {
 	struct vduse_dev *dev;
-	int ret;
+	size_t domain_bounce_size;
+	int ret, i;
 
 	mutex_lock(&vduse_lock);
 	dev = vduse_find_dev(name);
@@ -2176,29 +2291,38 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
 		return ret;
 
 	mutex_lock(&dev->domain_lock);
-	if (!dev->domain)
-		dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
-						  dev->bounce_size);
-	mutex_unlock(&dev->domain_lock);
-	if (!dev->domain) {
-		ret = -ENOMEM;
-		goto domain_err;
+	ret = 0;
+
+	domain_bounce_size = dev->bounce_size / dev->nas;
+	for (i = 0; i < dev->nas; ++i) {
+		dev->as[i].domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
+							domain_bounce_size);
+		if (!dev->as[i].domain) {
+			ret = -ENOMEM;
+			goto err;
+		}
 	}
 
+	mutex_unlock(&dev->domain_lock);
+
 	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
-	if (ret) {
-		goto register_err;
-	}
+	if (ret)
+		goto err_register;
 
 	return 0;
 
-register_err:
+err_register:
 	mutex_lock(&dev->domain_lock);
-	vduse_domain_destroy(dev->domain);
-	dev->domain = NULL;
+
+err:
+	for (int j = 0; j < i; j++) {
+		if (dev->as[j].domain) {
+			vduse_domain_destroy(dev->as[j].domain);
+			dev->as[j].domain = NULL;
+		}
+	}
 	mutex_unlock(&dev->domain_lock);
 
-domain_err:
 	put_device(&dev->vdev->vdpa.dev);
 
 	return ret;
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index a3d51cf6df3a..da2c5e47990e 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -47,7 +47,8 @@ struct vduse_dev_config {
 	__u32 vq_num;
 	__u32 vq_align;
 	__u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
-	__u32 reserved[12];
+	__u32 nas; /* if VDUSE_API_VERSION >= 1 */
+	__u32 reserved[11];
 	__u32 config_size;
 	__u8 config[];
 };
@@ -82,6 +83,18 @@ struct vduse_iotlb_entry {
 	__u8 perm;
 };
 
+/**
+ * struct vduse_iotlb_entry_v2 - entry of IOTLB to describe one IOVA region in an ASID
+ * @v1: the original vduse_iotlb_entry
+ * @asid: address space ID of the IOVA region
+ *
+ * Structure used by VDUSE_IOTLB_GET_FD ioctl to find an overlapped IOVA region.
+ */
+struct vduse_iotlb_entry_v2 {
+	struct vduse_iotlb_entry v1;
+	__u32 asid;
+};
+
 /*
  * Find the first IOVA region that overlaps with the range [start, last]
  * and return the corresponding file descriptor. Return -EINVAL means the
@@ -166,6 +179,16 @@ struct vduse_vq_state_packed {
 	__u16 last_used_idx;
 };
 
+/**
+ * struct vduse_vq_group_asid - virtqueue group ASID
+ * @group: Index of the virtqueue group
+ * @asid: Address space ID of the group
+ */
+struct vduse_vq_group_asid {
+	__u32 group;
+	__u32 asid;
+};
+
 /**
  * struct vduse_vq_info - information of a virtqueue
  * @index: virtqueue index
@@ -225,6 +248,7 @@ struct vduse_vq_eventfd {
  * @uaddr: start address of userspace memory, it must be aligned to page size
  * @iova: start of the IOVA region
  * @size: size of the IOVA region
+ * @asid: Address space ID of the IOVA region
  * @reserved: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
@@ -234,7 +258,8 @@ struct vduse_iova_umem {
 	__u64 uaddr;
 	__u64 iova;
 	__u64 size;
-	__u64 reserved[3];
+	__u32 asid;
+	__u32 reserved[5];
 };
 
 /* Register userspace memory for IOVA regions */
@@ -248,6 +273,7 @@ struct vduse_iova_umem {
  * @start: start of the IOVA region
  * @last: last of the IOVA region
  * @capability: capability of the IOVA region
+ * @asid: Address space ID of the IOVA region, only if device API version >= 1
  * @reserved: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_IOTLB_GET_INFO ioctl to get information of
@@ -258,7 +284,8 @@ struct vduse_iova_info {
 	__u64 last;
 #define VDUSE_IOVA_CAP_UMEM (1 << 0)
 	__u64 capability;
-	__u64 reserved[3];
+	__u32 asid; /* Only if device API version >= 1 */
+	__u32 reserved[5];
 };
 
 /*
@@ -280,6 +307,7 @@ enum vduse_req_type {
 	VDUSE_GET_VQ_STATE,
 	VDUSE_SET_STATUS,
 	VDUSE_UPDATE_IOTLB,
+	VDUSE_SET_VQ_GROUP_ASID,
 };
 
 /**
@@ -314,6 +342,18 @@ struct vduse_iova_range {
 	__u64 last;
 };
 
+/**
+ * struct vduse_iova_range_v2 - IOVA range [start, last] if API_VERSION >= 1
+ * @start: start of the IOVA range
+ * @last: last of the IOVA range
+ * @asid: address space ID of the IOVA range
+ */
+struct vduse_iova_range_v2 {
+	__u64 start;
+	__u64 last;
+	__u32 asid;
+};
+
 /**
  * struct vduse_dev_request - control request
  * @type: request type
@@ -322,6 +362,8 @@ struct vduse_iova_range {
  * @vq_state: virtqueue state, only index field is available
  * @s: device status
  * @iova: IOVA range for updating
+ * @iova_v2: IOVA range for updating if API_VERSION >= 1
+ * @vq_group_asid: ASID of a virtqueue group
  * @padding: padding
  *
  * Structure used by read(2) on /dev/vduse/$NAME.
@@ -334,6 +376,11 @@ struct vduse_dev_request {
 		struct vduse_vq_state vq_state;
 		struct vduse_dev_status s;
 		struct vduse_iova_range iova;
+		/* All the following members except padding exist only
+		 * if vduse API version >= 1
+		 */
+		struct vduse_iova_range_v2 iova_v2;
+		struct vduse_vq_group_asid vq_group_asid;
 		__u32 padding[32];
 	};
 };
-- 
2.52.0
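
For reference, a userspace sketch of driving the extended
VDUSE_IOTLB_GET_FD lookup added above with the v1 API; dev_fd, the
queried IOVA range, and perm_to_prot() are placeholders, not part of
the series:

	/* Look up the region covering IOVA 0x1000 in ASID 1; on success
	 * the ioctl returns a file descriptor backing that region.
	 */
	struct vduse_iotlb_entry_v2 entry = {
		.v1 = { .start = 0x1000, .last = 0x1000 },
		.asid = 1,
	};
	int map_fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);

	if (map_fd >= 0) {
		size_t len = entry.v1.last - entry.v1.start + 1;
		/* perm_to_prot() is a placeholder that turns entry.v1.perm
		 * into PROT_* flags
		 */
		void *va = mmap(NULL, len, perm_to_prot(entry.v1.perm),
				MAP_SHARED, map_fd, entry.v1.offset);
	}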


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 8/8] vduse: bump version number
  2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
                   ` (6 preceding siblings ...)
  2025-12-17 11:24 ` [PATCH v10 7/8] vduse: add vq group asid support Eugenio Pérez
@ 2025-12-17 11:24 ` Eugenio Pérez
  7 siblings, 0 replies; 22+ messages in thread
From: Eugenio Pérez @ 2025-12-17 11:24 UTC (permalink / raw)
  To: Michael S . Tsirkin 
  Cc: Maxime Coquelin, Laurent Vivier, virtualization, linux-kernel,
	Stefano Garzarella, Yongji Xie, jasowang, Xuan Zhuo, Cindy Lu,
	Eugenio Pérez

Finalize the series by advertising VDUSE API v1 support to userspace.

Now that all required infrastructure for v1 (ASIDs, VQ groups,
update_iotlb_v2) is in place, VDUSE devices can opt in to the new
features.
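
As an illustration, a minimal userspace sketch of the opt-in flow,
assuming the usual /dev/vduse/control node (error handling trimmed):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/vduse.h>

	/* Negotiate the v1 API before VDUSE_CREATE_DEV so that the new
	 * ngroups/nas fields of vduse_dev_config are honoured.
	 */
	int vduse_control_open_v1(void)
	{
		__u64 api = VDUSE_API_VERSION_1;
		int fd = open("/dev/vduse/control", O_RDWR);

		if (fd < 0)
			return -1;

		if (ioctl(fd, VDUSE_SET_API_VERSION, &api) < 0) {
			close(fd);
			return -1;
		}

		return fd;
	}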

Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 786ab2378825..15860bd331a6 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -2163,7 +2163,7 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
 			break;
 
 		ret = -EINVAL;
-		if (api_version > VDUSE_API_VERSION)
+		if (api_version > VDUSE_API_VERSION_1)
 			break;
 
 		ret = 0;
@@ -2230,7 +2230,7 @@ static int vduse_open(struct inode *inode, struct file *file)
 	if (!control)
 		return -ENOMEM;
 
-	control->api_version = VDUSE_API_VERSION;
+	control->api_version = VDUSE_API_VERSION_1;
 	file->private_data = control;
 
 	return 0;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent
  2025-12-17 11:24 ` [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent Eugenio Pérez
@ 2025-12-18  5:45   ` Jason Wang
  2025-12-18  8:40     ` Eugenio Perez Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2025-12-18  5:45 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The function vduse_dev_alloc_coherent will be called under rwlock in
> the next patches.  Move the allocation out of the lock to avoid
> increasing its fail rate.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  drivers/vdpa/vdpa_user/iova_domain.c |  7 ++-----
>  drivers/vdpa/vdpa_user/iova_domain.h |  2 +-
>  drivers/vdpa/vdpa_user/vduse_dev.c   | 13 +++++++++++--
>  3 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 309cd5a039d1..0ae52890518c 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -495,14 +495,13 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>
>  void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>                                   size_t size, dma_addr_t *dma_addr,
> -                                 gfp_t flag)
> +                                 void *orig)
>  {
>         struct iova_domain *iovad = &domain->consistent_iovad;
>         unsigned long limit = domain->iova_limit;
>         dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> -       void *orig = alloc_pages_exact(size, flag);
>
> -       if (!iova || !orig)
> +       if (!iova)
>                 goto err;
>
>         spin_lock(&domain->iotlb_lock);
> @@ -519,8 +518,6 @@ void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>         return orig;
>  err:
>         *dma_addr = DMA_MAPPING_ERROR;
> -       if (orig)
> -               free_pages_exact(orig, size);
>         if (iova)
>                 vduse_domain_free_iova(iovad, iova, size);
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> index 42090cd1a622..86b7c7eadfd0 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.h
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -69,7 +69,7 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>
>  void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>                                   size_t size, dma_addr_t *dma_addr,
> -                                 gfp_t flag);
> +                                 void *orig);
>
>  void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
>                                 dma_addr_t dma_addr, unsigned long attrs);
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index a278cff7a4fa..767abcb7e375 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -924,16 +924,24 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
>         if (!token.group)
>                 return NULL;
>
> +       addr = alloc_pages_exact(size, flag);
> +       if (!addr)
> +               return NULL;
> +
>         vdev = token.group->dev;
>         domain = vdev->domain;
>         addr = vduse_domain_alloc_coherent(domain, size,
> -                                          (dma_addr_t *)&iova, flag);
> +                                          (dma_addr_t *)&iova, addr);
>         if (!addr)
> -               return NULL;
> +               goto err;
>
>         *dma_addr = (dma_addr_t)iova;
>
>         return addr;
> +
> +err:
> +       free_pages_exact(addr, size);
> +       return NULL;
>  }
>
>  static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> @@ -950,6 +958,7 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
>         domain = vdev->domain;
>
>         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> +       free_pages_exact(vaddr, size);

This looks like a double-free, as there's another free_pages_exact() in
vduse_domain_free_coherent().
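
If so, a minimal sketch of the fix, assuming vduse_domain_free_coherent()
otherwise keeps its mainline shape:

	void vduse_domain_free_coherent(struct vduse_iova_domain *domain,
					size_t size, dma_addr_t dma_addr,
					unsigned long attrs)
	{
		...
		vhost_iotlb_map_free(domain->iotlb, map);
		spin_unlock(&domain->iotlb_lock);

		vduse_domain_free_iova(iovad, dma_addr, size);
		/* No free_pages_exact() here anymore: the caller,
		 * vduse_dev_free_coherent(), owns and frees the pages.
		 */
	}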

Thanks

>  }
>
>  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
> --
> 2.52.0
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-17 11:24 ` [PATCH v10 7/8] vduse: add vq group asid support Eugenio Pérez
@ 2025-12-18  6:44   ` Jason Wang
  2025-12-18 13:10     ` Eugenio Perez Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2025-12-18  6:44 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> group.  This enables mapping each group into a distinct memory space.
>
> The vq group to ASID association is now protected by a rwlock.  But the
> mutex domain_lock keeps protecting the domains of all ASIDs, as some
> operations, like the ones related to the bounce buffer size, still
> require locking all the ASIDs.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> ---
> Future improvements on top could include moving to per-ASID locks for
> performance, or hardening by tracking the ASID or an ASID hash in
> unused bits of the DMA address.
>
> Tested virtio_vdpa by manually adding two threads in vduse_set_status:
> one of them modifies the vq group 0 ASID and the other one maps and
> unmaps memory continuously.  After a while, the two threads stop and
> the usual work continues.
>
> Tested with vhost_vdpa by migrating a VM while keeping a ping running on
> OVS+VDUSE.  A few workarounds were needed in some parts:
> * Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
>   the enable message to the userland device.  This will be solved in the
>   future.
> * Share the suspended state between all vhost devices in QEMU:
>   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> * Implement a fake VDUSE suspend vdpa operation callback that always
>   returns true in the kernel.  DPDK suspends the device at the first
>   GET_VRING_BASE.
> * Remove the CVQ blocker in ASID.
>
> ---
> v10:
> * Back to rwlock version so stronger locks are used.
> * Take out allocations from rwlock.
> * Forbid changing ASID of a vq group after DRIVER_OK (Jason)
> * Remove the redundant re-fetch of the domain variable in
>   vduse_dev_max_mapping_size (Yongji).
> * Remove unused vdev definition in vdpa map_ops callbacks (kernel test
>   robot).
>
> v9:
> * Replace mutex with rwlock, as the vdpa map_ops can run from atomic
>   context.
>
> v8:
> * Revert the mutex to rwlock change, it needs proper profiling to
>   justify it.
>
> v7:
> * Take write lock in the error path (Jason).
>
> v6:
> * Make vdpa_dev_add use gotos for error handling (MST).
> * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
>   (MST).
> * Fix struct name not matching in the doc.
>
> v5:
> * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
>   ioctl (Jason).
> * Properly set domain bounce size to divide equally between nas (Jason).
> * Exclude "padding" member from the only >V1 members in
>   vduse_dev_request.
>
> v4:
> * Divide each domain bounce size between the device bounce size (Jason).
> * revert unneeded addr = NULL assignment (Jason)
> * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
>   return; } (Jason)
> * Change a bad multiline comment, using @ character instead of * (Jason).
> * Consider config->nas == 0 as a fail (Jason).
>
> v3:
> * Get the vduse domain through the vduse_as in the map functions
>   (Jason).
> * Squash with the patch creating the vduse_as struct (Jason).
> * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
>   (Jason)
>
> v2:
> * Convert the use of mutex to rwlock.
>
> RFC v3:
> * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
>   value to reduce memory consumption, but vqs are already limited to
>   that value and userspace VDUSE is able to allocate that many vqs.
> * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
>   VDUSE_IOTLB_GET_INFO.
> * Use of array_index_nospec in VDUSE device ioctls.
> * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> * Move the umem mutex to asid struct so there is no contention between
>   ASIDs.
>
> RFC v2:
> * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
>   part of the struct is the same.
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
>  include/uapi/linux/vduse.h         |  53 ++++-
>  2 files changed, 295 insertions(+), 124 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 767abcb7e375..786ab2378825 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -41,6 +41,7 @@
>
>  #define VDUSE_DEV_MAX (1U << MINORBITS)
>  #define VDUSE_DEV_MAX_GROUPS 0xffff
> +#define VDUSE_DEV_MAX_AS 0xffff
>  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
>  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
>  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> @@ -86,7 +87,15 @@ struct vduse_umem {
>         struct mm_struct *mm;
>  };
>
> +struct vduse_as {
> +       struct vduse_iova_domain *domain;
> +       struct vduse_umem *umem;
> +       struct mutex mem_lock;

Not related to this patch, but if I'm not wrong we have a 1:1 mapping
between domain and as. If this is true, can we use bounce_lock instead
of a new mem_lock? I see mem_lock is only used for synchronizing umem
reg/dereg, which has already been synchronized with the domain rwlock.

> +};
> +
>  struct vduse_vq_group {
> +       rwlock_t as_lock;
> +       struct vduse_as *as; /* Protected by as_lock */
>         struct vduse_dev *dev;
>  };
>
> @@ -94,7 +103,7 @@ struct vduse_dev {
>         struct vduse_vdpa *vdev;
>         struct device *dev;
>         struct vduse_virtqueue **vqs;
> -       struct vduse_iova_domain *domain;
> +       struct vduse_as *as;
>         char *name;
>         struct mutex lock;
>         spinlock_t msg_lock;
> @@ -122,9 +131,8 @@ struct vduse_dev {
>         u32 vq_num;
>         u32 vq_align;
>         u32 ngroups;
> -       struct vduse_umem *umem;
> +       u32 nas;
>         struct vduse_vq_group *groups;
> -       struct mutex mem_lock;
>         unsigned int bounce_size;
>         struct mutex domain_lock;
>  };
> @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
>         return vduse_dev_msg_sync(dev, &msg);
>  }
>
> -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
>                                   u64 start, u64 last)
>  {
>         struct vduse_dev_msg msg = { 0 };
> @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>                 return -EINVAL;
>
>         msg.req.type = VDUSE_UPDATE_IOTLB;
> -       msg.req.iova.start = start;
> -       msg.req.iova.last = last;
> +       if (dev->api_version < VDUSE_API_VERSION_1) {
> +               msg.req.iova.start = start;
> +               msg.req.iova.last = last;
> +       } else {
> +               msg.req.iova_v2.start = start;
> +               msg.req.iova_v2.last = last;
> +               msg.req.iova_v2.asid = asid;
> +       }
>
>         return vduse_dev_msg_sync(dev, &msg);
>  }
> @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>         return mask;
>  }
>
> +/* Force set the asid to a vq group without a message to the VDUSE device */
> +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> +                                      unsigned int group, unsigned int asid)
> +{
> +       write_lock(&dev->groups[group].as_lock);
> +       dev->groups[group].as = &dev->as[asid];
> +       write_unlock(&dev->groups[group].as_lock);
> +}
> +
>  static void vduse_dev_reset(struct vduse_dev *dev)
>  {
>         int i;
> -       struct vduse_iova_domain *domain = dev->domain;
>
>         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> -       if (domain && domain->bounce_map)
> -               vduse_domain_reset_bounce_map(domain);
> +       for (i = 0; i < dev->nas; i++) {
> +               struct vduse_iova_domain *domain = dev->as[i].domain;
> +
> +               if (domain && domain->bounce_map)
> +                       vduse_domain_reset_bounce_map(domain);

Btw, I see this:

void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
{
        if (!domain->bounce_map)
                return;

        spin_lock(&domain->iotlb_lock);
        if (!domain->bounce_map)
                goto unlock;


The bounce_map is checked twice, let's fix that.
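
Something along these lines (sketch; the elided teardown body is
unchanged):

	void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
	{
		spin_lock(&domain->iotlb_lock);
		if (!domain->bounce_map)
			goto unlock;

		/* ... existing bounce map teardown ... */
	unlock:
		spin_unlock(&domain->iotlb_lock);
	}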

> +       }
> +
> +       for (i = 0; i < dev->ngroups; i++)
> +               vduse_set_group_asid_nomsg(dev, i, 0);

Note that this function still does:

                vq->vq_group = 0;

Which is wrong.

>
>         down_write(&dev->rwsem);
>
> @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
>         return ret;
>  }
>
> +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> +                               unsigned int asid)
> +{
> +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +       struct vduse_dev_msg msg = { 0 };
> +       int r;
> +
> +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> +           group >= dev->ngroups || asid >= dev->nas ||
> +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> +               return -EINVAL;

If we only allow setting the group asid before DRIVER_OK, why do we
still need a rwlock? All we need to do is to synchronize
set_group_asid() with set_status()/reset()?

Or if you want to synchronize map ops with set_status(), that looks
like an independent thing (hardening).

> +
> +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> +       msg.req.vq_group_asid.group = group;
> +       msg.req.vq_group_asid.asid = asid;
> +
> +       r = vduse_dev_msg_sync(dev, &msg);
> +       if (r < 0)
> +               return r;
> +
> +       vduse_set_group_asid_nomsg(dev, group, asid);

I'm not sure this has been discussed before, but I think it would be
better to introduce a new ioctl to get the group -> as mapping. This
helps to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
require userspace to poll the vduse fd before DRIVER_OK.
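
For example (the ioctl name is purely hypothetical, reusing struct
vduse_vq_group_asid from this patch):

	/* Hypothetical VDUSE_VQ_GET_GROUP_ASID: userspace queries the
	 * current group -> ASID binding on demand instead of reading a
	 * VDUSE_SET_VQ_GROUP_ASID message from the vduse fd.
	 */
	struct vduse_vq_group_asid binding = { .group = 0 };

	if (ioctl(dev_fd, VDUSE_VQ_GET_GROUP_ASID, &binding) == 0)
		use_group_asid(binding.group, binding.asid); /* placeholder */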

> +       return 0;
> +}
> +
>  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>                                 struct vdpa_vq_state *state)
>  {
> @@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>         int ret;
>
> -       ret = vduse_domain_set_map(dev->domain, iotlb);
> +       ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
>         if (ret)
>                 return ret;
>
> -       ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +       ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
>         if (ret) {
> -               vduse_domain_clear_map(dev->domain, iotlb);
> +               vduse_domain_clear_map(dev->as[asid].domain, iotlb);
>                 return ret;
>         }
>
> @@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
>         .reset                  = vduse_vdpa_reset,
>         .set_map                = vduse_vdpa_set_map,
> +       .set_group_asid         = vduse_set_group_asid,
>         .get_vq_map             = vduse_get_vq_map,
>         .free                   = vduse_vdpa_free,
>  };
> @@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
>                                              dma_addr_t dma_addr, size_t size,
>                                              enum dma_data_direction dir)
>  {
> -       struct vduse_dev *vdev;
>         struct vduse_iova_domain *domain;
>
>         if (!token.group)
>                 return;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> -
> +       read_lock(&token.group->as_lock);

I think we could optimize the lock here. E.g. when nas is 1, we don't
need any lock at all.
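
E.g. a sketch along these lines:

	/* Sketch: with a single address space the group -> as binding can
	 * never change, so the read lock is pure overhead on the DMA path.
	 */
	if (token.group->dev->nas > 1)
		read_lock(&token.group->as_lock);
	domain = token.group->as->domain;
	vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
	if (token.group->dev->nas > 1)
		read_unlock(&token.group->as_lock);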

> +       domain = token.group->as->domain;
>         vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> +       read_unlock(&token.group->as_lock);
>  }
>
>  static void vduse_dev_sync_single_for_cpu(union virtio_map token,
>                                              dma_addr_t dma_addr, size_t size,
>                                              enum dma_data_direction dir)
>  {
> -       struct vduse_dev *vdev;
>         struct vduse_iova_domain *domain;
>
>         if (!token.group)
>                 return;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> -
> +       read_lock(&token.group->as_lock);
> +       domain = token.group->as->domain;
>         vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> +       read_unlock(&token.group->as_lock);
>  }
>
>  static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> @@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
>                                      enum dma_data_direction dir,
>                                      unsigned long attrs)
>  {
> -       struct vduse_dev *vdev;
>         struct vduse_iova_domain *domain;
> +       dma_addr_t r;
>
>         if (!token.group)
>                 return DMA_MAPPING_ERROR;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> +       read_lock(&token.group->as_lock);
> +       domain = token.group->as->domain;
> +       r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +       read_unlock(&token.group->as_lock);
>
> -       return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +       return r;
>  }
>
>  static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
>                                  size_t size, enum dma_data_direction dir,
>                                  unsigned long attrs)
>  {
> -       struct vduse_dev *vdev;
>         struct vduse_iova_domain *domain;
>
>         if (!token.group)
>                 return;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> -
> -       return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +       read_lock(&token.group->as_lock);
> +       domain = token.group->as->domain;
> +       vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +       read_unlock(&token.group->as_lock);
>  }
>
>  static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
>                                       dma_addr_t *dma_addr, gfp_t flag)
>  {
> -       struct vduse_dev *vdev;
>         struct vduse_iova_domain *domain;
>         unsigned long iova;
>         void *addr;
> @@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
>         if (!addr)
>                 return NULL;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> +       *dma_addr = (dma_addr_t)iova;

Any reason we need to touch *dma_addr here? It might trigger UBSAN/KMSAN.

> +       read_lock(&token.group->as_lock);
> +       domain = token.group->as->domain;
>         addr = vduse_domain_alloc_coherent(domain, size,
>                                            (dma_addr_t *)&iova, addr);
>         if (!addr)
>                 goto err;
>
>         *dma_addr = (dma_addr_t)iova;
> +       read_unlock(&token.group->as_lock);
>
>         return addr;
>
>  err:
> +       read_unlock(&token.group->as_lock);
>         free_pages_exact(addr, size);
>         return NULL;
>  }
> @@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
>                                     void *vaddr, dma_addr_t dma_addr,
>                                     unsigned long attrs)
>  {
> -       struct vduse_dev *vdev;
>         struct vduse_iova_domain *domain;
>
>         if (!token.group)
>                 return;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> -
> +       read_lock(&token.group->as_lock);
> +       domain = token.group->as->domain;
>         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> +       read_unlock(&token.group->as_lock);
>         free_pages_exact(vaddr, size);
>  }
>
>  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
>  {
> -       struct vduse_dev *vdev;
> -       struct vduse_iova_domain *domain;
> +       size_t bounce_size;
>
>         if (!token.group)
>                 return false;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> +       read_lock(&token.group->as_lock);
> +       bounce_size = token.group->as->domain->bounce_size;
> +       read_unlock(&token.group->as_lock);
>
> -       return dma_addr < domain->bounce_size;
> +       return dma_addr < bounce_size;
>  }
>
>  static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> @@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
>
>  static size_t vduse_dev_max_mapping_size(union virtio_map token)
>  {
> -       struct vduse_dev *vdev;
> -       struct vduse_iova_domain *domain;
> +       size_t bounce_size;
>
>         if (!token.group)
>                 return 0;
>
> -       vdev = token.group->dev;
> -       domain = vdev->domain;
> +       read_lock(&token.group->as_lock);
> +       bounce_size = token.group->as->domain->bounce_size;
> +       read_unlock(&token.group->as_lock);
>
> -       return domain->bounce_size;
> +       return bounce_size;
>  }
>
>  static const struct virtio_map_ops vduse_map_ops = {
> @@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
>         return ret;
>  }
>
> -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
>                                 u64 iova, u64 size)
>  {
>         int ret;
>
> -       mutex_lock(&dev->mem_lock);
> +       mutex_lock(&dev->as[asid].mem_lock);
>         ret = -ENOENT;
> -       if (!dev->umem)
> +       if (!dev->as[asid].umem)
>                 goto unlock;
>
>         ret = -EINVAL;
> -       if (!dev->domain)
> +       if (!dev->as[asid].domain)
>                 goto unlock;
>
> -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> +       if (dev->as[asid].umem->iova != iova ||
> +           size != dev->as[asid].domain->bounce_size)
>                 goto unlock;
>
> -       vduse_domain_remove_user_bounce_pages(dev->domain);
> -       unpin_user_pages_dirty_lock(dev->umem->pages,
> -                                   dev->umem->npages, true);
> -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> -       mmdrop(dev->umem->mm);
> -       vfree(dev->umem->pages);
> -       kfree(dev->umem);
> -       dev->umem = NULL;
> +       vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
> +       unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
> +                                   dev->as[asid].umem->npages, true);
> +       atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
> +       mmdrop(dev->as[asid].umem->mm);
> +       vfree(dev->as[asid].umem->pages);
> +       kfree(dev->as[asid].umem);
> +       dev->as[asid].umem = NULL;
>         ret = 0;
>  unlock:
> -       mutex_unlock(&dev->mem_lock);
> +       mutex_unlock(&dev->as[asid].mem_lock);
>         return ret;
>  }
>
>  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> -                             u64 iova, u64 uaddr, u64 size)
> +                             u32 asid, u64 iova, u64 uaddr, u64 size)
>  {
>         struct page **page_list = NULL;
>         struct vduse_umem *umem = NULL;
> @@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>         unsigned long npages, lock_limit;
>         int ret;
>
> -       if (!dev->domain || !dev->domain->bounce_map ||
> -           size != dev->domain->bounce_size ||
> +       if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
> +           size != dev->as[asid].domain->bounce_size ||
>             iova != 0 || uaddr & ~PAGE_MASK)
>                 return -EINVAL;
>
> -       mutex_lock(&dev->mem_lock);
> +       mutex_lock(&dev->as[asid].mem_lock);
>         ret = -EEXIST;
> -       if (dev->umem)
> +       if (dev->as[asid].umem)
>                 goto unlock;
>
>         ret = -ENOMEM;
> @@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>                 goto out;
>         }
>
> -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> +       ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
>                                                  page_list, pinned);
>         if (ret)
>                 goto out;
> @@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>         umem->mm = current->mm;
>         mmgrab(current->mm);
>
> -       dev->umem = umem;
> +       dev->as[asid].umem = umem;
>  out:
>         if (ret && pinned > 0)
>                 unpin_user_pages(page_list, pinned);
> @@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>                 vfree(page_list);
>                 kfree(umem);
>         }
> -       mutex_unlock(&dev->mem_lock);
> +       mutex_unlock(&dev->as[asid].mem_lock);
>         return ret;
>  }
>
> @@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>
>         switch (cmd) {
>         case VDUSE_IOTLB_GET_FD: {
> -               struct vduse_iotlb_entry entry;
> +               struct vduse_iotlb_entry_v2 entry;

Nit: if we stick with the v1 entry struct and do copy_from_user() twice,
it might save lots of unnecessary changes.
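
Something like this sketch, assuming the asid sits right after the
padded v1 struct, as in vduse_iotlb_entry_v2:

	struct vduse_iotlb_entry entry;
	u32 asid = 0;

	ret = -EFAULT;
	if (copy_from_user(&entry, argp, sizeof(entry)))
		break;

	if (dev->api_version >= VDUSE_API_VERSION_1 &&
	    copy_from_user(&asid, argp + sizeof(entry), sizeof(asid)))
		break;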

>                 struct vhost_iotlb_map *map;
>                 struct vdpa_map_file *map_file;
>                 struct file *f = NULL;
> +               u32 asid;
>
>                 ret = -EFAULT;
> -               if (copy_from_user(&entry, argp, sizeof(entry)))
> -                       break;
> +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> +                               break;
> +               } else {
> +                       entry.asid = 0;
> +                       if (copy_from_user(&entry.v1, argp,
> +                                          sizeof(entry.v1)))
> +                               break;
> +               }
>
>                 ret = -EINVAL;
> -               if (entry.start > entry.last)
> +               if (entry.v1.start > entry.v1.last)
> +                       break;
> +
> +               if (entry.asid >= dev->nas)
>                         break;
>
>                 mutex_lock(&dev->domain_lock);
> -               if (!dev->domain) {
> +               asid = array_index_nospec(entry.asid, dev->nas);
> +               if (!dev->as[asid].domain) {
>                         mutex_unlock(&dev->domain_lock);
>                         break;
>                 }
> -               spin_lock(&dev->domain->iotlb_lock);
> -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> -                                             entry.start, entry.last);
> +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> +                                             entry.v1.start, entry.v1.last);
>                 if (map) {
>                         map_file = (struct vdpa_map_file *)map->opaque;
>                         f = get_file(map_file->file);
> -                       entry.offset = map_file->offset;
> -                       entry.start = map->start;
> -                       entry.last = map->last;
> -                       entry.perm = map->perm;
> +                       entry.v1.offset = map_file->offset;
> +                       entry.v1.start = map->start;
> +                       entry.v1.last = map->last;
> +                       entry.v1.perm = map->perm;
>                 }
> -               spin_unlock(&dev->domain->iotlb_lock);
> +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
>                 mutex_unlock(&dev->domain_lock);
>                 ret = -EINVAL;
>                 if (!f)
>                         break;
>
> -               ret = -EFAULT;
> -               if (copy_to_user(argp, &entry, sizeof(entry))) {
> +               if (dev->api_version >= VDUSE_API_VERSION_1)
> +                       ret = copy_to_user(argp, &entry,
> +                                          sizeof(entry));
> +               else
> +                       ret = copy_to_user(argp, &entry.v1,
> +                                          sizeof(entry.v1));
> +
> +               if (ret) {
> +                       ret = -EFAULT;
>                         fput(f);
>                         break;
>                 }
> -               ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> +               ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
>                 fput(f);
>                 break;
>         }
> @@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>         }
>         case VDUSE_IOTLB_REG_UMEM: {
>                 struct vduse_iova_umem umem;
> +               u32 asid;
>
>                 ret = -EFAULT;
>                 if (copy_from_user(&umem, argp, sizeof(umem)))
> @@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>
>                 ret = -EINVAL;
>                 if (!is_mem_zero((const char *)umem.reserved,
> -                                sizeof(umem.reserved)))
> +                                sizeof(umem.reserved)) ||
> +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> +                    umem.asid != 0) || umem.asid >= dev->nas)
>                         break;
>
>                 mutex_lock(&dev->domain_lock);
> -               ret = vduse_dev_reg_umem(dev, umem.iova,
> +               asid = array_index_nospec(umem.asid, dev->nas);
> +               ret = vduse_dev_reg_umem(dev, asid, umem.iova,
>                                          umem.uaddr, umem.size);
>                 mutex_unlock(&dev->domain_lock);
>                 break;
>         }
>         case VDUSE_IOTLB_DEREG_UMEM: {
>                 struct vduse_iova_umem umem;
> +               u32 asid;
>
>                 ret = -EFAULT;
>                 if (copy_from_user(&umem, argp, sizeof(umem)))
> @@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>
>                 ret = -EINVAL;
>                 if (!is_mem_zero((const char *)umem.reserved,
> -                                sizeof(umem.reserved)))
> +                                sizeof(umem.reserved)) ||
> +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> +                    umem.asid != 0) ||
> +                    umem.asid >= dev->nas)
>                         break;
> +
>                 mutex_lock(&dev->domain_lock);
> -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> +               asid = array_index_nospec(umem.asid, dev->nas);
> +               ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
>                                            umem.size);
>                 mutex_unlock(&dev->domain_lock);
>                 break;
> @@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>         case VDUSE_IOTLB_GET_INFO: {

Btw I see this:

                dev->vqs[index]->vq_group = config.group;

In VDUSE_VQ_SETUP:

I wonder why it is not part of CREATE_DEV? I mean, it might be racy if
DMA happens between CREATE_DEV and VDUSE_VQ_SETUP.

>                 struct vduse_iova_info info;
>                 struct vhost_iotlb_map *map;
> +               u32 asid;
>
>                 ret = -EFAULT;
>                 if (copy_from_user(&info, argp, sizeof(info)))
> @@ -1494,23 +1579,31 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>                                  sizeof(info.reserved)))
>                         break;
>
> +               if (dev->api_version < VDUSE_API_VERSION_1) {
> +                       if (info.asid)
> +                               break;
> +               } else if (info.asid >= dev->nas)
> +                       break;
> +
>                 mutex_lock(&dev->domain_lock);
> -               if (!dev->domain) {
> +               asid = array_index_nospec(info.asid, dev->nas);
> +               if (!dev->as[asid].domain) {
>                         mutex_unlock(&dev->domain_lock);
>                         break;
>                 }
> -               spin_lock(&dev->domain->iotlb_lock);
> -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
>                                               info.start, info.last);
>                 if (map) {
>                         info.start = map->start;
>                         info.last = map->last;
>                         info.capability = 0;
> -                       if (dev->domain->bounce_map && map->start == 0 &&
> -                           map->last == dev->domain->bounce_size - 1)
> +                       if (dev->as[asid].domain->bounce_map &&
> +                           map->start == 0 &&
> +                           map->last == dev->as[asid].domain->bounce_size - 1)
>                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
>                 }
> -               spin_unlock(&dev->domain->iotlb_lock);
> +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
>                 mutex_unlock(&dev->domain_lock);
>                 if (!map)
>                         break;
> @@ -1535,8 +1628,10 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
>         struct vduse_dev *dev = file->private_data;
>
>         mutex_lock(&dev->domain_lock);
> -       if (dev->domain)
> -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> +       for (int i = 0; i < dev->nas; i++)
> +               if (dev->as[i].domain)
> +                       vduse_dev_dereg_umem(dev, i, 0,
> +                                            dev->as[i].domain->bounce_size);
>         mutex_unlock(&dev->domain_lock);
>         spin_lock(&dev->msg_lock);
>         /* Make sure the inflight messages can be processed after reconnection */
> @@ -1755,7 +1850,6 @@ static struct vduse_dev *vduse_dev_create(void)
>                 return NULL;
>
>         mutex_init(&dev->lock);
> -       mutex_init(&dev->mem_lock);
>         mutex_init(&dev->domain_lock);
>         spin_lock_init(&dev->msg_lock);
>         INIT_LIST_HEAD(&dev->send_list);
> @@ -1806,8 +1900,11 @@ static int vduse_destroy_dev(char *name)
>         idr_remove(&vduse_idr, dev->minor);
>         kvfree(dev->config);
>         vduse_dev_deinit_vqs(dev);
> -       if (dev->domain)
> -               vduse_domain_destroy(dev->domain);
> +       for (int i = 0; i < dev->nas; i++) {
> +               if (dev->as[i].domain)
> +                       vduse_domain_destroy(dev->as[i].domain);
> +       }
> +       kfree(dev->as);
>         kfree(dev->name);
>         kfree(dev->groups);
>         vduse_dev_destroy(dev);
> @@ -1854,12 +1951,17 @@ static bool vduse_validate_config(struct vduse_dev_config *config,
>                          sizeof(config->reserved)))
>                 return false;
>
> -       if (api_version < VDUSE_API_VERSION_1 && config->ngroups)
> +       if (api_version < VDUSE_API_VERSION_1 &&
> +           (config->ngroups || config->nas))
>                 return false;
>
> -       if (api_version >= VDUSE_API_VERSION_1 &&
> -           (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS))
> -               return false;
> +       if (api_version >= VDUSE_API_VERSION_1) {
> +               if (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS)
> +                       return false;
> +
> +               if (!config->nas || config->nas > VDUSE_DEV_MAX_AS)
> +                       return false;
> +       }
>
>         if (config->vq_align > PAGE_SIZE)
>                 return false;
> @@ -1924,7 +2026,8 @@ static ssize_t bounce_size_store(struct device *device,
>
>         ret = -EPERM;
>         mutex_lock(&dev->domain_lock);
> -       if (dev->domain)
> +       /* Assuming that if the first domain is allocated, all are allocated */
> +       if (dev->as[0].domain)
>                 goto unlock;

Not for this patch, but I don't understand why we need to check dev->domain here.

>
>         ret = kstrtouint(buf, 10, &bounce_size);
> @@ -1983,8 +2086,17 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>                               GFP_KERNEL);
>         if (!dev->groups)
>                 goto err_vq_groups;
> -       for (u32 i = 0; i < dev->ngroups; ++i)
> +       for (u32 i = 0; i < dev->ngroups; ++i) {
>                 dev->groups[i].dev = dev;
> +               rwlock_init(&dev->groups[i].as_lock);
> +       }
> +
> +       dev->nas = (dev->api_version < VDUSE_API_VERSION_1) ? 1 : config->nas;
> +       dev->as = kcalloc(dev->nas, sizeof(dev->as[0]), GFP_KERNEL);
> +       if (!dev->as)
> +               goto err_as;
> +       for (int i = 0; i < dev->nas; i++)
> +               mutex_init(&dev->as[i].mem_lock);
>
>         dev->name = kstrdup(config->name, GFP_KERNEL);
>         if (!dev->name)
> @@ -2022,6 +2134,8 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>  err_idr:
>         kfree(dev->name);
>  err_str:
> +       kfree(dev->as);
> +err_as:
>         kfree(dev->groups);
>  err_vq_groups:
>         vduse_dev_destroy(dev);
> @@ -2147,7 +2261,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>
>         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
>                                  &vduse_vdpa_config_ops, &vduse_map_ops,
> -                                dev->ngroups, 1, name, true);
> +                                dev->ngroups, dev->nas, name, true);
>         if (IS_ERR(vdev))
>                 return PTR_ERR(vdev);
>
> @@ -2162,7 +2276,8 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
>                         const struct vdpa_dev_set_config *config)
>  {
>         struct vduse_dev *dev;
> -       int ret;
> +       size_t domain_bounce_size;
> +       int ret, i;
>
>         mutex_lock(&vduse_lock);
>         dev = vduse_find_dev(name);
> @@ -2176,29 +2291,38 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
>                 return ret;
>
>         mutex_lock(&dev->domain_lock);
> -       if (!dev->domain)
> -               dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> -                                                 dev->bounce_size);
> -       mutex_unlock(&dev->domain_lock);
> -       if (!dev->domain) {
> -               ret = -ENOMEM;
> -               goto domain_err;
> +       ret = 0;
> +
> +       domain_bounce_size = dev->bounce_size / dev->nas;
> +       for (i = 0; i < dev->nas; ++i) {
> +               dev->as[i].domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> +                                                       domain_bounce_size);
> +               if (!dev->as[i].domain) {
> +                       ret = -ENOMEM;
> +                       goto err;
> +               }
>         }
>
> +       mutex_unlock(&dev->domain_lock);
> +
>         ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> -       if (ret) {
> -               goto register_err;
> -       }
> +       if (ret)
> +               goto err_register;
>
>         return 0;
>
> -register_err:
> +err_register:
>         mutex_lock(&dev->domain_lock);
> -       vduse_domain_destroy(dev->domain);
> -       dev->domain = NULL;
> +
> +err:
> +       for (int j = 0; j < i; j++) {
> +               if (dev->as[j].domain) {
> +                       vduse_domain_destroy(dev->as[j].domain);
> +                       dev->as[j].domain = NULL;
> +               }
> +       }
>         mutex_unlock(&dev->domain_lock);
>
> -domain_err:
>         put_device(&dev->vdev->vdpa.dev);
>
>         return ret;
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index a3d51cf6df3a..da2c5e47990e 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -47,7 +47,8 @@ struct vduse_dev_config {
>         __u32 vq_num;
>         __u32 vq_align;
>         __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> -       __u32 reserved[12];
> +       __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> +       __u32 reserved[11];
>         __u32 config_size;
>         __u8 config[];
>  };
> @@ -82,6 +83,18 @@ struct vduse_iotlb_entry {
>         __u8 perm;
>  };
>
> +/**
> + * struct vduse_iotlb_entry_v2 - entry of IOTLB to describe one IOVA region in an ASID
> + * @v1: the original vduse_iotlb_entry
> + * @asid: address space ID of the IOVA region
> + *
> + * Structure used by VDUSE_IOTLB_GET_FD ioctl to find an overlapped IOVA region.
> + */
> +struct vduse_iotlb_entry_v2 {
> +       struct vduse_iotlb_entry v1;
> +       __u32 asid;
> +};
> +
>  /*
>   * Find the first IOVA region that overlaps with the range [start, last]
>   * and return the corresponding file descriptor. Return -EINVAL means the
> @@ -166,6 +179,16 @@ struct vduse_vq_state_packed {
>         __u16 last_used_idx;
>  };
>
> +/**
> + * struct vduse_vq_group_asid - virtqueue group ASID
> + * @group: Index of the virtqueue group
> + * @asid: Address space ID of the group
> + */
> +struct vduse_vq_group_asid {
> +       __u32 group;
> +       __u32 asid;
> +};
> +
>  /**
>   * struct vduse_vq_info - information of a virtqueue
>   * @index: virtqueue index
> @@ -225,6 +248,7 @@ struct vduse_vq_eventfd {
>   * @uaddr: start address of userspace memory, it must be aligned to page size
>   * @iova: start of the IOVA region
>   * @size: size of the IOVA region
> + * @asid: Address space ID of the IOVA region
>   * @reserved: for future use, needs to be initialized to zero
>   *
>   * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> @@ -234,7 +258,8 @@ struct vduse_iova_umem {
>         __u64 uaddr;
>         __u64 iova;
>         __u64 size;
> -       __u64 reserved[3];
> +       __u32 asid;
> +       __u32 reserved[5];
>  };
>
>  /* Register userspace memory for IOVA regions */
> @@ -248,6 +273,7 @@ struct vduse_iova_umem {
>   * @start: start of the IOVA region
>   * @last: last of the IOVA region
>   * @capability: capability of the IOVA region
> + * @asid: Address space ID of the IOVA region, only if device API version >= 1
>   * @reserved: for future use, needs to be initialized to zero
>   *
>   * Structure used by VDUSE_IOTLB_GET_INFO ioctl to get information of
> @@ -258,7 +284,8 @@ struct vduse_iova_info {
>         __u64 last;
>  #define VDUSE_IOVA_CAP_UMEM (1 << 0)
>         __u64 capability;
> -       __u64 reserved[3];
> +       __u32 asid; /* Only if device API version >= 1 */
> +       __u32 reserved[5];
>  };
>
>  /*
> @@ -280,6 +307,7 @@ enum vduse_req_type {
>         VDUSE_GET_VQ_STATE,
>         VDUSE_SET_STATUS,
>         VDUSE_UPDATE_IOTLB,
> +       VDUSE_SET_VQ_GROUP_ASID,
>  };
>
>  /**
> @@ -314,6 +342,18 @@ struct vduse_iova_range {
>         __u64 last;
>  };
>
> +/**
> + * struct vduse_iova_range_v2 - IOVA range [start, last] if API_VERSION >= 1
> + * @start: start of the IOVA range
> + * @last: last of the IOVA range
> + * @asid: address space ID of the IOVA range
> + */
> +struct vduse_iova_range_v2 {
> +       __u64 start;
> +       __u64 last;
> +       __u32 asid;
> +};
> +
>  /**
>   * struct vduse_dev_request - control request
>   * @type: request type
> @@ -322,6 +362,8 @@ struct vduse_iova_range {
>   * @vq_state: virtqueue state, only index field is available
>   * @s: device status
>   * @iova: IOVA range for updating
> + * @iova_v2: IOVA range for updating if API_VERSION >= 1
> + * @vq_group_asid: ASID of a virtqueue group
>   * @padding: padding
>   *
>   * Structure used by read(2) on /dev/vduse/$NAME.
> @@ -334,6 +376,11 @@ struct vduse_dev_request {
>                 struct vduse_vq_state vq_state;
>                 struct vduse_dev_status s;
>                 struct vduse_iova_range iova;
> +               /* All the following members except padding exist only
> +                * if vduse API version >= 1
> +                */
> +               struct vduse_iova_range_v2 iova_v2;
> +               struct vduse_vq_group_asid vq_group_asid;
>                 __u32 padding[32];
>         };
>  };
> --
> 2.52.0
>

Thanks


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 2/8] vduse: add vq group support
  2025-12-17 11:24 ` [PATCH v10 2/8] vduse: add vq group support Eugenio Pérez
@ 2025-12-18  6:46   ` Jason Wang
  2025-12-18 10:06     ` Eugenio Perez Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2025-12-18  6:46 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This allows separating the different virtqueues into groups that share
> the same address space.  The VDUSE device is asked for the group of each
> vq at the beginning, as the groups are needed for the DMA API.
>
> Allocating 3 vq groups, as net is the device that needs the most groups:
> * Dataplane (guest passthrough)
> * CVQ
> * Shadowed vrings.
>
> Future versions of the series can include dynamic allocation of the
> groups array so VDUSE can declare more groups.
>
> Acked-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> v6:
> * s/sepparate/separate (MST).
> * s/dev->api_version < 1/dev->api_version < VDUSE_API_VERSION_1
>
> v5:
> * Revert core vdpa changes (Jason).
> * Fix group == ngroup case in checking VQ_SETUP argument (Jason).
>
> v4:
> * Revert the "invalid vq group" concept and assume 0 if not set (Jason).
> * Make config->ngroups == 0 invalid (Jason).
>
> v3:
> * Make the default group an invalid group as long as VDUSE device does
>   not set it to some valid u32 value.  Modify the vdpa core to take that
>   into account (Jason).
> * Create the VDUSE_DEV_MAX_GROUPS instead of using a magic number
>
> v2:
> * Now the vq group is in vduse_vq_config struct instead of issuing one
>   VDUSE message per vq.
>
> v1:
> * Fix: Remove BIT_ULL(VIRTIO_S_*), as _S_ is already the bit (Maxime)
>
> RFC v3:
> * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason).  It was set to a lower
>   value to reduce memory consumption, but vqs are already limited to
>   that value and userspace VDUSE is able to allocate that many vqs.
> * Remove the descs vq group capability as it will not be used and we can
>   add it on top.
> * Do not ask for vq groups if the number of vq groups is < 2.
> * Move the valid vq groups range check to vduse_validate_config.
>
> RFC v2:
> * Cache group information in kernel, as we need to provide the vq map
>   tokens properly.
> * Add descs vq group to optimize SVQ forwarding and support indirect
>   descriptors out of the box.
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 48 ++++++++++++++++++++++++++----
>  include/uapi/linux/vduse.h         | 12 ++++++--
>  2 files changed, 52 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index ae357d014564..b012dc3557b9 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -39,6 +39,7 @@
>  #define DRV_LICENSE  "GPL v2"
>
>  #define VDUSE_DEV_MAX (1U << MINORBITS)
> +#define VDUSE_DEV_MAX_GROUPS 0xffff
>  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
>  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
>  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> @@ -58,6 +59,7 @@ struct vduse_virtqueue {
>         struct vdpa_vq_state state;
>         bool ready;
>         bool kicked;
> +       u32 vq_group;

Nit: since we are in the context of a virtqueue, I'd rename this to
simply "group".

Thanks


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent
  2025-12-18  5:45   ` Jason Wang
@ 2025-12-18  8:40     ` Eugenio Perez Martin
  0 siblings, 0 replies; 22+ messages in thread
From: Eugenio Perez Martin @ 2025-12-18  8:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Thu, Dec 18, 2025 at 6:45 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The function vduse_dev_alloc_coherent will be called under a rwlock
> > in the next patches.  Move the allocation out of the lock to avoid
> > increasing its failure rate.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  drivers/vdpa/vdpa_user/iova_domain.c |  7 ++-----
> >  drivers/vdpa/vdpa_user/iova_domain.h |  2 +-
> >  drivers/vdpa/vdpa_user/vduse_dev.c   | 13 +++++++++++--
> >  3 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > index 309cd5a039d1..0ae52890518c 100644
> > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > @@ -495,14 +495,13 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> >
> >  void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> >                                   size_t size, dma_addr_t *dma_addr,
> > -                                 gfp_t flag)
> > +                                 void *orig)
> >  {
> >         struct iova_domain *iovad = &domain->consistent_iovad;
> >         unsigned long limit = domain->iova_limit;
> >         dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> > -       void *orig = alloc_pages_exact(size, flag);
> >
> > -       if (!iova || !orig)
> > +       if (!iova)
> >                 goto err;
> >
> >         spin_lock(&domain->iotlb_lock);
> > @@ -519,8 +518,6 @@ void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> >         return orig;
> >  err:
> >         *dma_addr = DMA_MAPPING_ERROR;
> > -       if (orig)
> > -               free_pages_exact(orig, size);
> >         if (iova)
> >                 vduse_domain_free_iova(iovad, iova, size);
> >
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > index 42090cd1a622..86b7c7eadfd0 100644
> > --- a/drivers/vdpa/vdpa_user/iova_domain.h
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > @@ -69,7 +69,7 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> >
> >  void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> >                                   size_t size, dma_addr_t *dma_addr,
> > -                                 gfp_t flag);
> > +                                 void *orig);
> >
> >  void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> >                                 dma_addr_t dma_addr, unsigned long attrs);
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index a278cff7a4fa..767abcb7e375 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -924,16 +924,24 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> >         if (!token.group)
> >                 return NULL;
> >
> > +       addr = alloc_pages_exact(size, flag);
> > +       if (!addr)
> > +               return NULL;
> > +
> >         vdev = token.group->dev;
> >         domain = vdev->domain;
> >         addr = vduse_domain_alloc_coherent(domain, size,
> > -                                          (dma_addr_t *)&iova, flag);
> > +                                          (dma_addr_t *)&iova, addr);
> >         if (!addr)
> > -               return NULL;
> > +               goto err;
> >
> >         *dma_addr = (dma_addr_t)iova;
> >
> >         return addr;
> > +
> > +err:
> > +       free_pages_exact(addr, size);
> > +       return NULL;
> >  }
> >
> >  static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> > @@ -950,6 +958,7 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> >         domain = vdev->domain;
> >
> >         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> > +       free_pages_exact(vaddr, size);
>
> This looks like a double-free, as there's another free_pages_exact()
> in vduse_domain_free_coherent.
>

You're right, removing the duplicated one. Thanks!
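
To be explicit, the end state I have in mind is that the pages are
owned and freed only by the map-ops caller, with the free_pages_exact()
inside vduse_domain_free_coherent() dropped, i.e. (untested sketch):

static void vduse_dev_free_coherent(union virtio_map token, size_t size,
				    void *vaddr, dma_addr_t dma_addr,
				    unsigned long attrs)
{
	struct vduse_dev *vdev;
	struct vduse_iova_domain *domain;

	if (!token.group)
		return;

	vdev = token.group->dev;
	domain = vdev->domain;

	/* No longer frees the pages, only the IOTLB entry and the IOVA */
	vduse_domain_free_coherent(domain, size, dma_addr, attrs);
	free_pages_exact(vaddr, size);
}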


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 2/8] vduse: add vq group support
  2025-12-18  6:46   ` Jason Wang
@ 2025-12-18 10:06     ` Eugenio Perez Martin
  0 siblings, 0 replies; 22+ messages in thread
From: Eugenio Perez Martin @ 2025-12-18 10:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Thu, Dec 18, 2025 at 7:47 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This allows separating the different virtqueues into groups that share
> > the same address space.  The VDUSE device is asked for the group of
> > each vq at the beginning, as the groups are needed for the DMA API.
> >
> > Three vq groups are allocated, as net is the device that needs the
> > most groups:
> > * Dataplane (guest passthrough)
> > * CVQ
> > * Shadowed vrings.
> >
> > Future versions of the series can include dynamic allocation of the
> > groups array so VDUSE can declare more groups.
> >
> > Acked-by: Jason Wang <jasowang@redhat.com>
> > Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> > v6:
> > * s/sepparate/separate (MST).
> > * s/dev->api_version < 1/dev->api_version < VDUSE_API_VERSION_1
> >
> > v5:
> > * Revert core vdpa changes (Jason).
> > * Fix group == ngroup case in checking VQ_SETUP argument (Jason).
> >
> > v4:
> > * Revert the "invalid vq group" concept and assume 0 if not set (Jason).
> > * Make config->ngroups == 0 invalid (Jason).
> >
> > v3:
> > * Make the default group an invalid group as long as VDUSE device does
> >   not set it to some valid u32 value.  Modify the vdpa core to take that
> >   into account (Jason).
> > * Create the VDUSE_DEV_MAX_GROUPS instead of using a magic number
> >
> > v2:
> > * Now the vq group is in vduse_vq_config struct instead of issuing one
> >   VDUSE message per vq.
> >
> > v1:
> > * Fix: Remove BIT_ULL(VIRTIO_S_*), as _S_ is already the bit (Maxime)
> >
> > RFC v3:
> > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason).  It was set to a lower
> >   value to reduce memory consumption, but vqs are already limited to
> >   that value and userspace VDUSE is able to allocate that many vqs.
> > * Remove the descs vq group capability as it will not be used and we can
> >   add it on top.
> > * Do not ask for vq groups if the number of vq groups is < 2.
> > * Move the valid vq groups range check to vduse_validate_config.
> >
> > RFC v2:
> > * Cache group information in kernel, as we need to provide the vq map
> >   tokens properly.
> > * Add descs vq group to optimize SVQ forwarding and support indirect
> >   descriptors out of the box.
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 48 ++++++++++++++++++++++++++----
> >  include/uapi/linux/vduse.h         | 12 ++++++--
> >  2 files changed, 52 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index ae357d014564..b012dc3557b9 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -39,6 +39,7 @@
> >  #define DRV_LICENSE  "GPL v2"
> >
> >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > +#define VDUSE_DEV_MAX_GROUPS 0xffff
> >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > @@ -58,6 +59,7 @@ struct vduse_virtqueue {
> >         struct vdpa_vq_state state;
> >         bool ready;
> >         bool kicked;
> > +       u32 vq_group;
>
> Nit: since we are in the context of a virtqueue, I'd rename this to
> simply "group".

Sure, I'll change it in the next version.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-18  6:44   ` Jason Wang
@ 2025-12-18 13:10     ` Eugenio Perez Martin
  2025-12-23  1:11       ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Eugenio Perez Martin @ 2025-12-18 13:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > group.  This enables mapping each group into a distinct memory space.
> >
> > The vq group to ASID association is now protected by a rwlock.  But
> > the mutex domain_lock keeps protecting the domains of all ASIDs, as
> > some operations, like the ones related to the bounce buffer size,
> > still require locking all the ASIDs.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > ---
> > Future improvements on top can include performance optimizations,
> > such as moving to per-ASID locks, or hardening by tracking the ASID
> > or ASID hashes in unused bits of the DMA address.
> >
> > Tested virtio_vdpa by manually adding two threads in
> > vduse_set_status: one of them modifies the vq group 0 ASID and the
> > other one maps and unmaps memory continuously.  After a while, the
> > two threads stop and the usual work continues.
> >
> > Tested with vhost_vdpa by migrating a VM while keeping a ping alive
> > on OVS+VDUSE.  A few workarounds were needed:
> > * Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
> >   the enable message to the userland device.  This will be solved in the
> >   future.
> > * Share the suspended state between all vhost devices in QEMU:
> >   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> > * Implement a fake VDUSE suspend vdpa operation callback that always
> >   returns true in the kernel.  DPDK suspends the device at the first
> >   GET_VRING_BASE.
> > * Remove the CVQ blocker in ASID.
> >
> > ---
> > v10:
> > * Back to rwlock version so stronger locks are used.
> > * Take out allocations from rwlock.
> > * Forbid changing ASID of a vq group after DRIVER_OK (Jason)
> > * Remove bad fetching again of domain variable in
> >   vduse_dev_max_mapping_size (Yongji).
> > * Remove unused vdev definition in vdpa map_ops callbacks (kernel test
> >   robot).
> >
> > v9:
> > * Replace mutex with rwlock, as the vdpa map_ops can run from atomic
> >   context.
> >
> > v8:
> > * Revert the mutex to rwlock change, it needs proper profiling to
> >   justify it.
> >
> > v7:
> > * Take write lock in the error path (Jason).
> >
> > v6:
> > * Make vdpa_dev_add use gotos for error handling (MST).
> > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> >   (MST).
> > * Fix struct name not matching in the doc.
> >
> > v5:
> > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> >   ioctl (Jason).
> > * Properly set domain bounce size to divide equally between nas (Jason).
> > * Exclude "padding" member from the only >V1 members in
> >   vduse_dev_request.
> >
> > v4:
> > * Divide each domain bounce size between the device bounce size (Jason).
> > * Revert unneeded addr = NULL assignment (Jason)
> > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> >   return; } (Jason)
> > * Change a bad multiline comment, using @ character instead of * (Jason).
> > * Consider config->nas == 0 as a fail (Jason).
> >
> > v3:
> > * Get the vduse domain through the vduse_as in the map functions
> >   (Jason).
> > * Squash with the patch creating the vduse_as struct (Jason).
> > * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
> >   (Jason)
> >
> > v2:
> > * Convert the use of mutex to rwlock.
> >
> > RFC v3:
> > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> >   value to reduce memory consumption, but vqs are already limited to
> >   that value and userspace VDUSE is able to allocate that many vqs.
> > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> >   VDUSE_IOTLB_GET_INFO.
> > * Use of array_index_nospec in VDUSE device ioctls.
> > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > * Move the umem mutex to asid struct so there is no contention between
> >   ASIDs.
> >
> > RFC v2:
> > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> >   part of the struct is the same.
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
> >  include/uapi/linux/vduse.h         |  53 ++++-
> >  2 files changed, 295 insertions(+), 124 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index 767abcb7e375..786ab2378825 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -41,6 +41,7 @@
> >
> >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > +#define VDUSE_DEV_MAX_AS 0xffff
> >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > @@ -86,7 +87,15 @@ struct vduse_umem {
> >         struct mm_struct *mm;
> >  };
> >
> > +struct vduse_as {
> > +       struct vduse_iova_domain *domain;
> > +       struct vduse_umem *umem;
> > +       struct mutex mem_lock;
>
> Not related to this patch, but if I'm not wrong we have a 1:1 mapping
> between domain and AS. If this is true, can we use bounce_lock instead
> of a new mem_lock? I see mem_lock is only used for synchronizing umem
> reg/dereg, which is already synchronized with the domain rwlock.
>

I think you're right, but they work at different levels at the moment.
The mem_lock lives in vduse_dev and also protects the umem pointer,
while bounce_lock lives in iova_domain.c.

Maybe the right thing to do is to move umem into iova_domain. Yongji
Xie, what do you think?
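
If we go that way, I'd imagine something like this (untested, just to
illustrate the idea; the umem pointer would then be covered by
bounce_lock):

--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ struct vduse_iova_domain {
+	struct vduse_umem *umem;	/* protected by bounce_lock */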

> > +};
> > +
> >  struct vduse_vq_group {
> > +       rwlock_t as_lock;
> > +       struct vduse_as *as; /* Protected by as_lock */
> >         struct vduse_dev *dev;
> >  };
> >
> > @@ -94,7 +103,7 @@ struct vduse_dev {
> >         struct vduse_vdpa *vdev;
> >         struct device *dev;
> >         struct vduse_virtqueue **vqs;
> > -       struct vduse_iova_domain *domain;
> > +       struct vduse_as *as;
> >         char *name;
> >         struct mutex lock;
> >         spinlock_t msg_lock;
> > @@ -122,9 +131,8 @@ struct vduse_dev {
> >         u32 vq_num;
> >         u32 vq_align;
> >         u32 ngroups;
> > -       struct vduse_umem *umem;
> > +       u32 nas;
> >         struct vduse_vq_group *groups;
> > -       struct mutex mem_lock;
> >         unsigned int bounce_size;
> >         struct mutex domain_lock;
> >  };
> > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> >         return vduse_dev_msg_sync(dev, &msg);
> >  }
> >
> > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> >                                   u64 start, u64 last)
> >  {
> >         struct vduse_dev_msg msg = { 0 };
> > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> >                 return -EINVAL;
> >
> >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > -       msg.req.iova.start = start;
> > -       msg.req.iova.last = last;
> > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > +               msg.req.iova.start = start;
> > +               msg.req.iova.last = last;
> > +       } else {
> > +               msg.req.iova_v2.start = start;
> > +               msg.req.iova_v2.last = last;
> > +               msg.req.iova_v2.asid = asid;
> > +       }
> >
> >         return vduse_dev_msg_sync(dev, &msg);
> >  }
> > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >         return mask;
> >  }
> >
> > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > +                                      unsigned int group, unsigned int asid)
> > +{
> > +       write_lock(&dev->groups[group].as_lock);
> > +       dev->groups[group].as = &dev->as[asid];
> > +       write_unlock(&dev->groups[group].as_lock);
> > +}
> > +
> >  static void vduse_dev_reset(struct vduse_dev *dev)
> >  {
> >         int i;
> > -       struct vduse_iova_domain *domain = dev->domain;
> >
> >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > -       if (domain && domain->bounce_map)
> > -               vduse_domain_reset_bounce_map(domain);
> > +       for (i = 0; i < dev->nas; i++) {
> > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > +
> > +               if (domain && domain->bounce_map)
> > +                       vduse_domain_reset_bounce_map(domain);
>
> Btw, I see this:
>
> void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> {
>         if (!domain->bounce_map)
>                 return;
>
>         spin_lock(&domain->iotlb_lock);
>         if (!domain->bounce_map)
>                 goto unlock;
>
>
> The bounce_map is checked twice, let's fix that.
>

Double-checked locking to avoid taking the lock? I don't think it is
worth keeping, as it is not in the hot path anyway. But that would also
be another patch, independent of this series, wouldn't it?
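
Something like this on top, then (untested):

--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
 {
-	if (!domain->bounce_map)
-		return;
-
 	spin_lock(&domain->iotlb_lock);
 	if (!domain->bounce_map)
 		goto unlock;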

> > +       }
> > +
> > +       for (i = 0; i < dev->ngroups; i++)
> > +               vduse_set_group_asid_nomsg(dev, i, 0);
>
> Note that this function still does:
>
>                 vq->vq_group = 0;
>
> Which is wrong.
>

Right, removing it for the next version. Thanks for the catch!

> >
> >         down_write(&dev->rwsem);
> >
> > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> >         return ret;
> >  }
> >
> > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > +                               unsigned int asid)
> > +{
> > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +       struct vduse_dev_msg msg = { 0 };
> > +       int r;
> > +
> > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > +           group >= dev->ngroups || asid >= dev->nas ||
> > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > +               return -EINVAL;
>
> If we forbid setting the group ASID after DRIVER_OK, why do we still
> need a rwlock?

virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
vrings in the bounce buffer, for example. If you're ok with that, I'm
ok with removing the lock, as all the calls are issued by the driver
setup process anyway. Or should we keep it just for alloc?

Anyway, I think I misunderstood your comment from [1] then.

> All we need to do is to synchronize set_group_asid() with
> set_status()/reset()?
>

That's also a good one. There is no synchronization if one thread calls
reset and the device is then set up from another thread. As this
situation is still hypothetical, because virtio_vdpa does not support
set_group_asid and the vhost one is already protected by the vhost
lock, do we need it?

> Or if you want to synchronize map ops with set_status() that looks
> like an independent thing (hardening).
>
> > +
> > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > +       msg.req.vq_group_asid.group = group;
> > +       msg.req.vq_group_asid.asid = asid;
> > +
> > +       r = vduse_dev_msg_sync(dev, &msg);
> > +       if (r < 0)
> > +               return r;
> > +
> > +       vduse_set_group_asid_nomsg(dev, group, asid);
>
> I'm not sure this has been discussed before, but I think it would be
> better to introduce a new ioctl to get the group -> AS mapping. This
> helps avoid vduse_dev_msg_sync() as much as possible, and it doesn't
> require userspace to poll the vduse fd before DRIVER_OK.
>

I'm fine with that, but how do we communicate that they have changed?
Or how do we communicate to the driver that the device does not accept
assigning that ASID to the group?
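
Just to check we mean the same thing: from userspace it would look
something like this, reusing struct vduse_vq_group_asid from this
series (the ioctl name below is made up):

	struct vduse_vq_group_asid map = { .group = 2 };

	/* Hypothetical ioctl: read back the current ASID of vq group 2 */
	if (ioctl(dev_fd, VDUSE_VQ_GROUP_GET_ASID, &map) == 0)
		printf("vq group %u -> asid %u\n", map.group, map.asid);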

> > +       return 0;
> > +}
> > +
> >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >                                 struct vdpa_vq_state *state)
> >  {
> > @@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >         int ret;
> >
> > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > +       ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> >         if (ret)
> >                 return ret;
> >
> > -       ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > +       ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> >         if (ret) {
> > -               vduse_domain_clear_map(dev->domain, iotlb);
> > +               vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> >                 return ret;
> >         }
> >
> > @@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> >         .reset                  = vduse_vdpa_reset,
> >         .set_map                = vduse_vdpa_set_map,
> > +       .set_group_asid         = vduse_set_group_asid,
> >         .get_vq_map             = vduse_get_vq_map,
> >         .free                   = vduse_vdpa_free,
> >  };
> > @@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> >                                              dma_addr_t dma_addr, size_t size,
> >                                              enum dma_data_direction dir)
> >  {
> > -       struct vduse_dev *vdev;
> >         struct vduse_iova_domain *domain;
> >
> >         if (!token.group)
> >                 return;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > -
> > +       read_lock(&token.group->as_lock);
>
> I think we could optimize the lock here. E.g. when nas is 1, we don't
> need any lock at all.
>

Good point! I'll skip the lock in that case in the next version, thanks!
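
For the archive, the idea is something along these lines in each map op
(untested sketch):

static void vduse_dev_sync_single_for_device(union virtio_map token,
					     dma_addr_t dma_addr, size_t size,
					     enum dma_data_direction dir)
{
	struct vduse_iova_domain *domain;

	if (!token.group)
		return;

	if (token.group->dev->nas == 1) {
		/* A single AS: group->as can never change, so the
		 * rwlock is not needed.
		 */
		vduse_domain_sync_single_for_device(token.group->as->domain,
						    dma_addr, size, dir);
		return;
	}

	read_lock(&token.group->as_lock);
	domain = token.group->as->domain;
	vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
	read_unlock(&token.group->as_lock);
}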

> > +       domain = token.group->as->domain;
> >         vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > +       read_unlock(&token.group->as_lock);
> >  }
> >
> >  static void vduse_dev_sync_single_for_cpu(union virtio_map token,
> >                                              dma_addr_t dma_addr, size_t size,
> >                                              enum dma_data_direction dir)
> >  {
> > -       struct vduse_dev *vdev;
> >         struct vduse_iova_domain *domain;
> >
> >         if (!token.group)
> >                 return;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > -
> > +       read_lock(&token.group->as_lock);
> > +       domain = token.group->as->domain;
> >         vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > +       read_unlock(&token.group->as_lock);
> >  }
> >
> >  static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > @@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> >                                      enum dma_data_direction dir,
> >                                      unsigned long attrs)
> >  {
> > -       struct vduse_dev *vdev;
> >         struct vduse_iova_domain *domain;
> > +       dma_addr_t r;
> >
> >         if (!token.group)
> >                 return DMA_MAPPING_ERROR;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > +       read_lock(&token.group->as_lock);
> > +       domain = token.group->as->domain;
> > +       r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +       read_unlock(&token.group->as_lock);
> >
> > -       return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +       return r;
> >  }
> >
> >  static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
> >                                  size_t size, enum dma_data_direction dir,
> >                                  unsigned long attrs)
> >  {
> > -       struct vduse_dev *vdev;
> >         struct vduse_iova_domain *domain;
> >
> >         if (!token.group)
> >                 return;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > -
> > -       return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +       read_lock(&token.group->as_lock);
> > +       domain = token.group->as->domain;
> > +       vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +       read_unlock(&token.group->as_lock);
> >  }
> >
> >  static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> >                                       dma_addr_t *dma_addr, gfp_t flag)
> >  {
> > -       struct vduse_dev *vdev;
> >         struct vduse_iova_domain *domain;
> >         unsigned long iova;
> >         void *addr;
> > @@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> >         if (!addr)
> >                 return NULL;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > +       *dma_addr = (dma_addr_t)iova;
>
> Any reason we need to touch *dma_addr here? It might trigger UBSAN/KMSAN.
>

No, this is a leftover. I'm fixing it for the next version. Thanks!

> > +       read_lock(&token.group->as_lock);
> > +       domain = token.group->as->domain;
> >         addr = vduse_domain_alloc_coherent(domain, size,
> >                                            (dma_addr_t *)&iova, addr);
> >         if (!addr)
> >                 goto err;
> >
> >         *dma_addr = (dma_addr_t)iova;
> > +       read_unlock(&token.group->as_lock);
> >
> >         return addr;
> >
> >  err:
> > +       read_unlock(&token.group->as_lock);
> >         free_pages_exact(addr, size);
> >         return NULL;
> >  }
> > @@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> >                                     void *vaddr, dma_addr_t dma_addr,
> >                                     unsigned long attrs)
> >  {
> > -       struct vduse_dev *vdev;
> >         struct vduse_iova_domain *domain;
> >
> >         if (!token.group)
> >                 return;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > -
> > +       read_lock(&token.group->as_lock);
> > +       domain = token.group->as->domain;
> >         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> > +       read_unlock(&token.group->as_lock);
> >         free_pages_exact(vaddr, size);
> >  }
> >
> >  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
> >  {
> > -       struct vduse_dev *vdev;
> > -       struct vduse_iova_domain *domain;
> > +       size_t bounce_size;
> >
> >         if (!token.group)
> >                 return false;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > +       read_lock(&token.group->as_lock);
> > +       bounce_size = token.group->as->domain->bounce_size;
> > +       read_unlock(&token.group->as_lock);
> >
> > -       return dma_addr < domain->bounce_size;
> > +       return dma_addr < bounce_size;
> >  }
> >
> >  static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > @@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> >
> >  static size_t vduse_dev_max_mapping_size(union virtio_map token)
> >  {
> > -       struct vduse_dev *vdev;
> > -       struct vduse_iova_domain *domain;
> > +       size_t bounce_size;
> >
> >         if (!token.group)
> >                 return 0;
> >
> > -       vdev = token.group->dev;
> > -       domain = vdev->domain;
> > +       read_lock(&token.group->as_lock);
> > +       bounce_size = token.group->as->domain->bounce_size;
> > +       read_unlock(&token.group->as_lock);
> >
> > -       return domain->bounce_size;
> > +       return bounce_size;
> >  }
> >
> >  static const struct virtio_map_ops vduse_map_ops = {
> > @@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> >         return ret;
> >  }
> >
> > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> >                                 u64 iova, u64 size)
> >  {
> >         int ret;
> >
> > -       mutex_lock(&dev->mem_lock);
> > +       mutex_lock(&dev->as[asid].mem_lock);
> >         ret = -ENOENT;
> > -       if (!dev->umem)
> > +       if (!dev->as[asid].umem)
> >                 goto unlock;
> >
> >         ret = -EINVAL;
> > -       if (!dev->domain)
> > +       if (!dev->as[asid].domain)
> >                 goto unlock;
> >
> > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > +       if (dev->as[asid].umem->iova != iova ||
> > +           size != dev->as[asid].domain->bounce_size)
> >                 goto unlock;
> >
> > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > -                                   dev->umem->npages, true);
> > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > -       mmdrop(dev->umem->mm);
> > -       vfree(dev->umem->pages);
> > -       kfree(dev->umem);
> > -       dev->umem = NULL;
> > +       vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
> > +       unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
> > +                                   dev->as[asid].umem->npages, true);
> > +       atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
> > +       mmdrop(dev->as[asid].umem->mm);
> > +       vfree(dev->as[asid].umem->pages);
> > +       kfree(dev->as[asid].umem);
> > +       dev->as[asid].umem = NULL;
> >         ret = 0;
> >  unlock:
> > -       mutex_unlock(&dev->mem_lock);
> > +       mutex_unlock(&dev->as[asid].mem_lock);
> >         return ret;
> >  }
> >
> >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > -                             u64 iova, u64 uaddr, u64 size)
> > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> >  {
> >         struct page **page_list = NULL;
> >         struct vduse_umem *umem = NULL;
> > @@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >         unsigned long npages, lock_limit;
> >         int ret;
> >
> > -       if (!dev->domain || !dev->domain->bounce_map ||
> > -           size != dev->domain->bounce_size ||
> > +       if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
> > +           size != dev->as[asid].domain->bounce_size ||
> >             iova != 0 || uaddr & ~PAGE_MASK)
> >                 return -EINVAL;
> >
> > -       mutex_lock(&dev->mem_lock);
> > +       mutex_lock(&dev->as[asid].mem_lock);
> >         ret = -EEXIST;
> > -       if (dev->umem)
> > +       if (dev->as[asid].umem)
> >                 goto unlock;
> >
> >         ret = -ENOMEM;
> > @@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >                 goto out;
> >         }
> >
> > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > +       ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
> >                                                  page_list, pinned);
> >         if (ret)
> >                 goto out;
> > @@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >         umem->mm = current->mm;
> >         mmgrab(current->mm);
> >
> > -       dev->umem = umem;
> > +       dev->as[asid].umem = umem;
> >  out:
> >         if (ret && pinned > 0)
> >                 unpin_user_pages(page_list, pinned);
> > @@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> >                 vfree(page_list);
> >                 kfree(umem);
> >         }
> > -       mutex_unlock(&dev->mem_lock);
> > +       mutex_unlock(&dev->as[asid].mem_lock);
> >         return ret;
> >  }
> >
> > @@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >
> >         switch (cmd) {
> >         case VDUSE_IOTLB_GET_FD: {
> > -               struct vduse_iotlb_entry entry;
> > +               struct vduse_iotlb_entry_v2 entry;
>
> Nit: if we stick with struct vduse_iotlb_entry and do copy_from_user()
> twice, it might save lots of unnecessary changes.
>

I'm happy to move to something else, but most of the changes happen
because of s/entry/entry.v1/. If we stick with just vduse_iotlb_entry
and a separate asid variable, we also need to duplicate the
copy_from_user() [2].
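
For reference, that alternative would look something like this
(untested; using offsetof() so it does not depend on there being no
padding after the embedded v1 entry):

	struct vduse_iotlb_entry entry;
	u32 asid = 0;

	ret = -EFAULT;
	if (copy_from_user(&entry, argp, sizeof(entry)))
		break;
	if (dev->api_version >= VDUSE_API_VERSION_1 &&
	    copy_from_user(&asid, argp +
			   offsetof(struct vduse_iotlb_entry_v2, asid),
			   sizeof(asid)))
		break;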

> >                 struct vhost_iotlb_map *map;
> >                 struct vdpa_map_file *map_file;
> >                 struct file *f = NULL;
> > +               u32 asid;
> >
> >                 ret = -EFAULT;
> > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > -                       break;
> > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                               break;
> > +               } else {
> > +                       entry.asid = 0;
> > +                       if (copy_from_user(&entry.v1, argp,
> > +                                          sizeof(entry.v1)))
> > +                               break;
> > +               }
> >
> >                 ret = -EINVAL;
> > -               if (entry.start > entry.last)
> > +               if (entry.v1.start > entry.v1.last)
> > +                       break;
> > +
> > +               if (entry.asid >= dev->nas)
> >                         break;
> >
> >                 mutex_lock(&dev->domain_lock);
> > -               if (!dev->domain) {
> > +               asid = array_index_nospec(entry.asid, dev->nas);
> > +               if (!dev->as[asid].domain) {
> >                         mutex_unlock(&dev->domain_lock);
> >                         break;
> >                 }
> > -               spin_lock(&dev->domain->iotlb_lock);
> > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > -                                             entry.start, entry.last);
> > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > +                                             entry.v1.start, entry.v1.last);
> >                 if (map) {
> >                         map_file = (struct vdpa_map_file *)map->opaque;
> >                         f = get_file(map_file->file);
> > -                       entry.offset = map_file->offset;
> > -                       entry.start = map->start;
> > -                       entry.last = map->last;
> > -                       entry.perm = map->perm;
> > +                       entry.v1.offset = map_file->offset;
> > +                       entry.v1.start = map->start;
> > +                       entry.v1.last = map->last;
> > +                       entry.v1.perm = map->perm;
> >                 }
> > -               spin_unlock(&dev->domain->iotlb_lock);
> > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> >                 mutex_unlock(&dev->domain_lock);
> >                 ret = -EINVAL;
> >                 if (!f)
> >                         break;
> >
> > -               ret = -EFAULT;
> > -               if (copy_to_user(argp, &entry, sizeof(entry))) {
> > +               if (dev->api_version >= VDUSE_API_VERSION_1)
> > +                       ret = copy_to_user(argp, &entry,
> > +                                          sizeof(entry));
> > +               else
> > +                       ret = copy_to_user(argp, &entry.v1,
> > +                                          sizeof(entry.v1));
> > +
> > +               if (ret) {
> > +                       ret = -EFAULT;
> >                         fput(f);
> >                         break;
> >                 }
> > -               ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> > +               ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
> >                 fput(f);
> >                 break;
> >         }
> > @@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >         }
> >         case VDUSE_IOTLB_REG_UMEM: {
> >                 struct vduse_iova_umem umem;
> > +               u32 asid;
> >
> >                 ret = -EFAULT;
> >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > @@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >
> >                 ret = -EINVAL;
> >                 if (!is_mem_zero((const char *)umem.reserved,
> > -                                sizeof(umem.reserved)))
> > +                                sizeof(umem.reserved)) ||
> > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > +                    umem.asid != 0) || umem.asid >= dev->nas)
> >                         break;
> >
> >                 mutex_lock(&dev->domain_lock);
> > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > +               asid = array_index_nospec(umem.asid, dev->nas);
> > +               ret = vduse_dev_reg_umem(dev, asid, umem.iova,
> >                                          umem.uaddr, umem.size);
> >                 mutex_unlock(&dev->domain_lock);
> >                 break;
> >         }
> >         case VDUSE_IOTLB_DEREG_UMEM: {
> >                 struct vduse_iova_umem umem;
> > +               u32 asid;
> >
> >                 ret = -EFAULT;
> >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > @@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >
> >                 ret = -EINVAL;
> >                 if (!is_mem_zero((const char *)umem.reserved,
> > -                                sizeof(umem.reserved)))
> > +                                sizeof(umem.reserved)) ||
> > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > +                    umem.asid != 0) ||
> > +                    umem.asid >= dev->nas)
> >                         break;
> > +
> >                 mutex_lock(&dev->domain_lock);
> > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > +               asid = array_index_nospec(umem.asid, dev->nas);
> > +               ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
> >                                            umem.size);
> >                 mutex_unlock(&dev->domain_lock);
> >                 break;
> > @@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >         case VDUSE_IOTLB_GET_INFO: {
>
> Btw, in VDUSE_VQ_SETUP I see this:
>
>                 dev->vqs[index]->vq_group = config.group;
>
> I wonder why this is not part of CREATE_DEV? I mean, it might be racy
> if DMA happens between CREATE_DEV and VDUSE_VQ_SETUP.
>

The reason the vq index -> vq group association cannot be part of
device creation is that we need CVQ to be isolated for live migration,
but the device doesn't know the CVQ index at CREATE_DEV time, only
after the feature negotiation happens [3].
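
For example, for virtio-net (illustrative helper, not code from this
series; it ignores VIRTIO_NET_F_RSS, which also places the cvq right
after the data vqs):

/* The cvq index depends on the negotiated features, so it is only
 * known after feature negotiation, not at CREATE_DEV time.
 */
static u16 virtio_net_cvq_index(u64 features, u16 max_vq_pairs)
{
	/* With MQ the data vqs occupy indexes [0, 2 * max_vq_pairs);
	 * without it only vq 0 (rx) and vq 1 (tx) exist.  The cvq
	 * comes right after the data vqs in both cases.
	 */
	if (features & BIT_ULL(VIRTIO_NET_F_MQ))
		return max_vq_pairs * 2;
	return 2;
}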

[1] https://lore.kernel.org/lkml/CACGkMEvRQ86dYeY3Enqoj1vkSpefU3roq4XGS+y5B5kmsXEkYg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CACGkMEtszQeZLTegxEbjODYxu-giTvURu=pKj4kYTHQYoKOzkQ@mail.gmail.com
[3] https://lore.kernel.org/lkml/CAJaqyWcvHx7kwcTceN2jazT0nKNo1r5zdzqWHqpxdna-kCS1RA@mail.gmail.com

> >                 struct vduse_iova_info info;
> >                 struct vhost_iotlb_map *map;
> > +               u32 asid;
> >
> >                 ret = -EFAULT;
> >                 if (copy_from_user(&info, argp, sizeof(info)))
> > @@ -1494,23 +1579,31 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >                                  sizeof(info.reserved)))
> >                         break;
> >
> > +               if (dev->api_version < VDUSE_API_VERSION_1) {
> > +                       if (info.asid)
> > +                               break;
> > +               } else if (info.asid >= dev->nas)
> > +                       break;
> > +
> >                 mutex_lock(&dev->domain_lock);
> > -               if (!dev->domain) {
> > +               asid = array_index_nospec(info.asid, dev->nas);
> > +               if (!dev->as[asid].domain) {
> >                         mutex_unlock(&dev->domain_lock);
> >                         break;
> >                 }
> > -               spin_lock(&dev->domain->iotlb_lock);
> > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> >                                               info.start, info.last);
> >                 if (map) {
> >                         info.start = map->start;
> >                         info.last = map->last;
> >                         info.capability = 0;
> > -                       if (dev->domain->bounce_map && map->start == 0 &&
> > -                           map->last == dev->domain->bounce_size - 1)
> > +                       if (dev->as[asid].domain->bounce_map &&
> > +                           map->start == 0 &&
> > +                           map->last == dev->as[asid].domain->bounce_size - 1)
> >                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
> >                 }
> > -               spin_unlock(&dev->domain->iotlb_lock);
> > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> >                 mutex_unlock(&dev->domain_lock);
> >                 if (!map)
> >                         break;
> > @@ -1535,8 +1628,10 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> >         struct vduse_dev *dev = file->private_data;
> >
> >         mutex_lock(&dev->domain_lock);
> > -       if (dev->domain)
> > -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > +       for (int i = 0; i < dev->nas; i++)
> > +               if (dev->as[i].domain)
> > +                       vduse_dev_dereg_umem(dev, i, 0,
> > +                                            dev->as[i].domain->bounce_size);
> >         mutex_unlock(&dev->domain_lock);
> >         spin_lock(&dev->msg_lock);
> >         /* Make sure the inflight messages can be processed after reconnection */
> > @@ -1755,7 +1850,6 @@ static struct vduse_dev *vduse_dev_create(void)
> >                 return NULL;
> >
> >         mutex_init(&dev->lock);
> > -       mutex_init(&dev->mem_lock);
> >         mutex_init(&dev->domain_lock);
> >         spin_lock_init(&dev->msg_lock);
> >         INIT_LIST_HEAD(&dev->send_list);
> > @@ -1806,8 +1900,11 @@ static int vduse_destroy_dev(char *name)
> >         idr_remove(&vduse_idr, dev->minor);
> >         kvfree(dev->config);
> >         vduse_dev_deinit_vqs(dev);
> > -       if (dev->domain)
> > -               vduse_domain_destroy(dev->domain);
> > +       for (int i = 0; i < dev->nas; i++) {
> > +               if (dev->as[i].domain)
> > +                       vduse_domain_destroy(dev->as[i].domain);
> > +       }
> > +       kfree(dev->as);
> >         kfree(dev->name);
> >         kfree(dev->groups);
> >         vduse_dev_destroy(dev);
> > @@ -1854,12 +1951,17 @@ static bool vduse_validate_config(struct vduse_dev_config *config,
> >                          sizeof(config->reserved)))
> >                 return false;
> >
> > -       if (api_version < VDUSE_API_VERSION_1 && config->ngroups)
> > +       if (api_version < VDUSE_API_VERSION_1 &&
> > +           (config->ngroups || config->nas))
> >                 return false;
> >
> > -       if (api_version >= VDUSE_API_VERSION_1 &&
> > -           (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS))
> > -               return false;
> > +       if (api_version >= VDUSE_API_VERSION_1) {
> > +               if (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS)
> > +                       return false;
> > +
> > +               if (!config->nas || config->nas > VDUSE_DEV_MAX_AS)
> > +                       return false;
> > +       }
> >
> >         if (config->vq_align > PAGE_SIZE)
> >                 return false;
> > @@ -1924,7 +2026,8 @@ static ssize_t bounce_size_store(struct device *device,
> >
> >         ret = -EPERM;
> >         mutex_lock(&dev->domain_lock);
> > -       if (dev->domain)
> > +       /* Assuming that if the first domain is allocated, all are allocated */
> > +       if (dev->as[0].domain)
> >                 goto unlock;
>
> Not for this patch but I don't understand why we need to check dev->domain here.
>

I guess you need to know the bounce size before you allocate the
domain. Adding shrink logic for it doesn't seem worth it, but maybe
I'm missing something.

> >
> >         ret = kstrtouint(buf, 10, &bounce_size);
> > @@ -1983,8 +2086,17 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> >                               GFP_KERNEL);
> >         if (!dev->groups)
> >                 goto err_vq_groups;
> > -       for (u32 i = 0; i < dev->ngroups; ++i)
> > +       for (u32 i = 0; i < dev->ngroups; ++i) {
> >                 dev->groups[i].dev = dev;
> > +               rwlock_init(&dev->groups[i].as_lock);
> > +       }
> > +
> > +       dev->nas = (dev->api_version < VDUSE_API_VERSION_1) ? 1 : config->nas;
> > +       dev->as = kcalloc(dev->nas, sizeof(dev->as[0]), GFP_KERNEL);
> > +       if (!dev->as)
> > +               goto err_as;
> > +       for (int i = 0; i < dev->nas; i++)
> > +               mutex_init(&dev->as[i].mem_lock);
> >
> >         dev->name = kstrdup(config->name, GFP_KERNEL);
> >         if (!dev->name)
> > @@ -2022,6 +2134,8 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> >  err_idr:
> >         kfree(dev->name);
> >  err_str:
> > +       kfree(dev->as);
> > +err_as:
> >         kfree(dev->groups);
> >  err_vq_groups:
> >         vduse_dev_destroy(dev);
> > @@ -2147,7 +2261,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> >
> >         vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> >                                  &vduse_vdpa_config_ops, &vduse_map_ops,
> > -                                dev->ngroups, 1, name, true);
> > +                                dev->ngroups, dev->nas, name, true);
> >         if (IS_ERR(vdev))
> >                 return PTR_ERR(vdev);
> >
> > @@ -2162,7 +2276,8 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> >                         const struct vdpa_dev_set_config *config)
> >  {
> >         struct vduse_dev *dev;
> > -       int ret;
> > +       size_t domain_bounce_size;
> > +       int ret, i;
> >
> >         mutex_lock(&vduse_lock);
> >         dev = vduse_find_dev(name);
> > @@ -2176,29 +2291,38 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> >                 return ret;
> >
> >         mutex_lock(&dev->domain_lock);
> > -       if (!dev->domain)
> > -               dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > -                                                 dev->bounce_size);
> > -       mutex_unlock(&dev->domain_lock);
> > -       if (!dev->domain) {
> > -               ret = -ENOMEM;
> > -               goto domain_err;
> > +       ret = 0;
> > +
> > +       domain_bounce_size = dev->bounce_size / dev->nas;
> > +       for (i = 0; i < dev->nas; ++i) {
> > +               dev->as[i].domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > +                                                       domain_bounce_size);
> > +               if (!dev->as[i].domain) {
> > +                       ret = -ENOMEM;
> > +                       goto err;
> > +               }
> >         }
> >
> > +       mutex_unlock(&dev->domain_lock);
> > +
> >         ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> > -       if (ret) {
> > -               goto register_err;
> > -       }
> > +       if (ret)
> > +               goto err_register;
> >
> >         return 0;
> >
> > -register_err:
> > +err_register:
> >         mutex_lock(&dev->domain_lock);
> > -       vduse_domain_destroy(dev->domain);
> > -       dev->domain = NULL;
> > +
> > +err:
> > +       for (int j = 0; j < i; j++) {
> > +               if (dev->as[j].domain) {
> > +                       vduse_domain_destroy(dev->as[j].domain);
> > +                       dev->as[j].domain = NULL;
> > +               }
> > +       }
> >         mutex_unlock(&dev->domain_lock);
> >
> > -domain_err:
> >         put_device(&dev->vdev->vdpa.dev);
> >
> >         return ret;
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index a3d51cf6df3a..da2c5e47990e 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -47,7 +47,8 @@ struct vduse_dev_config {
> >         __u32 vq_num;
> >         __u32 vq_align;
> >         __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > -       __u32 reserved[12];
> > +       __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > +       __u32 reserved[11];
> >         __u32 config_size;
> >         __u8 config[];
> >  };
> > @@ -82,6 +83,18 @@ struct vduse_iotlb_entry {
> >         __u8 perm;
> >  };
> >
> > +/**
> > + * struct vduse_iotlb_entry_v2 - entry of IOTLB to describe one IOVA region in an ASID
> > + * @v1: the original vduse_iotlb_entry
> > + * @asid: address space ID of the IOVA region
> > + *
> > + * Structure used by VDUSE_IOTLB_GET_FD ioctl to find an overlapped IOVA region.
> > + */
> > +struct vduse_iotlb_entry_v2 {
> > +       struct vduse_iotlb_entry v1;
> > +       __u32 asid;
> > +};
> > +
> >  /*
> >   * Find the first IOVA region that overlaps with the range [start, last]
> >   * and return the corresponding file descriptor. Return -EINVAL means the
> > @@ -166,6 +179,16 @@ struct vduse_vq_state_packed {
> >         __u16 last_used_idx;
> >  };
> >
> > +/**
> > + * struct vduse_vq_group_asid - virtqueue group ASID
> > + * @group: Index of the virtqueue group
> > + * @asid: Address space ID of the group
> > + */
> > +struct vduse_vq_group_asid {
> > +       __u32 group;
> > +       __u32 asid;
> > +};
> > +
> >  /**
> >   * struct vduse_vq_info - information of a virtqueue
> >   * @index: virtqueue index
> > @@ -225,6 +248,7 @@ struct vduse_vq_eventfd {
> >   * @uaddr: start address of userspace memory, it must be aligned to page size
> >   * @iova: start of the IOVA region
> >   * @size: size of the IOVA region
> > + * @asid: Address space ID of the IOVA region
> >   * @reserved: for future use, needs to be initialized to zero
> >   *
> >   * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > @@ -234,7 +258,8 @@ struct vduse_iova_umem {
> >         __u64 uaddr;
> >         __u64 iova;
> >         __u64 size;
> > -       __u64 reserved[3];
> > +       __u32 asid;
> > +       __u32 reserved[5];
> >  };
> >
> >  /* Register userspace memory for IOVA regions */
> > @@ -248,6 +273,7 @@ struct vduse_iova_umem {
> >   * @start: start of the IOVA region
> >   * @last: last of the IOVA region
> >   * @capability: capability of the IOVA region
> > + * @asid: Address space ID of the IOVA region, only if device API version >= 1
> >   * @reserved: for future use, needs to be initialized to zero
> >   *
> >   * Structure used by VDUSE_IOTLB_GET_INFO ioctl to get information of
> > @@ -258,7 +284,8 @@ struct vduse_iova_info {
> >         __u64 last;
> >  #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> >         __u64 capability;
> > -       __u64 reserved[3];
> > +       __u32 asid; /* Only if device API version >= 1 */
> > +       __u32 reserved[5];
> >  };
> >
> >  /*
> > @@ -280,6 +307,7 @@ enum vduse_req_type {
> >         VDUSE_GET_VQ_STATE,
> >         VDUSE_SET_STATUS,
> >         VDUSE_UPDATE_IOTLB,
> > +       VDUSE_SET_VQ_GROUP_ASID,
> >  };
> >
> >  /**
> > @@ -314,6 +342,18 @@ struct vduse_iova_range {
> >         __u64 last;
> >  };
> >
> > +/**
> > + * struct vduse_iova_range_v2 - IOVA range [start, last] if API_VERSION >= 1
> > + * @start: start of the IOVA range
> > + * @last: last of the IOVA range
> > + * @asid: address space ID of the IOVA range
> > + */
> > +struct vduse_iova_range_v2 {
> > +       __u64 start;
> > +       __u64 last;
> > +       __u32 asid;
> > +};
> > +
> >  /**
> >   * struct vduse_dev_request - control request
> >   * @type: request type
> > @@ -322,6 +362,8 @@ struct vduse_iova_range {
> >   * @vq_state: virtqueue state, only index field is available
> >   * @s: device status
> >   * @iova: IOVA range for updating
> > + * @iova_v2: IOVA range for updating if API_VERSION >= 1
> > + * @vq_group_asid: ASID of a virtqueue group
> >   * @padding: padding
> >   *
> >   * Structure used by read(2) on /dev/vduse/$NAME.
> > @@ -334,6 +376,11 @@ struct vduse_dev_request {
> >                 struct vduse_vq_state vq_state;
> >                 struct vduse_dev_status s;
> >                 struct vduse_iova_range iova;
> > +               /* All the following members except padding exist only
> > +                * if vduse api version >= 1
> > +                */
> > +               struct vduse_iova_range_v2 iova_v2;
> > +               struct vduse_vq_group_asid vq_group_asid;
> >                 __u32 padding[32];
> >         };
> >  };
> > --
> > 2.52.0
> >
>
> Thanks
>



* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-18 13:10     ` Eugenio Perez Martin
@ 2025-12-23  1:11       ` Jason Wang
  2025-12-23 13:15         ` Eugenio Perez Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2025-12-23  1:11 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > group.  This enables mapping each group into a distinct memory space.
> > >
> > > The vq group to ASID association is now protected by an rwlock.  But the
> > > mutex domain_lock keeps protecting the domains of all ASIDs, as some
> > > operations, like the ones related to the bounce buffer size, still
> > > require locking all the ASIDs.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >
> > > ---
> > > Future improvements on top could include performance optimizations,
> > > like moving to per-ASID locks, or hardening, like tracking the ASID or
> > > ASID hashes in unused bits of the DMA address.
> > >
> > > Tested virtio_vdpa by manually adding two threads in vduse_set_status:
> > > one of them modifies the vq group 0 ASID and the other one maps and
> > > unmaps memory continuously.  After a while, the two threads stop and the
> > > usual work continues.
> > >
> > > Tested with vhost_vdpa by migrating a VM while running ping on OVS+VDUSE.
> > > A few workarounds were needed in some parts:
> > > * Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
> > >   the enable message to the userland device.  This will be solved in the
> > >   future.
> > > * Share the suspended state between all vhost devices in QEMU:
> > >   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> > > * Implement a fake VDUSE suspend vdpa operation callback that always
> > >   returns true in the kernel.  DPDK suspends the device at the first
> > >   GET_VRING_BASE.
> > > * Remove the CVQ blocker in ASID.
> > >
> > > ---
> > > v10:
> > > * Back to rwlock version so stronger locks are used.
> > > * Take out allocations from rwlock.
> > > * Forbid changing ASID of a vq group after DRIVER_OK (Jason)
> > > * Remove bad fetching again of domain variable in
> > >   vduse_dev_max_mapping_size (Yongji).
> > > * Remove unused vdev definition in vdpa map_ops callbacks (kernel test
> > >   robot).
> > >
> > > v9:
> > > * Replace mutex with rwlock, as the vdpa map_ops can run from atomic
> > >   context.
> > >
> > > v8:
> > > * Revert the mutex to rwlock change, it needs proper profiling to
> > >   justify it.
> > >
> > > v7:
> > > * Take write lock in the error path (Jason).
> > >
> > > v6:
> > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> > >   (MST).
> > > * Fix struct name not matching in the doc.
> > >
> > > v5:
> > > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> > >   ioctl (Jason).
> > > * Properly set domain bounce size to divide equally between nas (Jason).
> > > * Exclude "padding" member from the only >V1 members in
> > >   vduse_dev_request.
> > >
> > > v4:
> > > * Divide each domain bounce size between the device bounce size (Jason).
> > > * revert unneeded addr = NULL assignment (Jason)
> > > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> > >   return; } (Jason)
> > > * Change a bad multiline comment, using @ character instead of * (Jason).
> > > * Consider config->nas == 0 as a fail (Jason).
> > >
> > > v3:
> > > * Get the vduse domain through the vduse_as in the map functions
> > >   (Jason).
> > > * Squash with the patch creating the vduse_as struct (Jason).
> > > * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
> > >   (Jason)
> > >
> > > v2:
> > > * Convert the use of mutex to rwlock.
> > >
> > > RFC v3:
> > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> > >   value to reduce memory consumption, but vqs are already limited to
> > >   that value and userspace VDUSE is able to allocate that many vqs.
> > > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> > >   VDUSE_IOTLB_GET_INFO.
> > > * Use of array_index_nospec in VDUSE device ioctls.
> > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > * Move the umem mutex to asid struct so there is no contention between
> > >   ASIDs.
> > >
> > > RFC v2:
> > > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > >   part of the struct is the same.
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
> > >  include/uapi/linux/vduse.h         |  53 ++++-
> > >  2 files changed, 295 insertions(+), 124 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index 767abcb7e375..786ab2378825 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -41,6 +41,7 @@
> > >
> > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > +#define VDUSE_DEV_MAX_AS 0xffff
> > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > >         struct mm_struct *mm;
> > >  };
> > >
> > > +struct vduse_as {
> > > +       struct vduse_iova_domain *domain;
> > > +       struct vduse_umem *umem;
> > > +       struct mutex mem_lock;
> >
> > Not related to this patch. But if I'm not wrong, we have a 1:1 mapping
> > between domain and as. If this is true, can we use bounce_lock instead
> > of a new mem_lock? I see mem_lock is only used for synchronizing
> > umem reg/dereg, which is already synchronized by the domain rwlock.
> >
>
> I think you're right, but they work at different levels at the moment.
> The mem_lock is at the vduse_dev level and also protects the umem
> pointer, while bounce_lock lives in iova_domain.c.
>
> Maybe the right thing to do is to move umem into iova_domain. Yongji
> Xie, what do you think?
>
> > > +};
> > > +
> > >  struct vduse_vq_group {
> > > +       rwlock_t as_lock;
> > > +       struct vduse_as *as; /* Protected by as_lock */
> > >         struct vduse_dev *dev;
> > >  };
> > >
> > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > >         struct vduse_vdpa *vdev;
> > >         struct device *dev;
> > >         struct vduse_virtqueue **vqs;
> > > -       struct vduse_iova_domain *domain;
> > > +       struct vduse_as *as;
> > >         char *name;
> > >         struct mutex lock;
> > >         spinlock_t msg_lock;
> > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > >         u32 vq_num;
> > >         u32 vq_align;
> > >         u32 ngroups;
> > > -       struct vduse_umem *umem;
> > > +       u32 nas;
> > >         struct vduse_vq_group *groups;
> > > -       struct mutex mem_lock;
> > >         unsigned int bounce_size;
> > >         struct mutex domain_lock;
> > >  };
> > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > >         return vduse_dev_msg_sync(dev, &msg);
> > >  }
> > >
> > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > >                                   u64 start, u64 last)
> > >  {
> > >         struct vduse_dev_msg msg = { 0 };
> > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > >                 return -EINVAL;
> > >
> > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > -       msg.req.iova.start = start;
> > > -       msg.req.iova.last = last;
> > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > +               msg.req.iova.start = start;
> > > +               msg.req.iova.last = last;
> > > +       } else {
> > > +               msg.req.iova_v2.start = start;
> > > +               msg.req.iova_v2.last = last;
> > > +               msg.req.iova_v2.asid = asid;
> > > +       }
> > >
> > >         return vduse_dev_msg_sync(dev, &msg);
> > >  }
> > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > >         return mask;
> > >  }
> > >
> > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > +                                      unsigned int group, unsigned int asid)
> > > +{
> > > +       write_lock(&dev->groups[group].as_lock);
> > > +       dev->groups[group].as = &dev->as[asid];
> > > +       write_unlock(&dev->groups[group].as_lock);
> > > +}
> > > +
> > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > >  {
> > >         int i;
> > > -       struct vduse_iova_domain *domain = dev->domain;
> > >
> > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > -       if (domain && domain->bounce_map)
> > > -               vduse_domain_reset_bounce_map(domain);
> > > +       for (i = 0; i < dev->nas; i++) {
> > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > +
> > > +               if (domain && domain->bounce_map)
> > > +                       vduse_domain_reset_bounce_map(domain);
> >
> > Btw, I see this:
> >
> > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > {
> >         if (!domain->bounce_map)
> >                 return;
> >
> >         spin_lock(&domain->iotlb_lock);
> >         if (!domain->bounce_map)
> >                 goto unlock;
> >
> >
> > The bounce_map is checked twice; let's fix that.
> >
>
> Double-checked locking to avoid taking the lock?

I don't know, but I think we don't care too much about the performance
of vduse_domain_reset_bounce_map().
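
To illustrate the pattern, here is a minimal userspace analogue of the
double-checked test (a hypothetical sketch with a pthread mutex standing
in for iotlb_lock, not the kernel code itself):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t iotlb_lock = PTHREAD_MUTEX_INITIALIZER;
static bool bounce_map = true;

static void reset_bounce_map(void)
{
	if (!bounce_map)		/* racy fast path, skips the lock */
		return;

	pthread_mutex_lock(&iotlb_lock);
	if (bounce_map) {		/* re-checked under the lock */
		/* ... tear down the bounce mappings ... */
		bounce_map = false;
	}
	pthread_mutex_unlock(&iotlb_lock);
}

Dropping the first test only costs an uncontended lock acquisition, which
should not matter outside the hot path.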

> I don't think it is
> worth keeping, as it is not in the hot path anyway. But that
> would also be another patch, independent of this series, wouldn't it?

Yes, it's another independent issue I just found when reviewing this patch.

>
> > > +       }
> > > +
> > > +       for (i = 0; i < dev->ngroups; i++)
> > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> >
> > Note that this function still does:
> >
> >                 vq->vq_group = 0;
> >
> > Which is wrong.
> >
>
> Right, removing it for the next version. Thanks for the catch!
>
> > >
> > >         down_write(&dev->rwsem);
> > >
> > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > >         return ret;
> > >  }
> > >
> > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > +                               unsigned int asid)
> > > +{
> > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > +       struct vduse_dev_msg msg = { 0 };
> > > +       int r;
> > > +
> > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > +               return -EINVAL;
> >
> > If we forbid setting group asid for !DRIVER_OK, why do we still need a
> > rwlock?
>
> virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> vrings in the bounce buffer, for example. If you're ok with that, I'm
> ok with removing the lock, as all the calls are issued by the driver
> setup process anyway. Or should we just keep it for alloc?

I see, then I think we need to keep that. The reason is that there's
no guarantee that the alloc() must be called before DRIVER_OK.
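
To spell out what the rwlock pairs up, condensed from the hunks quoted
above (field names as in the diff, nothing new):

	/* reader side: any map op, can run before DRIVER_OK */
	read_lock(&group->as_lock);
	domain = group->as->domain;	/* stable while the read lock is held */
	/* ... use domain ... */
	read_unlock(&group->as_lock);

	/* writer side: vduse_set_group_asid(), rejected after DRIVER_OK */
	write_lock(&group->as_lock);
	group->as = &dev->as[asid];
	write_unlock(&group->as_lock);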

>
> Anyway, I think I misunderstood your comment from [1] then.
>
> > All we need to do is to synchronize set_group_asid() with
> > set_status()/reset()?
> >
>
> That's also a good one. There is no synchronization if one thread calls
> reset and then the device is set up from another thread. As all this
> situation is still hypothetical because virtio_vdpa does not support
> set_group_asid,

Right.

> and the vhost one is already protected by the vhost lock, do
> we need it?

Let's add a TODO in the code.
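
Something along these lines in vduse_set_group_asid() should be enough:

	/*
	 * TODO: set_group_asid() is not synchronized with
	 * set_status()/reset(). Today this is fine: virtio_vdpa does
	 * not call set_group_asid(), and vhost_vdpa serializes both
	 * behind the vhost lock.
	 */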

>
> > Or if you want to synchronize map ops with set_status() that looks
> > like an independent thing (hardening).
> >
> > > +
> > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > +       msg.req.vq_group_asid.group = group;
> > > +       msg.req.vq_group_asid.asid = asid;
> > > +
> > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > +       if (r < 0)
> > > +               return r;
> > > +
> > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> >
> > I'm not sure this has been discussed before, but I think it would be
> > better to introduce a new ioctl to get the group -> as mapping. This helps
> > to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > require the userspace to poll the vduse fd before DRIVER_OK.
> >
>
> I'm fine with that, but how do we communicate that they have changed?

Since we forbid changing the group->as mapping after DRIVER_OK,
userspace just needs to use that ioctl once after DRIVER_OK.
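
From the device side that could look roughly like this (both
VDUSE_VQ_GET_GROUP_ASID and setup_group_mapping() are hypothetical names,
used here only to sketch the flow):

	/* after the VDUSE_SET_STATUS request that sets DRIVER_OK, the
	 * group -> asid mapping is frozen and can be fetched once */
	for (uint32_t g = 0; g < ngroups; g++) {
		struct vduse_vq_group_asid map = { .group = g };

		if (ioctl(vduse_fd, VDUSE_VQ_GET_GROUP_ASID, &map) == 0)
			setup_group_mapping(map.group, map.asid);
	}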

> Or how do we communicate to the driver that the device does not accept
> the assignment of the ASID to the group?

See above.

>
> > > +       return 0;
> > > +}
> > > +
> > >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > >                                 struct vdpa_vq_state *state)
> > >  {
> > > @@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > >         int ret;
> > >
> > > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > > +       ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> > >         if (ret)
> > >                 return ret;
> > >
> > > -       ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > +       ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> > >         if (ret) {
> > > -               vduse_domain_clear_map(dev->domain, iotlb);
> > > +               vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> > >                 return ret;
> > >         }
> > >
> > > @@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> > >         .reset                  = vduse_vdpa_reset,
> > >         .set_map                = vduse_vdpa_set_map,
> > > +       .set_group_asid         = vduse_set_group_asid,
> > >         .get_vq_map             = vduse_get_vq_map,
> > >         .free                   = vduse_vdpa_free,
> > >  };
> > > @@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> > >                                              dma_addr_t dma_addr, size_t size,
> > >                                              enum dma_data_direction dir)
> > >  {
> > > -       struct vduse_dev *vdev;
> > >         struct vduse_iova_domain *domain;
> > >
> > >         if (!token.group)
> > >                 return;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > -
> > > +       read_lock(&token.group->as_lock);
> >
> > I think we could optimize the lock here. E.g. when nas is 1, we don't
> > need any lock, in fact.
> >
>
> Good point! Not taking the lock in that case for the next version, thanks!
>
> > > +       domain = token.group->as->domain;
> > >         vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > +       read_unlock(&token.group->as_lock);
> > >  }
> > >
> > >  static void vduse_dev_sync_single_for_cpu(union virtio_map token,
> > >                                              dma_addr_t dma_addr, size_t size,
> > >                                              enum dma_data_direction dir)
> > >  {
> > > -       struct vduse_dev *vdev;
> > >         struct vduse_iova_domain *domain;
> > >
> > >         if (!token.group)
> > >                 return;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > -
> > > +       read_lock(&token.group->as_lock);
> > > +       domain = token.group->as->domain;
> > >         vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > +       read_unlock(&token.group->as_lock);
> > >  }
> > >
> > >  static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > @@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > >                                      enum dma_data_direction dir,
> > >                                      unsigned long attrs)
> > >  {
> > > -       struct vduse_dev *vdev;
> > >         struct vduse_iova_domain *domain;
> > > +       dma_addr_t r;
> > >
> > >         if (!token.group)
> > >                 return DMA_MAPPING_ERROR;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > +       read_lock(&token.group->as_lock);
> > > +       domain = token.group->as->domain;
> > > +       r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > +       read_unlock(&token.group->as_lock);
> > >
> > > -       return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > +       return r;
> > >  }
> > >
> > >  static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
> > >                                  size_t size, enum dma_data_direction dir,
> > >                                  unsigned long attrs)
> > >  {
> > > -       struct vduse_dev *vdev;
> > >         struct vduse_iova_domain *domain;
> > >
> > >         if (!token.group)
> > >                 return;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > -
> > > -       return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > +       read_lock(&token.group->as_lock);
> > > +       domain = token.group->as->domain;
> > > +       vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > +       read_unlock(&token.group->as_lock);
> > >  }
> > >
> > >  static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > >                                       dma_addr_t *dma_addr, gfp_t flag)
> > >  {
> > > -       struct vduse_dev *vdev;
> > >         struct vduse_iova_domain *domain;
> > >         unsigned long iova;
> > >         void *addr;
> > > @@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > >         if (!addr)
> > >                 return NULL;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > +       *dma_addr = (dma_addr_t)iova;
> >
> > Any reason we need to touch *dma_addr here? It might trigger UBSAN/KMSAN.
> >
>
> No, this is a leftover. I'm fixing it for the next version. Thanks!
>
> > > +       read_lock(&token.group->as_lock);
> > > +       domain = token.group->as->domain;
> > >         addr = vduse_domain_alloc_coherent(domain, size,
> > >                                            (dma_addr_t *)&iova, addr);
> > >         if (!addr)
> > >                 goto err;
> > >
> > >         *dma_addr = (dma_addr_t)iova;
> > > +       read_unlock(&token.group->as_lock);
> > >
> > >         return addr;
> > >
> > >  err:
> > > +       read_unlock(&token.group->as_lock);
> > >         free_pages_exact(addr, size);
> > >         return NULL;
> > >  }
> > > @@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> > >                                     void *vaddr, dma_addr_t dma_addr,
> > >                                     unsigned long attrs)
> > >  {
> > > -       struct vduse_dev *vdev;
> > >         struct vduse_iova_domain *domain;
> > >
> > >         if (!token.group)
> > >                 return;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > -
> > > +       read_lock(&token.group->as_lock);
> > > +       domain = token.group->as->domain;
> > >         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> > > +       read_unlock(&token.group->as_lock);
> > >         free_pages_exact(vaddr, size);
> > >  }
> > >
> > >  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
> > >  {
> > > -       struct vduse_dev *vdev;
> > > -       struct vduse_iova_domain *domain;
> > > +       size_t bounce_size;
> > >
> > >         if (!token.group)
> > >                 return false;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > +       read_lock(&token.group->as_lock);
> > > +       bounce_size = token.group->as->domain->bounce_size;
> > > +       read_unlock(&token.group->as_lock);
> > >
> > > -       return dma_addr < domain->bounce_size;
> > > +       return dma_addr < bounce_size;
> > >  }
> > >
> > >  static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > @@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > >
> > >  static size_t vduse_dev_max_mapping_size(union virtio_map token)
> > >  {
> > > -       struct vduse_dev *vdev;
> > > -       struct vduse_iova_domain *domain;
> > > +       size_t bounce_size;
> > >
> > >         if (!token.group)
> > >                 return 0;
> > >
> > > -       vdev = token.group->dev;
> > > -       domain = vdev->domain;
> > > +       read_lock(&token.group->as_lock);
> > > +       bounce_size = token.group->as->domain->bounce_size;
> > > +       read_unlock(&token.group->as_lock);
> > >
> > > -       return domain->bounce_size;
> > > +       return bounce_size;
> > >  }
> > >
> > >  static const struct virtio_map_ops vduse_map_ops = {
> > > @@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > >         return ret;
> > >  }
> > >
> > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > >                                 u64 iova, u64 size)
> > >  {
> > >         int ret;
> > >
> > > -       mutex_lock(&dev->mem_lock);
> > > +       mutex_lock(&dev->as[asid].mem_lock);
> > >         ret = -ENOENT;
> > > -       if (!dev->umem)
> > > +       if (!dev->as[asid].umem)
> > >                 goto unlock;
> > >
> > >         ret = -EINVAL;
> > > -       if (!dev->domain)
> > > +       if (!dev->as[asid].domain)
> > >                 goto unlock;
> > >
> > > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > +       if (dev->as[asid].umem->iova != iova ||
> > > +           size != dev->as[asid].domain->bounce_size)
> > >                 goto unlock;
> > >
> > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > > -                                   dev->umem->npages, true);
> > > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > -       mmdrop(dev->umem->mm);
> > > -       vfree(dev->umem->pages);
> > > -       kfree(dev->umem);
> > > -       dev->umem = NULL;
> > > +       vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
> > > +       unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
> > > +                                   dev->as[asid].umem->npages, true);
> > > +       atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
> > > +       mmdrop(dev->as[asid].umem->mm);
> > > +       vfree(dev->as[asid].umem->pages);
> > > +       kfree(dev->as[asid].umem);
> > > +       dev->as[asid].umem = NULL;
> > >         ret = 0;
> > >  unlock:
> > > -       mutex_unlock(&dev->mem_lock);
> > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > >         return ret;
> > >  }
> > >
> > >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > -                             u64 iova, u64 uaddr, u64 size)
> > > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> > >  {
> > >         struct page **page_list = NULL;
> > >         struct vduse_umem *umem = NULL;
> > > @@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >         unsigned long npages, lock_limit;
> > >         int ret;
> > >
> > > -       if (!dev->domain || !dev->domain->bounce_map ||
> > > -           size != dev->domain->bounce_size ||
> > > +       if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
> > > +           size != dev->as[asid].domain->bounce_size ||
> > >             iova != 0 || uaddr & ~PAGE_MASK)
> > >                 return -EINVAL;
> > >
> > > -       mutex_lock(&dev->mem_lock);
> > > +       mutex_lock(&dev->as[asid].mem_lock);
> > >         ret = -EEXIST;
> > > -       if (dev->umem)
> > > +       if (dev->as[asid].umem)
> > >                 goto unlock;
> > >
> > >         ret = -ENOMEM;
> > > @@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >                 goto out;
> > >         }
> > >
> > > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > +       ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
> > >                                                  page_list, pinned);
> > >         if (ret)
> > >                 goto out;
> > > @@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >         umem->mm = current->mm;
> > >         mmgrab(current->mm);
> > >
> > > -       dev->umem = umem;
> > > +       dev->as[asid].umem = umem;
> > >  out:
> > >         if (ret && pinned > 0)
> > >                 unpin_user_pages(page_list, pinned);
> > > @@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > >                 vfree(page_list);
> > >                 kfree(umem);
> > >         }
> > > -       mutex_unlock(&dev->mem_lock);
> > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > >         return ret;
> > >  }
> > >
> > > @@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >
> > >         switch (cmd) {
> > >         case VDUSE_IOTLB_GET_FD: {
> > > -               struct vduse_iotlb_entry entry;
> > > +               struct vduse_iotlb_entry_v2 entry;
> >
> > Nit: if we stick with entry and do copy_from_user() twice, it might save
> > lots of unnecessary changes.
> >
>
> I'm happy to move to something else, but most of the changes happen
> because of s/entry/entry.v1/. If we stick with just vduse_iotlb_entry
> and a separate asid variable, we also need to duplicate the
> copy_from_user [2].
>
> > >                 struct vhost_iotlb_map *map;
> > >                 struct vdpa_map_file *map_file;
> > >                 struct file *f = NULL;
> > > +               u32 asid;
> > >
> > >                 ret = -EFAULT;
> > > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > > -                       break;
> > > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > > +                               break;
> > > +               } else {
> > > +                       entry.asid = 0;
> > > +                       if (copy_from_user(&entry.v1, argp,
> > > +                                          sizeof(entry.v1)))
> > > +                               break;
> > > +               }
> > >
> > >                 ret = -EINVAL;
> > > -               if (entry.start > entry.last)
> > > +               if (entry.v1.start > entry.v1.last)
> > > +                       break;
> > > +
> > > +               if (entry.asid >= dev->nas)
> > >                         break;
> > >
> > >                 mutex_lock(&dev->domain_lock);
> > > -               if (!dev->domain) {
> > > +               asid = array_index_nospec(entry.asid, dev->nas);
> > > +               if (!dev->as[asid].domain) {
> > >                         mutex_unlock(&dev->domain_lock);
> > >                         break;
> > >                 }
> > > -               spin_lock(&dev->domain->iotlb_lock);
> > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > -                                             entry.start, entry.last);
> > > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > > +                                             entry.v1.start, entry.v1.last);
> > >                 if (map) {
> > >                         map_file = (struct vdpa_map_file *)map->opaque;
> > >                         f = get_file(map_file->file);
> > > -                       entry.offset = map_file->offset;
> > > -                       entry.start = map->start;
> > > -                       entry.last = map->last;
> > > -                       entry.perm = map->perm;
> > > +                       entry.v1.offset = map_file->offset;
> > > +                       entry.v1.start = map->start;
> > > +                       entry.v1.last = map->last;
> > > +                       entry.v1.perm = map->perm;
> > >                 }
> > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 ret = -EINVAL;
> > >                 if (!f)
> > >                         break;
> > >
> > > -               ret = -EFAULT;
> > > -               if (copy_to_user(argp, &entry, sizeof(entry))) {
> > > +               if (dev->api_version >= VDUSE_API_VERSION_1)
> > > +                       ret = copy_to_user(argp, &entry,
> > > +                                          sizeof(entry));
> > > +               else
> > > +                       ret = copy_to_user(argp, &entry.v1,
> > > +                                          sizeof(entry.v1));
> > > +
> > > +               if (ret) {
> > > +                       ret = -EFAULT;
> > >                         fput(f);
> > >                         break;
> > >                 }
> > > -               ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> > > +               ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
> > >                 fput(f);
> > >                 break;
> > >         }
> > > @@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >         }
> > >         case VDUSE_IOTLB_REG_UMEM: {
> > >                 struct vduse_iova_umem umem;
> > > +               u32 asid;
> > >
> > >                 ret = -EFAULT;
> > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > @@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >
> > >                 ret = -EINVAL;
> > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > -                                sizeof(umem.reserved)))
> > > +                                sizeof(umem.reserved)) ||
> > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > +                    umem.asid != 0) || umem.asid >= dev->nas)
> > >                         break;
> > >
> > >                 mutex_lock(&dev->domain_lock);
> > > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > +               ret = vduse_dev_reg_umem(dev, asid, umem.iova,
> > >                                          umem.uaddr, umem.size);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 break;
> > >         }
> > >         case VDUSE_IOTLB_DEREG_UMEM: {
> > >                 struct vduse_iova_umem umem;
> > > +               u32 asid;
> > >
> > >                 ret = -EFAULT;
> > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > @@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >
> > >                 ret = -EINVAL;
> > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > -                                sizeof(umem.reserved)))
> > > +                                sizeof(umem.reserved)) ||
> > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > +                    umem.asid != 0) ||
> > > +                    umem.asid >= dev->nas)
> > >                         break;
> > > +
> > >                 mutex_lock(&dev->domain_lock);
> > > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > +               ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
> > >                                            umem.size);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 break;
> > > @@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >         case VDUSE_IOTLB_GET_INFO: {
> >
> > Btw I see this:
> >
> >                 dev->vqs[index]->vq_group = config.group;
> >
> > In VDUSE_VQ_SETUP:
> >
> > I wonder what's the reason that it is not part of CREATE_DEV? I
> > mean, it might be racy if DMA happens between CREATE_DEV and
> > VDUSE_VQ_SETUP.
> >
>
> The reason the vq index -> vq group association cannot be part of
> device creation is that we need CVQ to be isolated for live migration,
> but the device doesn't know the CVQ index at CREATE_DEV time, only
> after the feature negotiation happens [3].

Exactly, the cvq index is changed.
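
For virtio-net the dependency is easy to see; a kernel-style sketch of
what the spec implies (the control vq is always the last queue):

	static u16 cvq_index(u64 features, u16 max_virtqueue_pairs)
	{
		if (features & BIT_ULL(VIRTIO_NET_F_MQ))
			return max_virtqueue_pairs * 2;

		return 2;	/* one pair: rx = 0, tx = 1, cvq = 2 */
	}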

>
> [1] https://lore.kernel.org/lkml/CACGkMEvRQ86dYeY3Enqoj1vkSpefU3roq4XGS+y5B5kmsXEkYg@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CACGkMEtszQeZLTegxEbjODYxu-giTvURu=pKj4kYTHQYoKOzkQ@mail.gmail.com
> [3] https://lore.kernel.org/lkml/CAJaqyWcvHx7kwcTceN2jazT0nKNo1r5zdzqWHqpxdna-kCS1RA@mail.gmail.com

I see, but a question: what happens if there's a DMA between
CREATE_DEV and VDUSE_VQ_SETUP? If we can find ways to forbid this (or
it has been forbidden), we are probably fine.

>
> > >                 struct vduse_iova_info info;
> > >                 struct vhost_iotlb_map *map;
> > > +               u32 asid;
> > >
> > >                 ret = -EFAULT;
> > >                 if (copy_from_user(&info, argp, sizeof(info)))
> > > @@ -1494,23 +1579,31 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                                  sizeof(info.reserved)))
> > >                         break;
> > >
> > > +               if (dev->api_version < VDUSE_API_VERSION_1) {
> > > +                       if (info.asid)
> > > +                               break;
> > > +               } else if (info.asid >= dev->nas)
> > > +                       break;
> > > +
> > >                 mutex_lock(&dev->domain_lock);
> > > -               if (!dev->domain) {
> > > +               asid = array_index_nospec(info.asid, dev->nas);
> > > +               if (!dev->as[asid].domain) {
> > >                         mutex_unlock(&dev->domain_lock);
> > >                         break;
> > >                 }
> > > -               spin_lock(&dev->domain->iotlb_lock);
> > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > >                                               info.start, info.last);
> > >                 if (map) {
> > >                         info.start = map->start;
> > >                         info.last = map->last;
> > >                         info.capability = 0;
> > > -                       if (dev->domain->bounce_map && map->start == 0 &&
> > > -                           map->last == dev->domain->bounce_size - 1)
> > > +                       if (dev->as[asid].domain->bounce_map &&
> > > +                           map->start == 0 &&
> > > +                           map->last == dev->as[asid].domain->bounce_size - 1)
> > >                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
> > >                 }
> > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> > >                 mutex_unlock(&dev->domain_lock);
> > >                 if (!map)
> > >                         break;
> > > @@ -1535,8 +1628,10 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > >         struct vduse_dev *dev = file->private_data;
> > >
> > >         mutex_lock(&dev->domain_lock);
> > > -       if (dev->domain)
> > > -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > > +       for (int i = 0; i < dev->nas; i++)
> > > +               if (dev->as[i].domain)
> > > +                       vduse_dev_dereg_umem(dev, i, 0,
> > > +                                            dev->as[i].domain->bounce_size);
> > >         mutex_unlock(&dev->domain_lock);
> > >         spin_lock(&dev->msg_lock);
> > >         /* Make sure the inflight messages can processed after reconncection */
> > > @@ -1755,7 +1850,6 @@ static struct vduse_dev *vduse_dev_create(void)
> > >                 return NULL;
> > >
> > >         mutex_init(&dev->lock);
> > > -       mutex_init(&dev->mem_lock);
> > >         mutex_init(&dev->domain_lock);
> > >         spin_lock_init(&dev->msg_lock);
> > >         INIT_LIST_HEAD(&dev->send_list);
> > > @@ -1806,8 +1900,11 @@ static int vduse_destroy_dev(char *name)
> > >         idr_remove(&vduse_idr, dev->minor);
> > >         kvfree(dev->config);
> > >         vduse_dev_deinit_vqs(dev);
> > > -       if (dev->domain)
> > > -               vduse_domain_destroy(dev->domain);
> > > +       for (int i = 0; i < dev->nas; i++) {
> > > +               if (dev->as[i].domain)
> > > +                       vduse_domain_destroy(dev->as[i].domain);
> > > +       }
> > > +       kfree(dev->as);
> > >         kfree(dev->name);
> > >         kfree(dev->groups);
> > >         vduse_dev_destroy(dev);
> > > @@ -1854,12 +1951,17 @@ static bool vduse_validate_config(struct vduse_dev_config *config,
> > >                          sizeof(config->reserved)))
> > >                 return false;
> > >
> > > -       if (api_version < VDUSE_API_VERSION_1 && config->ngroups)
> > > +       if (api_version < VDUSE_API_VERSION_1 &&
> > > +           (config->ngroups || config->nas))
> > >                 return false;
> > >
> > > -       if (api_version >= VDUSE_API_VERSION_1 &&
> > > -           (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS))
> > > -               return false;
> > > +       if (api_version >= VDUSE_API_VERSION_1) {
> > > +               if (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS)
> > > +                       return false;
> > > +
> > > +               if (!config->nas || config->nas > VDUSE_DEV_MAX_AS)
> > > +                       return false;
> > > +       }
> > >
> > >         if (config->vq_align > PAGE_SIZE)
> > >                 return false;
> > > @@ -1924,7 +2026,8 @@ static ssize_t bounce_size_store(struct device *device,
> > >
> > >         ret = -EPERM;
> > >         mutex_lock(&dev->domain_lock);
> > > -       if (dev->domain)
> > > +       /* Assuming that if the first domain is allocated, all are allocated */
> > > +       if (dev->as[0].domain)
> > >                 goto unlock;
> >
> > Not for this patch but I don't understand why we need to check dev->domain here.
> >
>
> I guess you need to know the bounce size before you allocate the
> domain. Making shrink logic for it doesn't seem worth it, but
> maybe I'm missing something.

Right.

Thanks



* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-23  1:11       ` Jason Wang
@ 2025-12-23 13:15         ` Eugenio Perez Martin
  2025-12-24  0:20           ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Eugenio Perez Martin @ 2025-12-23 13:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > [...]
> > > >
> > > >         down_write(&dev->rwsem);
> > > >
> > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > >         return ret;
> > > >  }
> > > >
> > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > +                               unsigned int asid)
> > > > +{
> > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > +       struct vduse_dev_msg msg = { 0 };
> > > > +       int r;
> > > > +
> > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > +               return -EINVAL;
> > >
> > > If we forbid setting group asid for !DRIVER_OK, why do we still need a
> > > rwlock?
> >
> > virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> > vrings in the bounce buffer, for example. If you're ok with that I'm
> > ok with removing the lock, as all the calls are issued by the driver
> > setup process anyway. Or just to keep it for alloc?
>
> I see, then I think we need to keep that. The reason is that there's
> no guarantee that the alloc() must be called before DRIVER_OK.
>
> >
> > Anyway, I think I misunderstood your comment from [1] then.
> >
> > > All we need to do is to synchronize set_group_asid() with
> > > set_status()/reset()?
> > >
> >
> > That's also a good one. There is no synchronization if one thread call
> > reset and then the device is set up from another thread. As all this
> > situation is still hypothetical because virtio_vdpa does not support
> > set_group_asid,
>
> Right.
>
> > and vhost one is already protected by vhost lock, do
> > we need it?
>
> Let's add a TODO in the code.
>

Can you expand on this? I meant that there will be no synchronization
if we remove the rwlock (or similar). If we keep the rwlock we don't
need to add any TODO, or am I missing something?

> >
> > > Or if you want to synchronize map ops with set_status() that looks
> > > like an independent thing (hardening).
> > >
> > > > +
> > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > +       msg.req.vq_group_asid.group = group;
> > > > +       msg.req.vq_group_asid.asid = asid;
> > > > +
> > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > +       if (r < 0)
> > > > +               return r;
> > > > +
> > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > >
> > > I'm not sure this has been discussed before, but I think it would be
> > > better to introduce a new ioctl to get group -> as mapping. This helps
> > > to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > > require the userspace to poll vduse fd before DRIVER_OK.
> > >

The userspace VDUSE device must poll the vduse fd to get DRIVER_OK
anyway, so it cannot avoid polling the vduse device. What are the reasons
to avoid vduse_dev_msg_sync?
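
For reference, the device already has to run a loop like the following to
receive DRIVER_OK at all (a rough sketch of the request/response flow on
/dev/vduse/$NAME, error handling omitted):

	struct pollfd pfd = { .fd = vduse_fd, .events = POLLIN };

	while (poll(&pfd, 1, -1) > 0) {
		struct vduse_dev_request req;
		struct vduse_dev_response resp = { };

		if (read(vduse_fd, &req, sizeof(req)) != sizeof(req))
			break;

		resp.request_id = req.request_id;
		resp.result = VDUSE_REQ_RESULT_OK;
		/* handle VDUSE_SET_STATUS, VDUSE_UPDATE_IOTLB, ... */
		write(vduse_fd, &resp, sizeof(resp));
	}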

> >
> > I'm fine with that, but how do we communicate that they have changed?
>
> Since we forbid changing the group->as mapping after DRIVER_OK,
> userspace just needs to use that ioctl once after DRIVER_OK.
>

But the userland VDUSE device needs to know when the ASIDs have been set
by the driver and will not change anymore. In this series it is solved by
the order of the messages, but now we would need a way to know that point
in time. One idea is to issue this new ioctl when the device receives the
VDUSE_SET_STATUS message.

If we do it that way there is still a window where the hypothetical
(malicious) virtio_vdpa driver can read and write vq groups from
different threads. It could issue a set_group_asid after the VDUSE
device returns from the ioctl but before dev->status has been
updated.

I don't think we should protect that, but if we want to do it we
should protect that part either by acquiring the rwlock and trusting the
vduse_dev_msg_sync timeout, or by proceeding with atomics, smp_store_release
/ smp_load_acquire, read and write barriers...
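
I.e., the lockless variant would be something like (sketch):

	/* writer, vduse_set_group_asid(): publish the new mapping */
	smp_store_release(&dev->groups[group].as, &dev->as[asid]);

	/* reader, any map op: pairs with the release above */
	struct vduse_as *as = smp_load_acquire(&token.group->as);
	struct vduse_iova_domain *domain = as->domain;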

Note that I still think this is overthinking. We have the same
problem with driver features, where changing bits like the packed vq bit
changes the behavior of the vDPA callbacks and could
desynchronize the vduse kernel module and the userland device. But since
the virtio_vdpa and vduse kernel modules run in the same kernel, they
should be able to trust each other.

> > Or how to communicate to the driver that the device does not accept
> > the assignment of the ASID to the group?
>
> See above.
>

I didn't find the answer :(.

Let me give an example with QEMU and vhost_vdpa (see the sketch after
this list):
- QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and asid.
- vduse cannot send this information to the device, as it must wait for
the device to call the new ioctl. It returns success to the QEMU ioctl.
- Now the vduse userland device doesn't accept the vq group ASID, so
it returns an error through that ioctl. I'm not sure how; should it
just reset the whole device? NEED_RESET?
- If it does not reset the whole device, how do we return that single
set vq group asid error to QEMU?
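
For completeness, a rough sketch of the QEMU side of the first step
(VHOST_VDPA_SET_GROUP_ASID takes a struct vhost_vring_state with
index = group and num = asid; set_group_asid() is just an illustrative
wrapper):

	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	static int set_group_asid(int vhost_vdpa_fd, unsigned int group,
				  unsigned int asid)
	{
		struct vhost_vring_state state = {
			.index = group,
			.num = asid,
		};

		/* under the proposed scheme this returns before the
		 * userland device has seen the change, so a later
		 * rejection has no way back to this call site */
		return ioctl(vhost_vdpa_fd, VHOST_VDPA_SET_GROUP_ASID, &state);
	}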

> >
> > > > +       return 0;
> > > > +}
> > > > +
> > > >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > >                                 struct vdpa_vq_state *state)
> > > >  {
> > > > @@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > >         int ret;
> > > >
> > > > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > > > +       ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> > > >         if (ret)
> > > >                 return ret;
> > > >
> > > > -       ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > > +       ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> > > >         if (ret) {
> > > > -               vduse_domain_clear_map(dev->domain, iotlb);
> > > > +               vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> > > >                 return ret;
> > > >         }
> > > >
> > > > @@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> > > >         .reset                  = vduse_vdpa_reset,
> > > >         .set_map                = vduse_vdpa_set_map,
> > > > +       .set_group_asid         = vduse_set_group_asid,
> > > >         .get_vq_map             = vduse_get_vq_map,
> > > >         .free                   = vduse_vdpa_free,
> > > >  };
> > > > @@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> > > >                                              dma_addr_t dma_addr, size_t size,
> > > >                                              enum dma_data_direction dir)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > >         struct vduse_iova_domain *domain;
> > > >
> > > >         if (!token.group)
> > > >                 return;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > -
> > > > +       read_lock(&token.group->as_lock);
> > >
> > > I think we could optimize the lock here. E.g when nas is 1, we don't
> > > need any lock in fact.
> > >
> >
> > Good point! Not taking the lock in that case for the next version, thanks!
> >
> > > > +       domain = token.group->as->domain;
> > > >         vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > > +       read_unlock(&token.group->as_lock);
> > > >  }
> > > >
> > > >  static void vduse_dev_sync_single_for_cpu(union virtio_map token,
> > > >                                              dma_addr_t dma_addr, size_t size,
> > > >                                              enum dma_data_direction dir)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > >         struct vduse_iova_domain *domain;
> > > >
> > > >         if (!token.group)
> > > >                 return;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > -
> > > > +       read_lock(&token.group->as_lock);
> > > > +       domain = token.group->as->domain;
> > > >         vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > > +       read_unlock(&token.group->as_lock);
> > > >  }
> > > >
> > > >  static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > > @@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > >                                      enum dma_data_direction dir,
> > > >                                      unsigned long attrs)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > >         struct vduse_iova_domain *domain;
> > > > +       dma_addr_t r;
> > > >
> > > >         if (!token.group)
> > > >                 return DMA_MAPPING_ERROR;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > +       read_lock(&token.group->as_lock);
> > > > +       domain = token.group->as->domain;
> > > > +       r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > > +       read_unlock(&token.group->as_lock);
> > > >
> > > > -       return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > > +       return r;
> > > >  }
> > > >
> > > >  static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
> > > >                                  size_t size, enum dma_data_direction dir,
> > > >                                  unsigned long attrs)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > >         struct vduse_iova_domain *domain;
> > > >
> > > >         if (!token.group)
> > > >                 return;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > -
> > > > -       return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > > +       read_lock(&token.group->as_lock);
> > > > +       domain = token.group->as->domain;
> > > > +       vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > > +       read_unlock(&token.group->as_lock);
> > > >  }
> > > >
> > > >  static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > > >                                       dma_addr_t *dma_addr, gfp_t flag)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > >         struct vduse_iova_domain *domain;
> > > >         unsigned long iova;
> > > >         void *addr;
> > > > @@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > > >         if (!addr)
> > > >                 return NULL;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > +       *dma_addr = (dma_addr_t)iova;
> > >
> > > Any reason we need to touch *dma_addr here? It might trigger UBSAN/KMSAN.
> > >
> >
> > No, this is a leftover. I'm fixing it for the next version. Thanks!
> >
> > > > +       read_lock(&token.group->as_lock);
> > > > +       domain = token.group->as->domain;
> > > >         addr = vduse_domain_alloc_coherent(domain, size,
> > > >                                            (dma_addr_t *)&iova, addr);
> > > >         if (!addr)
> > > >                 goto err;
> > > >
> > > >         *dma_addr = (dma_addr_t)iova;
> > > > +       read_unlock(&token.group->as_lock);
> > > >
> > > >         return addr;
> > > >
> > > >  err:
> > > > +       read_unlock(&token.group->as_lock);
> > > >         free_pages_exact(addr, size);
> > > >         return NULL;
> > > >  }
> > > > @@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> > > >                                     void *vaddr, dma_addr_t dma_addr,
> > > >                                     unsigned long attrs)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > >         struct vduse_iova_domain *domain;
> > > >
> > > >         if (!token.group)
> > > >                 return;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > -
> > > > +       read_lock(&token.group->as_lock);
> > > > +       domain = token.group->as->domain;
> > > >         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> > > > +       read_unlock(&token.group->as_lock);
> > > >         free_pages_exact(vaddr, size);
> > > >  }
> > > >
> > > >  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > > -       struct vduse_iova_domain *domain;
> > > > +       size_t bounce_size;
> > > >
> > > >         if (!token.group)
> > > >                 return false;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > +       read_lock(&token.group->as_lock);
> > > > +       bounce_size = token.group->as->domain->bounce_size;
> > > > +       read_unlock(&token.group->as_lock);
> > > >
> > > > -       return dma_addr < domain->bounce_size;
> > > > +       return dma_addr < bounce_size;
> > > >  }
> > > >
> > > >  static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > > @@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > >
> > > >  static size_t vduse_dev_max_mapping_size(union virtio_map token)
> > > >  {
> > > > -       struct vduse_dev *vdev;
> > > > -       struct vduse_iova_domain *domain;
> > > > +       size_t bounce_size;
> > > >
> > > >         if (!token.group)
> > > >                 return 0;
> > > >
> > > > -       vdev = token.group->dev;
> > > > -       domain = vdev->domain;
> > > > +       read_lock(&token.group->as_lock);
> > > > +       bounce_size = token.group->as->domain->bounce_size;
> > > > +       read_unlock(&token.group->as_lock);
> > > >
> > > > -       return domain->bounce_size;
> > > > +       return bounce_size;
> > > >  }
> > > >
> > > >  static const struct virtio_map_ops vduse_map_ops = {
> > > > @@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > >         return ret;
> > > >  }
> > > >
> > > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > > >                                 u64 iova, u64 size)
> > > >  {
> > > >         int ret;
> > > >
> > > > -       mutex_lock(&dev->mem_lock);
> > > > +       mutex_lock(&dev->as[asid].mem_lock);
> > > >         ret = -ENOENT;
> > > > -       if (!dev->umem)
> > > > +       if (!dev->as[asid].umem)
> > > >                 goto unlock;
> > > >
> > > >         ret = -EINVAL;
> > > > -       if (!dev->domain)
> > > > +       if (!dev->as[asid].domain)
> > > >                 goto unlock;
> > > >
> > > > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > +       if (dev->as[asid].umem->iova != iova ||
> > > > +           size != dev->as[asid].domain->bounce_size)
> > > >                 goto unlock;
> > > >
> > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > -                                   dev->umem->npages, true);
> > > > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > > -       mmdrop(dev->umem->mm);
> > > > -       vfree(dev->umem->pages);
> > > > -       kfree(dev->umem);
> > > > -       dev->umem = NULL;
> > > > +       vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
> > > > +       unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
> > > > +                                   dev->as[asid].umem->npages, true);
> > > > +       atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
> > > > +       mmdrop(dev->as[asid].umem->mm);
> > > > +       vfree(dev->as[asid].umem->pages);
> > > > +       kfree(dev->as[asid].umem);
> > > > +       dev->as[asid].umem = NULL;
> > > >         ret = 0;
> > > >  unlock:
> > > > -       mutex_unlock(&dev->mem_lock);
> > > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > > >         return ret;
> > > >  }
> > > >
> > > >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > -                             u64 iova, u64 uaddr, u64 size)
> > > > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> > > >  {
> > > >         struct page **page_list = NULL;
> > > >         struct vduse_umem *umem = NULL;
> > > > @@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >         unsigned long npages, lock_limit;
> > > >         int ret;
> > > >
> > > > -       if (!dev->domain || !dev->domain->bounce_map ||
> > > > -           size != dev->domain->bounce_size ||
> > > > +       if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
> > > > +           size != dev->as[asid].domain->bounce_size ||
> > > >             iova != 0 || uaddr & ~PAGE_MASK)
> > > >                 return -EINVAL;
> > > >
> > > > -       mutex_lock(&dev->mem_lock);
> > > > +       mutex_lock(&dev->as[asid].mem_lock);
> > > >         ret = -EEXIST;
> > > > -       if (dev->umem)
> > > > +       if (dev->as[asid].umem)
> > > >                 goto unlock;
> > > >
> > > >         ret = -ENOMEM;
> > > > @@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >                 goto out;
> > > >         }
> > > >
> > > > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > > +       ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
> > > >                                                  page_list, pinned);
> > > >         if (ret)
> > > >                 goto out;
> > > > @@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >         umem->mm = current->mm;
> > > >         mmgrab(current->mm);
> > > >
> > > > -       dev->umem = umem;
> > > > +       dev->as[asid].umem = umem;
> > > >  out:
> > > >         if (ret && pinned > 0)
> > > >                 unpin_user_pages(page_list, pinned);
> > > > @@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > >                 vfree(page_list);
> > > >                 kfree(umem);
> > > >         }
> > > > -       mutex_unlock(&dev->mem_lock);
> > > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > > >         return ret;
> > > >  }
> > > >
> > > > @@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >
> > > >         switch (cmd) {
> > > >         case VDUSE_IOTLB_GET_FD: {
> > > > -               struct vduse_iotlb_entry entry;
> > > > +               struct vduse_iotlb_entry_v2 entry;
> > >
> > > Nit: if we stick with entry and do copy_from_user() twice it might save
> > > lots of unnecessary changes.
> > >
> >
> > I'm happy to move to something else but most of the changes happen
> > because of s/entry/entry.v1/ . If we stick with just vduse_iotlb_entry
> > and a separate asid variable, we also need to duplicate the
> > copy_from_user [2].
> >
> > > >                 struct vhost_iotlb_map *map;
> > > >                 struct vdpa_map_file *map_file;
> > > >                 struct file *f = NULL;
> > > > +               u32 asid;
> > > >
> > > >                 ret = -EFAULT;
> > > > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > -                       break;
> > > > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > +                               break;
> > > > +               } else {
> > > > +                       entry.asid = 0;
> > > > +                       if (copy_from_user(&entry.v1, argp,
> > > > +                                          sizeof(entry.v1)))
> > > > +                               break;
> > > > +               }
> > > >
> > > >                 ret = -EINVAL;
> > > > -               if (entry.start > entry.last)
> > > > +               if (entry.v1.start > entry.v1.last)
> > > > +                       break;
> > > > +
> > > > +               if (entry.asid >= dev->nas)
> > > >                         break;
> > > >
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               if (!dev->domain) {
> > > > +               asid = array_index_nospec(entry.asid, dev->nas);
> > > > +               if (!dev->as[asid].domain) {
> > > >                         mutex_unlock(&dev->domain_lock);
> > > >                         break;
> > > >                 }
> > > > -               spin_lock(&dev->domain->iotlb_lock);
> > > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > > -                                             entry.start, entry.last);
> > > > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > > > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > > > +                                             entry.v1.start, entry.v1.last);
> > > >                 if (map) {
> > > >                         map_file = (struct vdpa_map_file *)map->opaque;
> > > >                         f = get_file(map_file->file);
> > > > -                       entry.offset = map_file->offset;
> > > > -                       entry.start = map->start;
> > > > -                       entry.last = map->last;
> > > > -                       entry.perm = map->perm;
> > > > +                       entry.v1.offset = map_file->offset;
> > > > +                       entry.v1.start = map->start;
> > > > +                       entry.v1.last = map->last;
> > > > +                       entry.v1.perm = map->perm;
> > > >                 }
> > > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 ret = -EINVAL;
> > > >                 if (!f)
> > > >                         break;
> > > >
> > > > -               ret = -EFAULT;
> > > > -               if (copy_to_user(argp, &entry, sizeof(entry))) {
> > > > +               if (dev->api_version >= VDUSE_API_VERSION_1)
> > > > +                       ret = copy_to_user(argp, &entry,
> > > > +                                          sizeof(entry));
> > > > +               else
> > > > +                       ret = copy_to_user(argp, &entry.v1,
> > > > +                                          sizeof(entry.v1));
> > > > +
> > > > +               if (ret) {
> > > > +                       ret = -EFAULT;
> > > >                         fput(f);
> > > >                         break;
> > > >                 }
> > > > -               ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> > > > +               ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
> > > >                 fput(f);
> > > >                 break;
> > > >         }
> > > > @@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >         }
> > > >         case VDUSE_IOTLB_REG_UMEM: {
> > > >                 struct vduse_iova_umem umem;
> > > > +               u32 asid;
> > > >
> > > >                 ret = -EFAULT;
> > > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > > @@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >
> > > >                 ret = -EINVAL;
> > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > -                                sizeof(umem.reserved)))
> > > > +                                sizeof(umem.reserved)) ||
> > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > +                    umem.asid != 0) || umem.asid >= dev->nas)
> > > >                         break;
> > > >
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > > +               ret = vduse_dev_reg_umem(dev, asid, umem.iova,
> > > >                                          umem.uaddr, umem.size);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 break;
> > > >         }
> > > >         case VDUSE_IOTLB_DEREG_UMEM: {
> > > >                 struct vduse_iova_umem umem;
> > > > +               u32 asid;
> > > >
> > > >                 ret = -EFAULT;
> > > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > > @@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >
> > > >                 ret = -EINVAL;
> > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > -                                sizeof(umem.reserved)))
> > > > +                                sizeof(umem.reserved)) ||
> > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > +                    umem.asid != 0) ||
> > > > +                    umem.asid >= dev->nas)
> > > >                         break;
> > > > +
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > > +               ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
> > > >                                            umem.size);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 break;
> > > > @@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >         case VDUSE_IOTLB_GET_INFO: {
> > >
> > > Btw I see this:
> > >
> > >                 dev->vqs[index]->vq_group = config.group;
> > >
> > > In VDUSE_VQ_SETUP:
> > >
> > > I wonder what's the reason that it is not a part of CREATE_DEV? I
> > > meant it might be racy if DMA happens between CREATE_DEV and
> > > VDUSE_VQ_SETUP.
> > >
> >
> > The reason the vq index -> vq group association cannot be part of
> > device creation is that we need CVQ to be isolated for live migration,
> > but the device doesn't know the CVQ index at the CREATE_DEV time, only
> > after the feature negotiation happens [3].
>
> Exactly, the cvq index is changed.
>
> >
> > [1] https://lore.kernel.org/lkml/CACGkMEvRQ86dYeY3Enqoj1vkSpefU3roq4XGS+y5B5kmsXEkYg@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CACGkMEtszQeZLTegxEbjODYxu-giTvURu=pKj4kYTHQYoKOzkQ@mail.gmail.com
> > [3] https://lore.kernel.org/lkml/CAJaqyWcvHx7kwcTceN2jazT0nKNo1r5zdzqWHqpxdna-kCS1RA@mail.gmail.com
>
> I see, but a question. What happens if there's a DMA between
> CREATE_DEV and VDUSE_VQ_SETUP. If we can find ways to forbid this (or
> it has been forbidden), we are probably fine.
>

It's already forbidden by vdpa_dev_add:

dev = vduse_find_dev(name);
if (!dev || !vduse_dev_is_ready(dev)) {
        mutex_unlock(&vduse_lock);
        return -EINVAL;
}

where vduse_dev_is_ready():
static bool vduse_dev_is_ready(struct vduse_dev *dev)
{
        int i;

        for (i = 0; i < dev->vq_num; i++)
                if (!dev->vqs[i]->num_max)
                        return false;

        return true;
}

Since we set the vq group with the same ioctl as vq->num_max, we are
safe here. I didn't catch it until now, so thanks for proposing to
move the vq group parameter to that ioctl back then! :).


> >
> > > >                 struct vduse_iova_info info;
> > > >                 struct vhost_iotlb_map *map;
> > > > +               u32 asid;
> > > >
> > > >                 ret = -EFAULT;
> > > >                 if (copy_from_user(&info, argp, sizeof(info)))
> > > > @@ -1494,23 +1579,31 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >                                  sizeof(info.reserved)))
> > > >                         break;
> > > >
> > > > +               if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > +                       if (info.asid)
> > > > +                               break;
> > > > +               } else if (info.asid >= dev->nas)
> > > > +                       break;
> > > > +
> > > >                 mutex_lock(&dev->domain_lock);
> > > > -               if (!dev->domain) {
> > > > +               asid = array_index_nospec(info.asid, dev->nas);
> > > > +               if (!dev->as[asid].domain) {
> > > >                         mutex_unlock(&dev->domain_lock);
> > > >                         break;
> > > >                 }
> > > > -               spin_lock(&dev->domain->iotlb_lock);
> > > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > > > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > > >                                               info.start, info.last);
> > > >                 if (map) {
> > > >                         info.start = map->start;
> > > >                         info.last = map->last;
> > > >                         info.capability = 0;
> > > > -                       if (dev->domain->bounce_map && map->start == 0 &&
> > > > -                           map->last == dev->domain->bounce_size - 1)
> > > > +                       if (dev->as[asid].domain->bounce_map &&
> > > > +                           map->start == 0 &&
> > > > +                           map->last == dev->as[asid].domain->bounce_size - 1)
> > > >                                 info.capability |= VDUSE_IOVA_CAP_UMEM;
> > > >                 }
> > > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> > > >                 mutex_unlock(&dev->domain_lock);
> > > >                 if (!map)
> > > >                         break;
> > > > @@ -1535,8 +1628,10 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > > >         struct vduse_dev *dev = file->private_data;
> > > >
> > > >         mutex_lock(&dev->domain_lock);
> > > > -       if (dev->domain)
> > > > -               vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > > > +       for (int i = 0; i < dev->nas; i++)
> > > > +               if (dev->as[i].domain)
> > > > +                       vduse_dev_dereg_umem(dev, i, 0,
> > > > +                                            dev->as[i].domain->bounce_size);
> > > >         mutex_unlock(&dev->domain_lock);
> > > >         spin_lock(&dev->msg_lock);
> > > >         /* Make sure the inflight messages can processed after reconncection */
> > > > @@ -1755,7 +1850,6 @@ static struct vduse_dev *vduse_dev_create(void)
> > > >                 return NULL;
> > > >
> > > >         mutex_init(&dev->lock);
> > > > -       mutex_init(&dev->mem_lock);
> > > >         mutex_init(&dev->domain_lock);
> > > >         spin_lock_init(&dev->msg_lock);
> > > >         INIT_LIST_HEAD(&dev->send_list);
> > > > @@ -1806,8 +1900,11 @@ static int vduse_destroy_dev(char *name)
> > > >         idr_remove(&vduse_idr, dev->minor);
> > > >         kvfree(dev->config);
> > > >         vduse_dev_deinit_vqs(dev);
> > > > -       if (dev->domain)
> > > > -               vduse_domain_destroy(dev->domain);
> > > > +       for (int i = 0; i < dev->nas; i++) {
> > > > +               if (dev->as[i].domain)
> > > > +                       vduse_domain_destroy(dev->as[i].domain);
> > > > +       }
> > > > +       kfree(dev->as);
> > > >         kfree(dev->name);
> > > >         kfree(dev->groups);
> > > >         vduse_dev_destroy(dev);
> > > > @@ -1854,12 +1951,17 @@ static bool vduse_validate_config(struct vduse_dev_config *config,
> > > >                          sizeof(config->reserved)))
> > > >                 return false;
> > > >
> > > > -       if (api_version < VDUSE_API_VERSION_1 && config->ngroups)
> > > > +       if (api_version < VDUSE_API_VERSION_1 &&
> > > > +           (config->ngroups || config->nas))
> > > >                 return false;
> > > >
> > > > -       if (api_version >= VDUSE_API_VERSION_1 &&
> > > > -           (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS))
> > > > -               return false;
> > > > +       if (api_version >= VDUSE_API_VERSION_1) {
> > > > +               if (!config->ngroups || config->ngroups > VDUSE_DEV_MAX_GROUPS)
> > > > +                       return false;
> > > > +
> > > > +               if (!config->nas || config->nas > VDUSE_DEV_MAX_AS)
> > > > +                       return false;
> > > > +       }
> > > >
> > > >         if (config->vq_align > PAGE_SIZE)
> > > >                 return false;
> > > > @@ -1924,7 +2026,8 @@ static ssize_t bounce_size_store(struct device *device,
> > > >
> > > >         ret = -EPERM;
> > > >         mutex_lock(&dev->domain_lock);
> > > > -       if (dev->domain)
> > > > +       /* Assuming that if the first domain is allocated, all are allocated */
> > > > +       if (dev->as[0].domain)
> > > >                 goto unlock;
> > >
> > > Not for this patch but I don't understand why we need to check dev->domain here.
> > >
> >
> > I guess you need to know the bounce size before you allocate the
> > domain. To make shrink logic for it seems not to be worth it, but
> > maybe I'm missing something.
>
> Right.
>
> Thanks
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-23 13:15         ` Eugenio Perez Martin
@ 2025-12-24  0:20           ` Jason Wang
  2025-12-24  7:38             ` Eugenio Perez Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2025-12-24  0:20 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Tue, Dec 23, 2025 at 9:16 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > > group.  This enables mapping each group into a distinct memory space.
> > > > >
> > > > > The vq group to ASID association is protected by a rwlock now.  But the
> > > > > mutex domain_lock keeps protecting the domains of all ASIDs, as some
> > > > > operations, like the ones related to the bounce buffer size, still
> > > > > require locking all the ASIDs.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > >
> > > > > ---
> > > > > Future improvements can include performance optimizations on top could
> > > > > move to per-ASID locks, or hardening by tracking ASID or ASID hashes on
> > > > > unused bits of the DMA address.
> > > > >
> > > > > Tested virtio_vdpa by manually adding two threads in vduse_set_status:
> > > > > one of them modifies the vq group 0 ASID and the other one maps and
> > > > > unmaps memory continuously.  After a while, the two threads stop and the
> > > > > usual work continues.
> > > > >
> > > > > Tested with vhost_vdpa by migrating a VM while ping is running on
> > > > > OVS+VDUSE.  A few workarounds were needed in some parts:
> > > > > * Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
> > > > >   the enable message to the userland device.  This will be solved in the
> > > > >   future.
> > > > > * Share the suspended state between all vhost devices in QEMU:
> > > > >   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> > > > > * Implement a fake VDUSE suspend vdpa operation callback that always
> > > > >   returns true in the kernel.  DPDK suspends the device at the first
> > > > >   GET_VRING_BASE.
> > > > > * Remove the CVQ blocker in ASID.
> > > > >
> > > > > ---
> > > > > v10:
> > > > > * Back to rwlock version so stronger locks are used.
> > > > > * Take out allocations from rwlock.
> > > > > * Forbid changing ASID of a vq group after DRIVER_OK (Jason)
> > > > > * Remove bad fetching again of domain variable in
> > > > >   vduse_dev_max_mapping_size (Yongji).
> > > > > * Remove unused vdev definition in vdpa map_ops callbacks (kernel test
> > > > >   robot).
> > > > >
> > > > > v9:
> > > > > * Replace mutex with rwlock, as the vdpa map_ops can run from atomic
> > > > >   context.
> > > > >
> > > > > v8:
> > > > > * Revert the mutex to rwlock change, it needs proper profiling to
> > > > >   justify it.
> > > > >
> > > > > v7:
> > > > > * Take write lock in the error path (Jason).
> > > > >
> > > > > v6:
> > > > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > > > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> > > > >   (MST).
> > > > > * Fix struct name not matching in the doc.
> > > > >
> > > > > v5:
> > > > > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> > > > >   ioctl (Jason).
> > > > > * Properly set domain bounce size to divide equally between nas (Jason).
> > > > > * Exclude "padding" member from the only >V1 members in
> > > > >   vduse_dev_request.
> > > > >
> > > > > v4:
> > > > > * Divide each domain bounce size between the device bounce size (Jason).
> > > > > * revert unneeded addr = NULL assignment (Jason)
> > > > > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> > > > >   return; } (Jason)
> > > > > * Change a bad multiline comment, using @ caracter instead of * (Jason).
> > > > > * Consider config->nas == 0 as a fail (Jason).
> > > > >
> > > > > v3:
> > > > > * Get the vduse domain through the vduse_as in the map functions
> > > > >   (Jason).
> > > > > * Squash with the patch creating the vduse_as struct (Jason).
> > > > > * Create VDUSE_DEV_MAX_AS instead of comparing agains a magic number
> > > > >   (Jason)
> > > > >
> > > > > v2:
> > > > > * Convert the use of mutex to rwlock.
> > > > >
> > > > > RFC v3:
> > > > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> > > > >   value to reduce memory consumption, but vqs are already limited to
> > > > >   that value and userspace VDUSE is able to allocate that many vqs.
> > > > > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> > > > >   VDUSE_IOTLB_GET_INFO.
> > > > > * Use of array_index_nospec in VDUSE device ioctls.
> > > > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > > > * Move the umem mutex to asid struct so there is no contention between
> > > > >   ASIDs.
> > > > >
> > > > > RFC v2:
> > > > > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > > >   part of the struct is the same.
> > > > > ---
> > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
> > > > >  include/uapi/linux/vduse.h         |  53 ++++-
> > > > >  2 files changed, 295 insertions(+), 124 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > index 767abcb7e375..786ab2378825 100644
> > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > @@ -41,6 +41,7 @@
> > > > >
> > > > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > > > >         struct mm_struct *mm;
> > > > >  };
> > > > >
> > > > > +struct vduse_as {
> > > > > +       struct vduse_iova_domain *domain;
> > > > > +       struct vduse_umem *umem;
> > > > > +       struct mutex mem_lock;
> > > >
> > > > Not related to this patch. But if I am not wrong we have a 1:1 mapping
> > > > between domain and as. If this is true, can we use bounce_lock instead
> > > > of a new mem_lock? Since I see mem_lock is only used for synchronizing
> > > > umem reg/dereg which has been synchronized with the domain rwlock.
> > > >
> > >
> > > I think you're right, but they work at different levels at the moment.
> > > The mem_lock is at the vduse_dev level and also protects the umem
> > > pointer, and bounce_lock is at iova_domain.c.
> > >
> > > Maybe the right thing to do is to move umem into iova_domain. Yongji
> > > Xie, what do you think?
> > >
> > > > > +};
> > > > > +
> > > > >  struct vduse_vq_group {
> > > > > +       rwlock_t as_lock;
> > > > > +       struct vduse_as *as; /* Protected by as_lock */
> > > > >         struct vduse_dev *dev;
> > > > >  };
> > > > >
> > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > >         struct vduse_vdpa *vdev;
> > > > >         struct device *dev;
> > > > >         struct vduse_virtqueue **vqs;
> > > > > -       struct vduse_iova_domain *domain;
> > > > > +       struct vduse_as *as;
> > > > >         char *name;
> > > > >         struct mutex lock;
> > > > >         spinlock_t msg_lock;
> > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > >         u32 vq_num;
> > > > >         u32 vq_align;
> > > > >         u32 ngroups;
> > > > > -       struct vduse_umem *umem;
> > > > > +       u32 nas;
> > > > >         struct vduse_vq_group *groups;
> > > > > -       struct mutex mem_lock;
> > > > >         unsigned int bounce_size;
> > > > >         struct mutex domain_lock;
> > > > >  };
> > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > >  }
> > > > >
> > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > >                                   u64 start, u64 last)
> > > > >  {
> > > > >         struct vduse_dev_msg msg = { 0 };
> > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > >                 return -EINVAL;
> > > > >
> > > > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > -       msg.req.iova.start = start;
> > > > > -       msg.req.iova.last = last;
> > > > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > +               msg.req.iova.start = start;
> > > > > +               msg.req.iova.last = last;
> > > > > +       } else {
> > > > > +               msg.req.iova_v2.start = start;
> > > > > +               msg.req.iova_v2.last = last;
> > > > > +               msg.req.iova_v2.asid = asid;
> > > > > +       }
> > > > >
> > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > >  }
> > > > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > >         return mask;
> > > > >  }
> > > > >
> > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > +                                      unsigned int group, unsigned int asid)
> > > > > +{
> > > > > +       write_lock(&dev->groups[group].as_lock);
> > > > > +       dev->groups[group].as = &dev->as[asid];
> > > > > +       write_unlock(&dev->groups[group].as_lock);
> > > > > +}
> > > > > +
> > > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > > >  {
> > > > >         int i;
> > > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > > >
> > > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > -       if (domain && domain->bounce_map)
> > > > > -               vduse_domain_reset_bounce_map(domain);
> > > > > +       for (i = 0; i < dev->nas; i++) {
> > > > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > +
> > > > > +               if (domain && domain->bounce_map)
> > > > > +                       vduse_domain_reset_bounce_map(domain);
> > > >
> > > > Btw, I see this:
> > > >
> > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > {
> > > >         if (!domain->bounce_map)
> > > >                 return;
> > > >
> > > >         spin_lock(&domain->iotlb_lock);
> > > >         if (!domain->bounce_map)
> > > >                 goto unlock;
> > > >
> > > >
> > > > The bounce_map is checked twice, let's fix that.
> > > >
> > >
> > > Double checked locking to avoid taking the lock?
> >
> > I don't know, but I think we don't care too much about the performance
> > of vduse_domain_reset_bounce_map().
> >
> > > I don't think it is
> > > worth it to keep it as it is not in the hot path anyway. But that
> > > would also be another patch independent of this series, isn't it?
> >
> > Yes, it's another independent issue I just found when reviewing this patch.
> >
> > >
> > > > > +       }
> > > > > +
> > > > > +       for (i = 0; i < dev->ngroups; i++)
> > > > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> > > >
> > > > Note that this function still does:
> > > >
> > > >                 vq->vq_group = 0;
> > > >
> > > > Which is wrong.
> > > >
> > >
> > > Right, removing it for the next version. Thanks for the catch!
> > >
> > > > >
> > > > >         down_write(&dev->rwsem);
> > > > >
> > > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > +                               unsigned int asid)
> > > > > +{
> > > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > +       struct vduse_dev_msg msg = { 0 };
> > > > > +       int r;
> > > > > +
> > > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > > +               return -EINVAL;
> > > >
> > > > If we forbid setting group asid for !DRIVER_OK, why do we still need a
> > > > rwlock?
> > >
> > > virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> > > vrings in the bounce buffer, for example. If you're ok with that I'm
> > > ok with removing the lock, as all the calls are issued by the driver
> > > setup process anyway. Or just to keep it for alloc?
> >
> > I see, then I think we need to keep that. The reason is that there's
> > no guarantee that the alloc() must be called before DRIVER_OK.
> >
> > >
> > > Anyway, I think I misunderstood your comment from [1] then.
> > >
> > > > All we need to do is to synchronize set_group_asid() with
> > > > set_status()/reset()?
> > > >
> > >
> > > That's also a good one. There is no synchronization if one thread calls
> > > reset and then the device is set up from another thread. As all this
> > > situation is still hypothetical because virtio_vdpa does not support
> > > set_group_asid,
> >
> > Right.
> >
> > > and vhost one is already protected by vhost lock, do
> > > we need it?
> >
> > Let's add a TODO in the code.
> >
>
> Can you expand on this? I meant that there will be no synchronization
> if we remove the rwlock (or similar). If we keep the rwlock we don't
> need to add any TODO, or am I missing something?

I meant that even with the rwlock we don't synchronize set_group_asid()
and set_status(). Since you said vhost has been synchronized, I think we
should either

1) document the synchronization that needs to be done in the upper layer or
2) add a todo to synchronize the set_status() and set_group_asid()
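
For 2), the sketch is just a comment at the top of this series'
vduse_set_group_asid() (wording is mine, of course):

        /*
         * TODO: synchronize with set_status()/reset().  The vhost_vdpa
         * path is serialized by the vhost lock, but a future virtio_vdpa
         * user of set_group_asid() would not be.
         */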

>
> > >
> > > > Or if you want to synchronize map ops with set_status() that looks
> > > > like an independent thing (hardening).
> > > >
> > > > > +
> > > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > +       msg.req.vq_group_asid.group = group;
> > > > > +       msg.req.vq_group_asid.asid = asid;
> > > > > +
> > > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > > +       if (r < 0)
> > > > > +               return r;
> > > > > +
> > > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > > >
> > > > I'm not sure this has been discussed before, but I think it would be
> > > > better to introduce a new ioctl to get group -> as mapping. This helps
> > > > to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > > > require the userspace to poll vduse fd before DRIVER_OK.
> > > >
>
> The userspace VDUSE device must poll the vduse fd to get the DRIVER_OK
> anyway, so it cannot avoid polling the vduse fd. What are the reasons to
> avoid vduse_dev_msg_sync?

One less synchronization point with userspace.

>
> > >
> > > I'm fine with that, but how do we communicate that they have changed?
> >
> > Since we forbid changing the group->as mapping after DRIVER_OK,
> > userspace just needs to use that ioctl once after DRIVER_OK.
> >
>
> But the userland VDUSE device needs to know when the ASIDs have been
> set by the driver and will not change anymore. In this series it is
> solved by the order of the messages, but now we would need a way to
> know that moment. One idea is to issue this new ioctl when the device
> receives the VDUSE_SET_STATUS msg.

Yes.

>
> If we do it that way there is still a window where the hypothetical
> (malicious) virtio_vdpa driver can read and write vq groups from
> different threads.

See above, we need to add synchronization in either vdpa or virtio_vdpa.

> It could issue a set_group_asid after the VDUSE
> device returns from the ioctl but before dev->status has been
> updated.

In the case of VDUSE I think neither side should trust the other.
So userspace needs to be prepared for this, and so does the driver.
Since the memory accesses are initiated from userspace, it should
not perform any memory access before the ioctl that fetches the
group->asid mapping.

>
> I don't think we should protect that, but if we want to do it we
> should protect that part either by acquiring the rwlock and trusting
> the vduse_dev_msg_sync timeout, or by using atomics, smp_store_release
> / smp_load_acquire, read and write barriers...
>
> Note that I still think this is overthinking. We have the same
> problems with driver features, where changing bits like the packed vq
> one changes the behavior of the vDPA callbacks and could desynchronize
> the vduse kernel side and the userland device.

Probably, but for driver features, it's too late to do the change.

> But since the
> virtio_vdpa and vduse kernel modules run in the same kernel, they should
> be able to trust each other.

Better not; usually the lower layer (vdpa/vduse) should not trust the
upper layer (virtio_vdpa) in this case.

But if you stick to the method with vduse_dev_msg_sync, I'm also fine with that.

>
> > > Or how to communicate to the driver that the device does not accept
> > > the assignment of the ASID to the group?
> >
> > See above.
> >
>
> I didn't find the answer :(.
>
> Let me give an example with QEMU and vhost_vdpa:
> - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and asid.
> - vduse cannot send this information to the device as it must wait for
> the ioctl.

Which ioctl did you mean in this case?

> It returns success to the QEMU ioctl.
> - Now the vduse userland device doesn't accept the vq group ASID, so
> it returns an error through that ioctl. I'm not sure how, should it
> just reset the whole device? NEED_RESET?

The vduse userland device needs to do the VDUSE_GET_GROUP_ASID ioctl
after DRIVER_OK; then everything is fine?
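
Something like the following, with the ioctl number being just a
placeholder and the struct being the vduse_vq_group_asid of this series:

#define VDUSE_GET_GROUP_ASID	_IOWR(VDUSE_BASE, 0x1f, struct vduse_vq_group_asid)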

> - If it does not reset the whole device, how to return that single set
> vq group asid error to QEMU?
>
> > >
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > >                                 struct vdpa_vq_state *state)
> > > > >  {
> > > > > @@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > > >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > >         int ret;
> > > > >
> > > > > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > > > > +       ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> > > > >         if (ret)
> > > > >                 return ret;
> > > > >
> > > > > -       ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > > > +       ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> > > > >         if (ret) {
> > > > > -               vduse_domain_clear_map(dev->domain, iotlb);
> > > > > +               vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> > > > >                 return ret;
> > > > >         }
> > > > >
> > > > > @@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > > >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> > > > >         .reset                  = vduse_vdpa_reset,
> > > > >         .set_map                = vduse_vdpa_set_map,
> > > > > +       .set_group_asid         = vduse_set_group_asid,
> > > > >         .get_vq_map             = vduse_get_vq_map,
> > > > >         .free                   = vduse_vdpa_free,
> > > > >  };
> > > > > @@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> > > > >                                              dma_addr_t dma_addr, size_t size,
> > > > >                                              enum dma_data_direction dir)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > >         struct vduse_iova_domain *domain;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > -
> > > > > +       read_lock(&token.group->as_lock);
> > > >
> > > > I think we could optimize the lock here. E.g when nas is 1, we don't
> > > > need any lock in fact.
> > > >
> > >
> > > Good point! Not taking the lock in that case for the next version, thanks!
> > >
> > > > > +       domain = token.group->as->domain;
> > > > >         vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >  }
> > > > >
> > > > >  static void vduse_dev_sync_single_for_cpu(union virtio_map token,
> > > > >                                              dma_addr_t dma_addr, size_t size,
> > > > >                                              enum dma_data_direction dir)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > >         struct vduse_iova_domain *domain;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > -
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       domain = token.group->as->domain;
> > > > >         vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >  }
> > > > >
> > > > >  static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > > > @@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > > >                                      enum dma_data_direction dir,
> > > > >                                      unsigned long attrs)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > >         struct vduse_iova_domain *domain;
> > > > > +       dma_addr_t r;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return DMA_MAPPING_ERROR;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       domain = token.group->as->domain;
> > > > > +       r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >
> > > > > -       return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > > > +       return r;
> > > > >  }
> > > > >
> > > > >  static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
> > > > >                                  size_t size, enum dma_data_direction dir,
> > > > >                                  unsigned long attrs)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > >         struct vduse_iova_domain *domain;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > -
> > > > > -       return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       domain = token.group->as->domain;
> > > > > +       vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >  }
> > > > >
> > > > >  static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > > > >                                       dma_addr_t *dma_addr, gfp_t flag)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > >         struct vduse_iova_domain *domain;
> > > > >         unsigned long iova;
> > > > >         void *addr;
> > > > > @@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > > > >         if (!addr)
> > > > >                 return NULL;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > +       *dma_addr = (dma_addr_t)iova;
> > > >
> > > > Any reason we need to touch *dma_addr here? It might trigger UBSAN/KMSAN.
> > > >
> > >
> > > No, this is a leftover. I'm fixing it for the next version. Thanks!
> > >
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       domain = token.group->as->domain;
> > > > >         addr = vduse_domain_alloc_coherent(domain, size,
> > > > >                                            (dma_addr_t *)&iova, addr);
> > > > >         if (!addr)
> > > > >                 goto err;
> > > > >
> > > > >         *dma_addr = (dma_addr_t)iova;
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >
> > > > >         return addr;
> > > > >
> > > > >  err:
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >         free_pages_exact(addr, size);
> > > > >         return NULL;
> > > > >  }
> > > > > @@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> > > > >                                     void *vaddr, dma_addr_t dma_addr,
> > > > >                                     unsigned long attrs)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > >         struct vduse_iova_domain *domain;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > -
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       domain = token.group->as->domain;
> > > > >         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >         free_pages_exact(vaddr, size);
> > > > >  }
> > > > >
> > > > >  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > > -       struct vduse_iova_domain *domain;
> > > > > +       size_t bounce_size;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return false;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       bounce_size = token.group->as->domain->bounce_size;
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >
> > > > > -       return dma_addr < domain->bounce_size;
> > > > > +       return dma_addr < bounce_size;
> > > > >  }
> > > > >
> > > > >  static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > > > @@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > > >
> > > > >  static size_t vduse_dev_max_mapping_size(union virtio_map token)
> > > > >  {
> > > > > -       struct vduse_dev *vdev;
> > > > > -       struct vduse_iova_domain *domain;
> > > > > +       size_t bounce_size;
> > > > >
> > > > >         if (!token.group)
> > > > >                 return 0;
> > > > >
> > > > > -       vdev = token.group->dev;
> > > > > -       domain = vdev->domain;
> > > > > +       read_lock(&token.group->as_lock);
> > > > > +       bounce_size = token.group->as->domain->bounce_size;
> > > > > +       read_unlock(&token.group->as_lock);
> > > > >
> > > > > -       return domain->bounce_size;
> > > > > +       return bounce_size;
> > > > >  }
> > > > >
> > > > >  static const struct virtio_map_ops vduse_map_ops = {
> > > > > @@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > > > >                                 u64 iova, u64 size)
> > > > >  {
> > > > >         int ret;
> > > > >
> > > > > -       mutex_lock(&dev->mem_lock);
> > > > > +       mutex_lock(&dev->as[asid].mem_lock);
> > > > >         ret = -ENOENT;
> > > > > -       if (!dev->umem)
> > > > > +       if (!dev->as[asid].umem)
> > > > >                 goto unlock;
> > > > >
> > > > >         ret = -EINVAL;
> > > > > -       if (!dev->domain)
> > > > > +       if (!dev->as[asid].domain)
> > > > >                 goto unlock;
> > > > >
> > > > > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > > +       if (dev->as[asid].umem->iova != iova ||
> > > > > +           size != dev->as[asid].domain->bounce_size)
> > > > >                 goto unlock;
> > > > >
> > > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > > -                                   dev->umem->npages, true);
> > > > > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > > > -       mmdrop(dev->umem->mm);
> > > > > -       vfree(dev->umem->pages);
> > > > > -       kfree(dev->umem);
> > > > > -       dev->umem = NULL;
> > > > > +       vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
> > > > > +       unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
> > > > > +                                   dev->as[asid].umem->npages, true);
> > > > > +       atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
> > > > > +       mmdrop(dev->as[asid].umem->mm);
> > > > > +       vfree(dev->as[asid].umem->pages);
> > > > > +       kfree(dev->as[asid].umem);
> > > > > +       dev->as[asid].umem = NULL;
> > > > >         ret = 0;
> > > > >  unlock:
> > > > > -       mutex_unlock(&dev->mem_lock);
> > > > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > > -                             u64 iova, u64 uaddr, u64 size)
> > > > > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> > > > >  {
> > > > >         struct page **page_list = NULL;
> > > > >         struct vduse_umem *umem = NULL;
> > > > > @@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > >         unsigned long npages, lock_limit;
> > > > >         int ret;
> > > > >
> > > > > -       if (!dev->domain || !dev->domain->bounce_map ||
> > > > > -           size != dev->domain->bounce_size ||
> > > > > +       if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
> > > > > +           size != dev->as[asid].domain->bounce_size ||
> > > > >             iova != 0 || uaddr & ~PAGE_MASK)
> > > > >                 return -EINVAL;
> > > > >
> > > > > -       mutex_lock(&dev->mem_lock);
> > > > > +       mutex_lock(&dev->as[asid].mem_lock);
> > > > >         ret = -EEXIST;
> > > > > -       if (dev->umem)
> > > > > +       if (dev->as[asid].umem)
> > > > >                 goto unlock;
> > > > >
> > > > >         ret = -ENOMEM;
> > > > > @@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > >                 goto out;
> > > > >         }
> > > > >
> > > > > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > > > +       ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
> > > > >                                                  page_list, pinned);
> > > > >         if (ret)
> > > > >                 goto out;
> > > > > @@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > >         umem->mm = current->mm;
> > > > >         mmgrab(current->mm);
> > > > >
> > > > > -       dev->umem = umem;
> > > > > +       dev->as[asid].umem = umem;
> > > > >  out:
> > > > >         if (ret && pinned > 0)
> > > > >                 unpin_user_pages(page_list, pinned);
> > > > > @@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > >                 vfree(page_list);
> > > > >                 kfree(umem);
> > > > >         }
> > > > > -       mutex_unlock(&dev->mem_lock);
> > > > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > > @@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > >
> > > > >         switch (cmd) {
> > > > >         case VDUSE_IOTLB_GET_FD: {
> > > > > -               struct vduse_iotlb_entry entry;
> > > > > +               struct vduse_iotlb_entry_v2 entry;
> > > >
> > > > Nit: if we stick with entry and do copy_from_user() twice it might
> > > > save lots of unnecessary changes.
> > > >
> > >
> > > I'm happy to move to something else but most of the changes happen
> > > because of s/entry/entry.v1/ . If we stick with just vduse_iotlb_entry
> > > and a separate asid variable, we also need to duplicate the
> > > copy_from_user [2].
> > >
> > > > >                 struct vhost_iotlb_map *map;
> > > > >                 struct vdpa_map_file *map_file;
> > > > >                 struct file *f = NULL;
> > > > > +               u32 asid;
> > > > >
> > > > >                 ret = -EFAULT;
> > > > > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > > -                       break;
> > > > > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > > > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > > +                               break;
> > > > > +               } else {
> > > > > +                       entry.asid = 0;
> > > > > +                       if (copy_from_user(&entry.v1, argp,
> > > > > +                                          sizeof(entry.v1)))
> > > > > +                               break;
> > > > > +               }
> > > > >
> > > > >                 ret = -EINVAL;
> > > > > -               if (entry.start > entry.last)
> > > > > +               if (entry.v1.start > entry.v1.last)
> > > > > +                       break;
> > > > > +
> > > > > +               if (entry.asid >= dev->nas)
> > > > >                         break;
> > > > >
> > > > >                 mutex_lock(&dev->domain_lock);
> > > > > -               if (!dev->domain) {
> > > > > +               asid = array_index_nospec(entry.asid, dev->nas);
> > > > > +               if (!dev->as[asid].domain) {
> > > > >                         mutex_unlock(&dev->domain_lock);
> > > > >                         break;
> > > > >                 }
> > > > > -               spin_lock(&dev->domain->iotlb_lock);
> > > > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > > > -                                             entry.start, entry.last);
> > > > > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > > > > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > > > > +                                             entry.v1.start, entry.v1.last);
> > > > >                 if (map) {
> > > > >                         map_file = (struct vdpa_map_file *)map->opaque;
> > > > >                         f = get_file(map_file->file);
> > > > > -                       entry.offset = map_file->offset;
> > > > > -                       entry.start = map->start;
> > > > > -                       entry.last = map->last;
> > > > > -                       entry.perm = map->perm;
> > > > > +                       entry.v1.offset = map_file->offset;
> > > > > +                       entry.v1.start = map->start;
> > > > > +                       entry.v1.last = map->last;
> > > > > +                       entry.v1.perm = map->perm;
> > > > >                 }
> > > > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > > > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> > > > >                 mutex_unlock(&dev->domain_lock);
> > > > >                 ret = -EINVAL;
> > > > >                 if (!f)
> > > > >                         break;
> > > > >
> > > > > -               ret = -EFAULT;
> > > > > -               if (copy_to_user(argp, &entry, sizeof(entry))) {
> > > > > +               if (dev->api_version >= VDUSE_API_VERSION_1)
> > > > > +                       ret = copy_to_user(argp, &entry,
> > > > > +                                          sizeof(entry));
> > > > > +               else
> > > > > +                       ret = copy_to_user(argp, &entry.v1,
> > > > > +                                          sizeof(entry.v1));
> > > > > +
> > > > > +               if (ret) {
> > > > > +                       ret = -EFAULT;
> > > > >                         fput(f);
> > > > >                         break;
> > > > >                 }
> > > > > -               ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> > > > > +               ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
> > > > >                 fput(f);
> > > > >                 break;
> > > > >         }
> > > > > @@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > >         }
> > > > >         case VDUSE_IOTLB_REG_UMEM: {
> > > > >                 struct vduse_iova_umem umem;
> > > > > +               u32 asid;
> > > > >
> > > > >                 ret = -EFAULT;
> > > > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > > > @@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > >
> > > > >                 ret = -EINVAL;
> > > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > > -                                sizeof(umem.reserved)))
> > > > > +                                sizeof(umem.reserved)) ||
> > > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > > +                    umem.asid != 0) || umem.asid >= dev->nas)
> > > > >                         break;
> > > > >
> > > > >                 mutex_lock(&dev->domain_lock);
> > > > > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > > > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > > > +               ret = vduse_dev_reg_umem(dev, asid, umem.iova,
> > > > >                                          umem.uaddr, umem.size);
> > > > >                 mutex_unlock(&dev->domain_lock);
> > > > >                 break;
> > > > >         }
> > > > >         case VDUSE_IOTLB_DEREG_UMEM: {
> > > > >                 struct vduse_iova_umem umem;
> > > > > +               u32 asid;
> > > > >
> > > > >                 ret = -EFAULT;
> > > > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > > > @@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > >
> > > > >                 ret = -EINVAL;
> > > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > > -                                sizeof(umem.reserved)))
> > > > > +                                sizeof(umem.reserved)) ||
> > > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > > +                    umem.asid != 0) ||
> > > > > +                    umem.asid >= dev->nas)
> > > > >                         break;
> > > > > +
> > > > >                 mutex_lock(&dev->domain_lock);
> > > > > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > > > +               ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
> > > > >                                            umem.size);
> > > > >                 mutex_unlock(&dev->domain_lock);
> > > > >                 break;
> > > > > @@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > >         case VDUSE_IOTLB_GET_INFO: {
> > > >
> > > > Btw I see this:
> > > >
> > > >                 dev->vqs[index]->vq_group = config.group;
> > > >
> > > > In VDUSE_VQ_SETUP:
> > > >
> > > > I wonder what's the reason that it is not a part of CREATE_DEV? I
> > > > meant it might be racy if DMA happens between CREATE_DEV and
> > > > VDUSE_VQ_SETUP.
> > > >
> > >
> > > The reason the vq index -> vq group association cannot be part of
> > > device creation is that we need CVQ to be isolated for live migration,
> > > but the device doesn't know the CVQ index at the CREATE_DEV time, only
> > > after the feature negotiation happens [3].
> >
> > Exactly, the cvq index can change.
> >
> > >
> > > [1] https://lore.kernel.org/lkml/CACGkMEvRQ86dYeY3Enqoj1vkSpefU3roq4XGS+y5B5kmsXEkYg@mail.gmail.com/
> > > [2] https://lore.kernel.org/lkml/CACGkMEtszQeZLTegxEbjODYxu-giTvURu=pKj4kYTHQYoKOzkQ@mail.gmail.com
> > > [3] https://lore.kernel.org/lkml/CAJaqyWcvHx7kwcTceN2jazT0nKNo1r5zdzqWHqpxdna-kCS1RA@mail.gmail.com
> >
> > I see, but one question: what happens if there's a DMA between
> > CREATE_DEV and VDUSE_VQ_SETUP? If we can find ways to forbid this (or
> > it has been forbidden), we are probably fine.
> >
>
> It's already forbidden by vdpa_dev_add:
>
> dev = vduse_find_dev(name);
> if (!dev || !vduse_dev_is_ready(dev)) {
>         mutex_unlock(&vduse_lock);
>         return -EINVAL;
> }
>
> where vduse_dev_is_ready():
> static bool vduse_dev_is_ready(struct vduse_dev *dev)
> {
>         int i;
>
>         for (i = 0; i < dev->vq_num; i++)
>                 if (!dev->vqs[i]->num_max)
>                         return false;
>
>         return true;
> }
>
> Since we set the vq group with the same ioctl as vq->num_max, we are
> safe here. I didn't catch it until now, so thanks for proposing to
> move the vq group parameter to that ioctl back then! :).

Great.

Thanks


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-24  0:20           ` Jason Wang
@ 2025-12-24  7:38             ` Eugenio Perez Martin
  2025-12-25  2:23               ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Eugenio Perez Martin @ 2025-12-24  7:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Wed, Dec 24, 2025 at 1:20 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Dec 23, 2025 at 9:16 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > > > group.  This enables mapping each group into a distinct memory space.
> > > > > >
> > > > > > The vq group to ASID association is protected by a rwlock now.  But the
> > > > > > mutex domain_lock keeps protecting the domains of all ASIDs, as some
> > > > > > operations, like the ones related to the bounce buffer size, still
> > > > > > require locking all the ASIDs.
> > > > > >
> > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > >
> > > > > > ---
> > > > > > Future improvements on top could include performance optimizations that
> > > > > > move to per-ASID locks, or hardening by tracking the ASID or ASID hashes
> > > > > > in unused bits of the DMA address.
> > > > > >
> > > > > > Tested virtio_vdpa by manually adding two threads in vduse_set_status:
> > > > > > one of them modifies the vq group 0 ASID and the other one maps and
> > > > > > unmaps memory continuously.  After a while, the two threads stop and
> > > > > > the usual work continues.
> > > > > >
> > > > > > Tested with vhost_vdpa by migrating a VM while keeping a ping alive on
> > > > > > OVS+VDUSE.  A few workarounds were needed in some parts:
> > > > > > * Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
> > > > > >   the enable message to the userland device.  This will be solved in the
> > > > > >   future.
> > > > > > * Share the suspended state between all vhost devices in QEMU:
> > > > > >   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> > > > > > * Implement a fake VDUSE suspend vdpa operation callback that always
> > > > > >   returns true in the kernel.  DPDK suspends the device at the first
> > > > > >   GET_VRING_BASE.
> > > > > > * Remove the CVQ blocker in ASID.
> > > > > >
> > > > > > ---
> > > > > > v10:
> > > > > > * Back to rwlock version so stronger locks are used.
> > > > > > * Take out allocations from rwlock.
> > > > > > * Forbid changing ASID of a vq group after DRIVER_OK (Jason)
> > > > > > * Remove the redundant re-fetch of the domain variable in
> > > > > >   vduse_dev_max_mapping_size (Yongji).
> > > > > > * Remove unused vdev definition in vdpa map_ops callbacks (kernel test
> > > > > >   robot).
> > > > > >
> > > > > > v9:
> > > > > > * Replace mutex with rwlock, as the vdpa map_ops can run from atomic
> > > > > >   context.
> > > > > >
> > > > > > v8:
> > > > > > * Revert the mutex to rwlock change, it needs proper profiling to
> > > > > >   justify it.
> > > > > >
> > > > > > v7:
> > > > > > * Take write lock in the error path (Jason).
> > > > > >
> > > > > > v6:
> > > > > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > > > > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> > > > > >   (MST).
> > > > > > * Fix struct name not matching in the doc.
> > > > > >
> > > > > > v5:
> > > > > > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> > > > > >   ioctl (Jason).
> > > > > > * Properly set domain bounce size to divide equally between nas (Jason).
> > > > > > * Exclude "padding" member from the only >V1 members in
> > > > > >   vduse_dev_request.
> > > > > >
> > > > > > v4:
> > > > > > * Divide each domain bounce size between the device bounce size (Jason).
> > > > > > * revert unneeded addr = NULL assignment (Jason)
> > > > > > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> > > > > >   return; } (Jason)
> > > > > > * Change a bad multiline comment, using @ character instead of * (Jason).
> > > > > > * Consider config->nas == 0 as a fail (Jason).
> > > > > >
> > > > > > v3:
> > > > > > * Get the vduse domain through the vduse_as in the map functions
> > > > > >   (Jason).
> > > > > > * Squash with the patch creating the vduse_as struct (Jason).
> > > > > > * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
> > > > > >   (Jason)
> > > > > >
> > > > > > v2:
> > > > > > * Convert the use of mutex to rwlock.
> > > > > >
> > > > > > RFC v3:
> > > > > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> > > > > >   value to reduce memory consumption, but vqs are already limited to
> > > > > >   that value and userspace VDUSE is able to allocate that many vqs.
> > > > > > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> > > > > >   VDUSE_IOTLB_GET_INFO.
> > > > > > * Use of array_index_nospec in VDUSE device ioctls.
> > > > > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > > > > * Move the umem mutex to asid struct so there is no contention between
> > > > > >   ASIDs.
> > > > > >
> > > > > > RFC v2:
> > > > > > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > > > >   part of the struct is the same.
> > > > > > ---
> > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
> > > > > >  include/uapi/linux/vduse.h         |  53 ++++-
> > > > > >  2 files changed, 295 insertions(+), 124 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > index 767abcb7e375..786ab2378825 100644
> > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > @@ -41,6 +41,7 @@
> > > > > >
> > > > > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > > > > >         struct mm_struct *mm;
> > > > > >  };
> > > > > >
> > > > > > +struct vduse_as {
> > > > > > +       struct vduse_iova_domain *domain;
> > > > > > +       struct vduse_umem *umem;
> > > > > > +       struct mutex mem_lock;
> > > > >
> > > > > Not related to this patch. But if I'm not wrong we have a 1:1 mapping
> > > > > between domain and as. If this is true, can we use bounce_lock instead
> > > > > of a new mem_lock? I see mem_lock is only used for synchronizing
> > > > > umem reg/dereg, which has been synchronized with the domain rwlock.
> > > > >
> > > >
> > > > I think you're right, but they work at different levels at the moment.
> > > > The mem_lock is at the vduse_dev level and also protects the umem
> > > > pointer, while bounce_lock lives in iova_domain.c.
> > > >
> > > > Maybe the right thing to do is to move umem in iova_domain. Yongji
> > > > Xie, what do you think?
> > > >
> > > > > > +};
> > > > > > +
> > > > > >  struct vduse_vq_group {
> > > > > > +       rwlock_t as_lock;
> > > > > > +       struct vduse_as *as; /* Protected by as_lock */
> > > > > >         struct vduse_dev *dev;
> > > > > >  };
> > > > > >
> > > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > > >         struct vduse_vdpa *vdev;
> > > > > >         struct device *dev;
> > > > > >         struct vduse_virtqueue **vqs;
> > > > > > -       struct vduse_iova_domain *domain;
> > > > > > +       struct vduse_as *as;
> > > > > >         char *name;
> > > > > >         struct mutex lock;
> > > > > >         spinlock_t msg_lock;
> > > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > > >         u32 vq_num;
> > > > > >         u32 vq_align;
> > > > > >         u32 ngroups;
> > > > > > -       struct vduse_umem *umem;
> > > > > > +       u32 nas;
> > > > > >         struct vduse_vq_group *groups;
> > > > > > -       struct mutex mem_lock;
> > > > > >         unsigned int bounce_size;
> > > > > >         struct mutex domain_lock;
> > > > > >  };
> > > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > >  }
> > > > > >
> > > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > > >                                   u64 start, u64 last)
> > > > > >  {
> > > > > >         struct vduse_dev_msg msg = { 0 };
> > > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > >                 return -EINVAL;
> > > > > >
> > > > > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > > -       msg.req.iova.start = start;
> > > > > > -       msg.req.iova.last = last;
> > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > > +               msg.req.iova.start = start;
> > > > > > +               msg.req.iova.last = last;
> > > > > > +       } else {
> > > > > > +               msg.req.iova_v2.start = start;
> > > > > > +               msg.req.iova_v2.last = last;
> > > > > > +               msg.req.iova_v2.asid = asid;
> > > > > > +       }
> > > > > >
> > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > >  }
> > > > > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > > >         return mask;
> > > > > >  }
> > > > > >
> > > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > > +                                      unsigned int group, unsigned int asid)
> > > > > > +{
> > > > > > +       write_lock(&dev->groups[group].as_lock);
> > > > > > +       dev->groups[group].as = &dev->as[asid];
> > > > > > +       write_unlock(&dev->groups[group].as_lock);
> > > > > > +}
> > > > > > +
> > > > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > > > >  {
> > > > > >         int i;
> > > > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > > > >
> > > > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > > -       if (domain && domain->bounce_map)
> > > > > > -               vduse_domain_reset_bounce_map(domain);
> > > > > > +       for (i = 0; i < dev->nas; i++) {
> > > > > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > > +
> > > > > > +               if (domain && domain->bounce_map)
> > > > > > +                       vduse_domain_reset_bounce_map(domain);
> > > > >
> > > > > Btw, I see this:
> > > > >
> > > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > > {
> > > > >         if (!domain->bounce_map)
> > > > >                 return;
> > > > >
> > > > >         spin_lock(&domain->iotlb_lock);
> > > > >         if (!domain->bounce_map)
> > > > >                 goto unlock;
> > > > >
> > > > >
> > > > > The bounce_map is checked twice, let's fix that.
> > > > >
> > > >
> > > > Double checked locking to avoid taking the lock?
> > >
> > > I don't know, but I think we don't care too much about the performance
> > > of vduse_domain_reset_bounce_map().
> > >
> > > > I don't think it is
> > > > worth keeping since it is not in the hot path anyway. But that
> > > > would also be another patch, independent of this series, wouldn't it?
> > >
> > > Yes, it's another independent issue I just found when reviewing this patch.
> > >
> > > >
> > > > > > +       }
> > > > > > +
> > > > > > +       for (i = 0; i < dev->ngroups; i++)
> > > > > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> > > > >
> > > > > Note that this function still does:
> > > > >
> > > > >                 vq->vq_group = 0;
> > > > >
> > > > > Which is wrong.
> > > > >
> > > >
> > > > Right, removing it for the next version. Thanks for the catch!
> > > >
> > > > > >
> > > > > >         down_write(&dev->rwsem);
> > > > > >
> > > > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > > >         return ret;
> > > > > >  }
> > > > > >
> > > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > > +                               unsigned int asid)
> > > > > > +{
> > > > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > +       struct vduse_dev_msg msg = { 0 };
> > > > > > +       int r;
> > > > > > +
> > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > > > +               return -EINVAL;
> > > > >
> > > > > If we forbid setting the group asid after DRIVER_OK, why do we
> > > > > still need a rwlock?
> > > >
> > > > virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> > > > vrings in the bounce buffer, for example. If you're ok with that I'm
> > > > ok with removing the lock, as all the calls are issued by the driver
> > > > setup process anyway. Or should we just keep it for alloc?
> > >
> > > I see, then I think we need to keep that. The reason is that there's
> > > no guarantee that alloc() is called before DRIVER_OK.
> > >
> > > >
> > > > Anyway, I think I misunderstood your comment from [1] then.
> > > >
> > > > > All we need to do is to synchronize set_group_asid() with
> > > > > set_status()/reset()?
> > > > >
> > > >
> > > > That's also a good one. There is no synchronization if one thread calls
> > > > reset and then the device is set up from another thread. But all this
> > > > is still hypothetical, because virtio_vdpa does not support
> > > > set_group_asid,
> > >
> > > Right.
> > >
> > > > and the vhost one is already protected by the vhost lock, so do
> > > > we need it?
> > >
> > > Let's add a TODO in the code.
> > >
> >
> > Can you expand on this? I meant that there will be no synchronization
> > if we remove the rwlock (or similar). If we keep the rwlock we don't
> > need to add any TODO, or am I missing something?
>
> I meant even with the rwlock we don't synchronize set_group_asid() and
> set_status(). Since you said vhost has been synchronized, I think we
> should either
>
> 1) document the synchronization that needs to be done in the upper layer or
> 2) add a todo to synchronize the set_status() and set_group_asid()
>

With the VDUSE messages they are synchronized by dev->msg_lock, so the
vduse module always has a coherent status from both sides: the VDUSE
userland device and virtio_vdpa. If you want vduse not to trust vdpa,
we can go the extra mile and make dev->status atomic for both device
features and the group ASID. Would that work?
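
To make that concrete, here is a minimal sketch of the hardening I
have in mind, assuming we keep the current callback signature. The
READ_ONCE()/WRITE_ONCE() pairing is only illustrative, it is not what
this series implements:

static int vduse_set_group_asid(struct vdpa_device *vdpa,
                                unsigned int group, unsigned int asid)
{
        struct vduse_dev *dev = vdpa_to_vduse(vdpa);

        /* Assumes set_status()/reset() update dev->status with
         * WRITE_ONCE(), so this check cannot observe a torn write even
         * without taking any lock.
         */
        if (dev->api_version < VDUSE_API_VERSION_1 ||
            group >= dev->ngroups || asid >= dev->nas ||
            READ_ONCE(dev->status) & VIRTIO_CONFIG_S_DRIVER_OK)
                return -EINVAL;
        ...
}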

> >
> > > >
> > > > > Or if you want to synchronize map ops with set_status() that looks
> > > > > like an independent thing (hardening).
> > > > >
> > > > > > +
> > > > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > > +       msg.req.vq_group_asid.group = group;
> > > > > > +       msg.req.vq_group_asid.asid = asid;
> > > > > > +
> > > > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > > > +       if (r < 0)
> > > > > > +               return r;
> > > > > > +
> > > > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > > > >
> > > > > I'm not sure this has been discussed before, but I think it would be
> > > > > better to introduce a new ioctl to get group -> as mapping. This helps
> > > > > to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > > > > require the userspace to poll vduse fd before DRIVER_OK.
> > > > >
> >
> > The userspace VDUSE device must poll the vduse fd to get the DRIVER_OK
> > anyway, so it cannot avoid polling the vduse device. What are the
> > reasons to avoid vduse_dev_msg_sync?
>
> One less synchronization point with userspace.
>

But the ioctl alternative means more synchronization from my POV, not
less. Instead of receiving notifications that we could even batch, the
VDUSE userland device needs to issue many synchronous ioctls.
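
To illustrate the difference: with the message-based approach of this
series, the userland device just handles one more request type in the
read() loop it already runs on the vduse fd. A rough sketch, where
handle_set_group_asid() is a hypothetical device-side helper:

        struct vduse_dev_request req;
        struct vduse_dev_response resp;

        while (read(dev_fd, &req, sizeof(req)) == sizeof(req)) {
                memset(&resp, 0, sizeof(resp));
                resp.request_id = req.request_id;

                switch (req.type) {
                case VDUSE_SET_VQ_GROUP_ASID:
                        resp.result =
                                handle_set_group_asid(req.vq_group_asid.group,
                                                      req.vq_group_asid.asid) ?
                                VDUSE_REQ_RESULT_FAILED : VDUSE_REQ_RESULT_OK;
                        break;
                /* ... VDUSE_SET_STATUS, VDUSE_UPDATE_IOTLB, etc. ... */
                }
                write(dev_fd, &resp, sizeof(resp));
        }

This is the notification path I mean: the kernel's vduse_dev_msg_sync()
blocks until the response is written back, so the ordering with the
later DRIVER_OK message comes for free.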

> >
> > > >
> > > > I'm fine with that, but how do we communicate that they have changed?
> > >
> > > Since we forbid changing the group->as mapping after DRIVER_OK,
> > > userspace just needs to use that ioctl once after DRIVER_OK.
> > >
> >
> > But the userland VDUSE device needs to know when the ASIDs are set by
> > the driver and will not change anymore. In this series it is solved by
> > the order of the messages, but now we would need a way to know that
> > point in time. One idea is to issue this new ioctl when it receives the
> > VDUSE_SET_STATUS msg.
>
> Yes.
>
> >
> > If we do it that way there is still a window where the hypothetical
> > (malicious) virtio_vdpa driver can read and write vq groups from
> > different threads.
>
> See above, we need to add synchronization in either vdpa or virtio_vdpa.
>
> > It could issue a set_group_asid after the VDUSE
> > device returns from the ioctl but before dev->status has been
> > updated.
>
> In the case of VDUSE I think neither side should trust the other
> side. So userspace needs to be prepared for this, and so does the
> driver. Since the memory access is initiated from userspace, it
> should not perform any memory access before the ioctl to fetch the
> group->asid mapping.
>

I don't follow this. In the case of virtio_vdpa the ASID groups and
features are handled entirely by the kernel. And in the case of
vhost_vdpa they are protected by the vhost_dev->mutex lock. So can we
trust them?

> >
> > I don't think we should protect that, but if we want to do it we
> > should protect that part either by acquiring the rwlock and trusting
> > the vduse_dev_msg_sync timeout, or by proceeding with atomics,
> > smp_store_release / smp_load_acquire, read and write barriers...
> >
> > Note that I still think this is overthinking. We have the same
> > problems with driver features, where changing bits like the packed vq
> > one changes the behavior of the vDPA callbacks and could
> > desynchronize the vduse kernel and the userland device.
>
> Probably, but for driver features, it's too late to do the change.
>

Why is it too late? We just need to add the same synchronization for
both driver features and ASID groups. The change is not visible to
virtio_vdpa, vhost_vdpa, or the VDUSE userland device at all.

> > But since the
> > virtio_vdpa and vduse kernel modules run in the same kernel, they should
> > be able to trust each other.
>
> Better not, usually the lower layer (vdpa/vduse) should not trust the
> upper layer (virtio-vdpa) in this case.
>

Ok, adding the hardening in the next version!

> But if you stick to the method with vduse_dev_msg_sync, I'm also fine with that.
>
> >
> > > > Or how to communicate to the driver that the device does not accept
> > > > the assignment of the ASID to the group?
> > >
> > > See above.
> > >
> >
> > I didn't find the answer :(.
> >
> > Let me put an example with QEMU and vhost_vdpa:
> > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and asid.
> > - vduse cannot send this information to the device as it must wait for
> > the ioctl.
>
> Which ioctl did you mean in this case?
>

The new ioctl from the userland VDUSE device that you're proposing to
accept or reject the group ASID assignment. Let's call it
VDUSE_GET_GROUP_ASID, and let me rewrite the flow from that moment:

- QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group (in
vdpa->ngroups range) and a valid asid (in vdpa->nas range)
- The vduse kernel module cannot send the information to the VDUSE
userland device; it must wait until the VDUSE userland device calls
the new VDUSE_GET_GROUP_ASID. That will not happen until QEMU calls
VHOST_VDPA_SET_STATUS with DRIVER_OK, but QEMU will not call
VHOST_VDPA_SET_STATUS until the kernel returns from
VHOST_VDPA_SET_GROUP_ASID. So the vduse kernel module needs to return
success without knowing whether the device will accept the assignment
in the future or not.
- Now QEMU sends DRIVER_OK, so the vduse kernel module forwards it to
the VDUSE userland device. The VDUSE userland device then calls
VDUSE_GET_GROUP_ASID, and the flow continues.
- The vduse userland device doesn't accept the vq group ASID mapping for
whatever reason, so it returns an error through another ioctl
(VDUSE_CONFIRM_GROUP_ASID with a boolean as argument?). How should the
vduse kernel module proceed? Just reset the whole device with
NEED_RESET?
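
Just to make the problem concrete, the userland side of that flow
would look something like the sketch below. The two ioctl names, the
confirm struct and device_can_use_asid() are all invented for this
discussion; only the vq_group_asid layout (group + asid) comes from
this series:

        /* After receiving VDUSE_SET_STATUS with DRIVER_OK on the vduse
         * fd, fetch the group -> asid mapping the driver requested ...
         */
        struct vduse_vq_group_asid map = { .group = 0 /* for each group */ };

        if (ioctl(dev_fd, VDUSE_GET_GROUP_ASID, &map))
                err(1, "VDUSE_GET_GROUP_ASID");

        /* ... and accept or reject it through the second ioctl. */
        struct {
                __u32 group;
                __u8 accepted;
        } confirm = {
                .group = map.group,
                .accepted = device_can_use_asid(map.group, map.asid),
        };

        if (ioctl(dev_fd, VDUSE_CONFIRM_GROUP_ASID, &confirm))
                err(1, "VDUSE_CONFIRM_GROUP_ASID");

And if confirm.accepted ends up being false there, I don't see a
better option for the vduse kernel module than NEED_RESET, because the
driver already believes the assignment succeeded.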

> >
> > > >
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > >  static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > > >                                 struct vdpa_vq_state *state)
> > > > > >  {
> > > > > > @@ -794,13 +847,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > > > >         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > >         int ret;
> > > > > >
> > > > > > -       ret = vduse_domain_set_map(dev->domain, iotlb);
> > > > > > +       ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> > > > > >         if (ret)
> > > > > >                 return ret;
> > > > > >
> > > > > > -       ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > > > > +       ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> > > > > >         if (ret) {
> > > > > > -               vduse_domain_clear_map(dev->domain, iotlb);
> > > > > > +               vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> > > > > >                 return ret;
> > > > > >         }
> > > > > >
> > > > > > @@ -843,6 +896,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > > > >         .get_vq_affinity        = vduse_vdpa_get_vq_affinity,
> > > > > >         .reset                  = vduse_vdpa_reset,
> > > > > >         .set_map                = vduse_vdpa_set_map,
> > > > > > +       .set_group_asid         = vduse_set_group_asid,
> > > > > >         .get_vq_map             = vduse_get_vq_map,
> > > > > >         .free                   = vduse_vdpa_free,
> > > > > >  };
> > > > > > @@ -851,32 +905,30 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> > > > > >                                              dma_addr_t dma_addr, size_t size,
> > > > > >                                              enum dma_data_direction dir)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > >         struct vduse_iova_domain *domain;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > -
> > > > > > +       read_lock(&token.group->as_lock);
> > > > >
> > > > > I think we could optimize the lock here. E.g. when nas is 1, we don't
> > > > > need any lock in fact.
> > > > >
> > > >
> > > > Good point! Not taking the lock in that case for the next version, thanks!
> > > >
> > > > > > +       domain = token.group->as->domain;
> > > > > >         vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >  }
> > > > > >
> > > > > >  static void vduse_dev_sync_single_for_cpu(union virtio_map token,
> > > > > >                                              dma_addr_t dma_addr, size_t size,
> > > > > >                                              enum dma_data_direction dir)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > >         struct vduse_iova_domain *domain;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > -
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       domain = token.group->as->domain;
> > > > > >         vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >  }
> > > > > >
> > > > > >  static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > > > > @@ -884,38 +936,38 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > > > >                                      enum dma_data_direction dir,
> > > > > >                                      unsigned long attrs)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > >         struct vduse_iova_domain *domain;
> > > > > > +       dma_addr_t r;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return DMA_MAPPING_ERROR;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       domain = token.group->as->domain;
> > > > > > +       r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >
> > > > > > -       return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > > > > +       return r;
> > > > > >  }
> > > > > >
> > > > > >  static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr,
> > > > > >                                  size_t size, enum dma_data_direction dir,
> > > > > >                                  unsigned long attrs)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > >         struct vduse_iova_domain *domain;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > -
> > > > > > -       return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       domain = token.group->as->domain;
> > > > > > +       vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >  }
> > > > > >
> > > > > >  static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > > > > >                                       dma_addr_t *dma_addr, gfp_t flag)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > >         struct vduse_iova_domain *domain;
> > > > > >         unsigned long iova;
> > > > > >         void *addr;
> > > > > > @@ -928,18 +980,21 @@ static void *vduse_dev_alloc_coherent(union virtio_map token, size_t size,
> > > > > >         if (!addr)
> > > > > >                 return NULL;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > +       *dma_addr = (dma_addr_t)iova;
> > > > >
> > > > > Any reason we need to touch *dma_addr here? It might trigger UBSAN/KMSAN.
> > > > >
> > > >
> > > > No, this is a leftover. I'm fixing it for the next version. Thanks!
> > > >
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       domain = token.group->as->domain;
> > > > > >         addr = vduse_domain_alloc_coherent(domain, size,
> > > > > >                                            (dma_addr_t *)&iova, addr);
> > > > > >         if (!addr)
> > > > > >                 goto err;
> > > > > >
> > > > > >         *dma_addr = (dma_addr_t)iova;
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >
> > > > > >         return addr;
> > > > > >
> > > > > >  err:
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >         free_pages_exact(addr, size);
> > > > > >         return NULL;
> > > > > >  }
> > > > > > @@ -948,31 +1003,30 @@ static void vduse_dev_free_coherent(union virtio_map token, size_t size,
> > > > > >                                     void *vaddr, dma_addr_t dma_addr,
> > > > > >                                     unsigned long attrs)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > >         struct vduse_iova_domain *domain;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > -
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       domain = token.group->as->domain;
> > > > > >         vduse_domain_free_coherent(domain, size, dma_addr, attrs);
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >         free_pages_exact(vaddr, size);
> > > > > >  }
> > > > > >
> > > > > >  static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > > -       struct vduse_iova_domain *domain;
> > > > > > +       size_t bounce_size;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return false;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       bounce_size = token.group->as->domain->bounce_size;
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >
> > > > > > -       return dma_addr < domain->bounce_size;
> > > > > > +       return dma_addr < bounce_size;
> > > > > >  }
> > > > > >
> > > > > >  static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > > > > @@ -984,16 +1038,16 @@ static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr)
> > > > > >
> > > > > >  static size_t vduse_dev_max_mapping_size(union virtio_map token)
> > > > > >  {
> > > > > > -       struct vduse_dev *vdev;
> > > > > > -       struct vduse_iova_domain *domain;
> > > > > > +       size_t bounce_size;
> > > > > >
> > > > > >         if (!token.group)
> > > > > >                 return 0;
> > > > > >
> > > > > > -       vdev = token.group->dev;
> > > > > > -       domain = vdev->domain;
> > > > > > +       read_lock(&token.group->as_lock);
> > > > > > +       bounce_size = token.group->as->domain->bounce_size;
> > > > > > +       read_unlock(&token.group->as_lock);
> > > > > >
> > > > > > -       return domain->bounce_size;
> > > > > > +       return bounce_size;
> > > > > >  }
> > > > > >
> > > > > >  static const struct virtio_map_ops vduse_map_ops = {
> > > > > > @@ -1133,39 +1187,40 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > > > >         return ret;
> > > > > >  }
> > > > > >
> > > > > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > > > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > > > > >                                 u64 iova, u64 size)
> > > > > >  {
> > > > > >         int ret;
> > > > > >
> > > > > > -       mutex_lock(&dev->mem_lock);
> > > > > > +       mutex_lock(&dev->as[asid].mem_lock);
> > > > > >         ret = -ENOENT;
> > > > > > -       if (!dev->umem)
> > > > > > +       if (!dev->as[asid].umem)
> > > > > >                 goto unlock;
> > > > > >
> > > > > >         ret = -EINVAL;
> > > > > > -       if (!dev->domain)
> > > > > > +       if (!dev->as[asid].domain)
> > > > > >                 goto unlock;
> > > > > >
> > > > > > -       if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > > > > +       if (dev->as[asid].umem->iova != iova ||
> > > > > > +           size != dev->as[asid].domain->bounce_size)
> > > > > >                 goto unlock;
> > > > > >
> > > > > > -       vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > > > -       unpin_user_pages_dirty_lock(dev->umem->pages,
> > > > > > -                                   dev->umem->npages, true);
> > > > > > -       atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > > > > -       mmdrop(dev->umem->mm);
> > > > > > -       vfree(dev->umem->pages);
> > > > > > -       kfree(dev->umem);
> > > > > > -       dev->umem = NULL;
> > > > > > +       vduse_domain_remove_user_bounce_pages(dev->as[asid].domain);
> > > > > > +       unpin_user_pages_dirty_lock(dev->as[asid].umem->pages,
> > > > > > +                                   dev->as[asid].umem->npages, true);
> > > > > > +       atomic64_sub(dev->as[asid].umem->npages, &dev->as[asid].umem->mm->pinned_vm);
> > > > > > +       mmdrop(dev->as[asid].umem->mm);
> > > > > > +       vfree(dev->as[asid].umem->pages);
> > > > > > +       kfree(dev->as[asid].umem);
> > > > > > +       dev->as[asid].umem = NULL;
> > > > > >         ret = 0;
> > > > > >  unlock:
> > > > > > -       mutex_unlock(&dev->mem_lock);
> > > > > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > > > > >         return ret;
> > > > > >  }
> > > > > >
> > > > > >  static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > > > -                             u64 iova, u64 uaddr, u64 size)
> > > > > > +                             u32 asid, u64 iova, u64 uaddr, u64 size)
> > > > > >  {
> > > > > >         struct page **page_list = NULL;
> > > > > >         struct vduse_umem *umem = NULL;
> > > > > > @@ -1173,14 +1228,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > > >         unsigned long npages, lock_limit;
> > > > > >         int ret;
> > > > > >
> > > > > > -       if (!dev->domain || !dev->domain->bounce_map ||
> > > > > > -           size != dev->domain->bounce_size ||
> > > > > > +       if (!dev->as[asid].domain || !dev->as[asid].domain->bounce_map ||
> > > > > > +           size != dev->as[asid].domain->bounce_size ||
> > > > > >             iova != 0 || uaddr & ~PAGE_MASK)
> > > > > >                 return -EINVAL;
> > > > > >
> > > > > > -       mutex_lock(&dev->mem_lock);
> > > > > > +       mutex_lock(&dev->as[asid].mem_lock);
> > > > > >         ret = -EEXIST;
> > > > > > -       if (dev->umem)
> > > > > > +       if (dev->as[asid].umem)
> > > > > >                 goto unlock;
> > > > > >
> > > > > >         ret = -ENOMEM;
> > > > > > @@ -1204,7 +1259,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > > >                 goto out;
> > > > > >         }
> > > > > >
> > > > > > -       ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > > > > +       ret = vduse_domain_add_user_bounce_pages(dev->as[asid].domain,
> > > > > >                                                  page_list, pinned);
> > > > > >         if (ret)
> > > > > >                 goto out;
> > > > > > @@ -1217,7 +1272,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > > >         umem->mm = current->mm;
> > > > > >         mmgrab(current->mm);
> > > > > >
> > > > > > -       dev->umem = umem;
> > > > > > +       dev->as[asid].umem = umem;
> > > > > >  out:
> > > > > >         if (ret && pinned > 0)
> > > > > >                 unpin_user_pages(page_list, pinned);
> > > > > > @@ -1228,7 +1283,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > > > >                 vfree(page_list);
> > > > > >                 kfree(umem);
> > > > > >         }
> > > > > > -       mutex_unlock(&dev->mem_lock);
> > > > > > +       mutex_unlock(&dev->as[asid].mem_lock);
> > > > > >         return ret;
> > > > > >  }
> > > > > >
> > > > > > @@ -1260,47 +1315,66 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > > >
> > > > > >         switch (cmd) {
> > > > > >         case VDUSE_IOTLB_GET_FD: {
> > > > > > -               struct vduse_iotlb_entry entry;
> > > > > > +               struct vduse_iotlb_entry_v2 entry;
> > > > >
> > > > > Nit: if we stick with entry and do copy_from_user() twice it might
> > > > > save lots of unnecessary changes.
> > > > >
> > > >
> > > > I'm happy to move to something else but most of the changes happen
> > > > because of s/entry/entry.v1/ . If we stick with just vduse_iotlb_entry
> > > > and a separate asid variable, we also need to duplicate the
> > > > copy_from_user [2].
> > > >
> > > > > >                 struct vhost_iotlb_map *map;
> > > > > >                 struct vdpa_map_file *map_file;
> > > > > >                 struct file *f = NULL;
> > > > > > +               u32 asid;
> > > > > >
> > > > > >                 ret = -EFAULT;
> > > > > > -               if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > > > -                       break;
> > > > > > +               if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > > > > +                       if (copy_from_user(&entry, argp, sizeof(entry)))
> > > > > > +                               break;
> > > > > > +               } else {
> > > > > > +                       entry.asid = 0;
> > > > > > +                       if (copy_from_user(&entry.v1, argp,
> > > > > > +                                          sizeof(entry.v1)))
> > > > > > +                               break;
> > > > > > +               }
> > > > > >
> > > > > >                 ret = -EINVAL;
> > > > > > -               if (entry.start > entry.last)
> > > > > > +               if (entry.v1.start > entry.v1.last)
> > > > > > +                       break;
> > > > > > +
> > > > > > +               if (entry.asid >= dev->nas)
> > > > > >                         break;
> > > > > >
> > > > > >                 mutex_lock(&dev->domain_lock);
> > > > > > -               if (!dev->domain) {
> > > > > > +               asid = array_index_nospec(entry.asid, dev->nas);
> > > > > > +               if (!dev->as[asid].domain) {
> > > > > >                         mutex_unlock(&dev->domain_lock);
> > > > > >                         break;
> > > > > >                 }
> > > > > > -               spin_lock(&dev->domain->iotlb_lock);
> > > > > > -               map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > > > > -                                             entry.start, entry.last);
> > > > > > +               spin_lock(&dev->as[asid].domain->iotlb_lock);
> > > > > > +               map = vhost_iotlb_itree_first(dev->as[asid].domain->iotlb,
> > > > > > +                                             entry.v1.start, entry.v1.last);
> > > > > >                 if (map) {
> > > > > >                         map_file = (struct vdpa_map_file *)map->opaque;
> > > > > >                         f = get_file(map_file->file);
> > > > > > -                       entry.offset = map_file->offset;
> > > > > > -                       entry.start = map->start;
> > > > > > -                       entry.last = map->last;
> > > > > > -                       entry.perm = map->perm;
> > > > > > +                       entry.v1.offset = map_file->offset;
> > > > > > +                       entry.v1.start = map->start;
> > > > > > +                       entry.v1.last = map->last;
> > > > > > +                       entry.v1.perm = map->perm;
> > > > > >                 }
> > > > > > -               spin_unlock(&dev->domain->iotlb_lock);
> > > > > > +               spin_unlock(&dev->as[asid].domain->iotlb_lock);
> > > > > >                 mutex_unlock(&dev->domain_lock);
> > > > > >                 ret = -EINVAL;
> > > > > >                 if (!f)
> > > > > >                         break;
> > > > > >
> > > > > > -               ret = -EFAULT;
> > > > > > -               if (copy_to_user(argp, &entry, sizeof(entry))) {
> > > > > > +               if (dev->api_version >= VDUSE_API_VERSION_1)
> > > > > > +                       ret = copy_to_user(argp, &entry,
> > > > > > +                                          sizeof(entry));
> > > > > > +               else
> > > > > > +                       ret = copy_to_user(argp, &entry.v1,
> > > > > > +                                          sizeof(entry.v1));
> > > > > > +
> > > > > > +               if (ret) {
> > > > > > +                       ret = -EFAULT;
> > > > > >                         fput(f);
> > > > > >                         break;
> > > > > >                 }
> > > > > > -               ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> > > > > > +               ret = receive_fd(f, NULL, perm_to_file_flags(entry.v1.perm));
> > > > > >                 fput(f);
> > > > > >                 break;
> > > > > >         }
> > > > > > @@ -1445,6 +1519,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > > >         }
> > > > > >         case VDUSE_IOTLB_REG_UMEM: {
> > > > > >                 struct vduse_iova_umem umem;
> > > > > > +               u32 asid;
> > > > > >
> > > > > >                 ret = -EFAULT;
> > > > > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > > > > @@ -1452,17 +1527,21 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > > >
> > > > > >                 ret = -EINVAL;
> > > > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > > > -                                sizeof(umem.reserved)))
> > > > > > +                                sizeof(umem.reserved)) ||
> > > > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > > > +                    umem.asid != 0) || umem.asid >= dev->nas)
> > > > > >                         break;
> > > > > >
> > > > > >                 mutex_lock(&dev->domain_lock);
> > > > > > -               ret = vduse_dev_reg_umem(dev, umem.iova,
> > > > > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > > > > +               ret = vduse_dev_reg_umem(dev, asid, umem.iova,
> > > > > >                                          umem.uaddr, umem.size);
> > > > > >                 mutex_unlock(&dev->domain_lock);
> > > > > >                 break;
> > > > > >         }
> > > > > >         case VDUSE_IOTLB_DEREG_UMEM: {
> > > > > >                 struct vduse_iova_umem umem;
> > > > > > +               u32 asid;
> > > > > >
> > > > > >                 ret = -EFAULT;
> > > > > >                 if (copy_from_user(&umem, argp, sizeof(umem)))
> > > > > > @@ -1470,10 +1549,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > > >
> > > > > >                 ret = -EINVAL;
> > > > > >                 if (!is_mem_zero((const char *)umem.reserved,
> > > > > > -                                sizeof(umem.reserved)))
> > > > > > +                                sizeof(umem.reserved)) ||
> > > > > > +                   (dev->api_version < VDUSE_API_VERSION_1 &&
> > > > > > +                    umem.asid != 0) ||
> > > > > > +                    umem.asid >= dev->nas)
> > > > > >                         break;
> > > > > > +
> > > > > >                 mutex_lock(&dev->domain_lock);
> > > > > > -               ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > > > > +               asid = array_index_nospec(umem.asid, dev->nas);
> > > > > > +               ret = vduse_dev_dereg_umem(dev, asid, umem.iova,
> > > > > >                                            umem.size);
> > > > > >                 mutex_unlock(&dev->domain_lock);
> > > > > >                 break;
> > > > > > @@ -1481,6 +1565,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > > > >         case VDUSE_IOTLB_GET_INFO: {
> > > > >
> > > > > Btw I see this:
> > > > >
> > > > >                 dev->vqs[index]->vq_group = config.group;
> > > > >
> > > > > In VDUSE_VQ_SETUP:
> > > > >
> > > > > I wonder what's the reason that it is not a part of CREATE_DEV? I
> > > > > meant it might be racy if DMA happens between CREATE_DEV and
> > > > > VDUSE_VQ_SETUP.
> > > > >
> > > >
> > > > The reason the vq index -> vq group association cannot be part of
> > > > device creation is that we need CVQ to be isolated for live migration,
> > > > but the device doesn't know the CVQ index at the CREATE_DEV time, only
> > > > after the feature negotiation happens [3].
> > >
> > > Exactly, the cvq index can change.
> > >
> > > >
> > > > [1] https://lore.kernel.org/lkml/CACGkMEvRQ86dYeY3Enqoj1vkSpefU3roq4XGS+y5B5kmsXEkYg@mail.gmail.com/
> > > > [2] https://lore.kernel.org/lkml/CACGkMEtszQeZLTegxEbjODYxu-giTvURu=pKj4kYTHQYoKOzkQ@mail.gmail.com
> > > > [3] https://lore.kernel.org/lkml/CAJaqyWcvHx7kwcTceN2jazT0nKNo1r5zdzqWHqpxdna-kCS1RA@mail.gmail.com
> > >
> > > I see, but one question: what happens if there's a DMA between
> > > CREATE_DEV and VDUSE_VQ_SETUP? If we can find ways to forbid this (or
> > > it is already forbidden), we are probably fine.
> > >
> >
> > It's already forbidden by vdpa_dev_add:
> >
> > dev = vduse_find_dev(name);
> > if (!dev || !vduse_dev_is_ready(dev)) {
> >         mutex_unlock(&vduse_lock);
> >         return -EINVAL;
> > }
> >
> > where vduse_dev_is_ready():
> > static bool vduse_dev_is_ready(struct vduse_dev *dev)
> > {
> >         int i;
> >
> >         for (i = 0; i < dev->vq_num; i++)
> >                 if (!dev->vqs[i]->num_max)
> >                         return false;
> >
> >         return true;
> > }
> >
> > Since we set the vq group with the same ioctl as vq->num_max, we are
> > safe here. I didn't catch it until now, so thanks for proposing to
> > move the vq group parameter to that ioctl back then! :).
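> >
> > For reference, the userland device side would look roughly like this.
> > Just a sketch: VDUSE_VQ_SETUP and the .group field are from this
> > series' v1 uapi, while the group_of() helper and the values are made
> > up:
> >
> > struct vduse_vq_config cfg = {
> >         .index    = i,           /* vq index */
> >         .max_size = 256,         /* becomes vq->num_max */
> >         .group    = group_of(i), /* v1 only: vq -> vq group */
> > };
> >
> > /* One call per vq; vduse_dev_is_ready() only passes once every vq
> >  * got it, so both num_max and the vq group are set by then.
> >  */
> > if (ioctl(dev_fd, VDUSE_VQ_SETUP, &cfg))
> >         err(1, "VDUSE_VQ_SETUP");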
>
> Great.
>
> Thanks
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-24  7:38             ` Eugenio Perez Martin
@ 2025-12-25  2:23               ` Jason Wang
  2025-12-26 11:38                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2025-12-25  2:23 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Wed, Dec 24, 2025 at 3:39 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Dec 24, 2025 at 1:20 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, Dec 23, 2025 at 9:16 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > > > > group.  This enables mapping each group into a distinct memory space.
> > > > > > >
> > > > > > > The vq group to ASID association is now protected by a rwlock.  But the
> > > > > > > mutex domain_lock keeps protecting the domains of all ASIDs, as some
> > > > > > > operations, like the ones related to the bounce buffer size, still
> > > > > > > require locking all the ASIDs.
> > > > > > >
> > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > >
> > > > > > > ---
> > > > > > > Future improvements on top could include performance optimizations,
> > > > > > > like moving to per-ASID locks, or hardening, like tracking the ASID or
> > > > > > > ASID hashes in unused bits of the DMA address.
> > > > > > >
> > > > > > > Tested virtio_vdpa by manually adding two threads in vduse_set_status:
> > > > > > > one of them modifies the vq group 0 ASID and the other one maps and
> > > > > > > unmaps memory continuously.  After a while, the two threads stop and
> > > > > > > the usual work continues.
> > > > > > >
> > > > > > > Tested with vhost_vdpa by migrating a VM while keeping a ping alive on
> > > > > > > OVS+VDUSE.  A few workarounds were needed in some parts:
> > > > > > > * Do not enable CVQ before data vqs in QEMU, as VDUSE does not forward
> > > > > > >   the enable message to the userland device.  This will be solved in the
> > > > > > >   future.
> > > > > > > * Share the suspended state between all vhost devices in QEMU:
> > > > > > >   https://lists.nongnu.org/archive/html/qemu-devel/2025-11/msg02947.html
> > > > > > > * Implement a fake VDUSE suspend vdpa operation callback that always
> > > > > > >   returns true in the kernel.  DPDK suspends the device at the first
> > > > > > >   GET_VRING_BASE.
> > > > > > > * Remove the CVQ blocker in ASID.
> > > > > > >
> > > > > > > ---
> > > > > > > v10:
> > > > > > > * Back to rwlock version so stronger locks are used.
> > > > > > > * Take out allocations from rwlock.
> > > > > > > * Forbid changing ASID of a vq group after DRIVER_OK (Jason)
> > > > > > > * Remove bad fetching again of domain variable in
> > > > > > >   vduse_dev_max_mapping_size (Yongji).
> > > > > > > * Remove unused vdev definition in vdpa map_ops callbacks (kernel test
> > > > > > >   robot).
> > > > > > >
> > > > > > > v9:
> > > > > > > * Replace mutex with rwlock, as the vdpa map_ops can run from atomic
> > > > > > >   context.
> > > > > > >
> > > > > > > v8:
> > > > > > > * Revert the mutex to rwlock change, it needs proper profiling to
> > > > > > >   justify it.
> > > > > > >
> > > > > > > v7:
> > > > > > > * Take write lock in the error path (Jason).
> > > > > > >
> > > > > > > v6:
> > > > > > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > > > > > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> > > > > > >   (MST).
> > > > > > > * Fix struct name not matching in the doc.
> > > > > > >
> > > > > > > v5:
> > > > > > > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> > > > > > >   ioctl (Jason).
> > > > > > > * Properly set domain bounce size to divide equally between nas (Jason).
> > > > > > > * Exclude "padding" member from the only >V1 members in
> > > > > > >   vduse_dev_request.
> > > > > > >
> > > > > > > v4:
> > > > > > > * Divide the device bounce size evenly between the domains (Jason).
> > > > > > > * Revert unneeded addr = NULL assignment (Jason)
> > > > > > > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> > > > > > >   return; } (Jason)
> > > > > > > * Change a bad multiline comment, using @ character instead of * (Jason).
> > > > > > > * Consider config->nas == 0 as a fail (Jason).
> > > > > > >
> > > > > > > v3:
> > > > > > > * Get the vduse domain through the vduse_as in the map functions
> > > > > > >   (Jason).
> > > > > > > * Squash with the patch creating the vduse_as struct (Jason).
> > > > > > > * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
> > > > > > >   (Jason)
> > > > > > >
> > > > > > > v2:
> > > > > > > * Convert the use of mutex to rwlock.
> > > > > > >
> > > > > > > RFC v3:
> > > > > > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> > > > > > >   value to reduce memory consumption, but vqs are already limited to
> > > > > > >   that value and userspace VDUSE is able to allocate that many vqs.
> > > > > > > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> > > > > > >   VDUSE_IOTLB_GET_INFO.
> > > > > > > * Use of array_index_nospec in VDUSE device ioctls.
> > > > > > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > > > > > * Move the umem mutex to asid struct so there is no contention between
> > > > > > >   ASIDs.
> > > > > > >
> > > > > > > RFC v2:
> > > > > > > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > > > > >   part of the struct is the same.
> > > > > > > ---
> > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 366 +++++++++++++++++++----------
> > > > > > >  include/uapi/linux/vduse.h         |  53 ++++-
> > > > > > >  2 files changed, 295 insertions(+), 124 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > index 767abcb7e375..786ab2378825 100644
> > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > @@ -41,6 +41,7 @@
> > > > > > >
> > > > > > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > > > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > > > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > > > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > > > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > > > > > >         struct mm_struct *mm;
> > > > > > >  };
> > > > > > >
> > > > > > > +struct vduse_as {
> > > > > > > +       struct vduse_iova_domain *domain;
> > > > > > > +       struct vduse_umem *umem;
> > > > > > > +       struct mutex mem_lock;
> > > > > >
> > > > > > Not related to this patch, but if I'm not wrong we have a 1:1 mapping
> > > > > > between domain and as. If this is true, can we use bounce_lock instead
> > > > > > of a new mem_lock? I see mem_lock is only used for synchronizing
> > > > > > umem reg/dereg, which is already synchronized with the domain rwlock.
> > > > > >
> > > > >
> > > > > I think you're right, but they work at different levels at the moment.
> > > > > The mem_lock is at the vduse_dev level and also protects the umem
> > > > > pointer, while bounce_lock lives in iova_domain.c.
> > > > >
> > > > > Maybe the right thing to do is to move umem into iova_domain. Yongji
> > > > > Xie, what do you think?
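> > > > >
> > > > > Just to illustrate the idea, an untested sketch and nothing more:
> > > > >
> > > > > struct vduse_iova_domain {
> > > > >         ...
> > > > >         struct vduse_umem *umem; /* protected by bounce_lock */
> > > > > };
> > > > >
> > > > > Then umem reg/dereg could take bounce_lock and the per-AS mem_lock
> > > > > would go away.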
> > > > >
> > > > > > > +};
> > > > > > > +
> > > > > > >  struct vduse_vq_group {
> > > > > > > +       rwlock_t as_lock;
> > > > > > > +       struct vduse_as *as; /* Protected by as_lock */
> > > > > > >         struct vduse_dev *dev;
> > > > > > >  };
> > > > > > >
> > > > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > > > >         struct vduse_vdpa *vdev;
> > > > > > >         struct device *dev;
> > > > > > >         struct vduse_virtqueue **vqs;
> > > > > > > -       struct vduse_iova_domain *domain;
> > > > > > > +       struct vduse_as *as;
> > > > > > >         char *name;
> > > > > > >         struct mutex lock;
> > > > > > >         spinlock_t msg_lock;
> > > > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > > > >         u32 vq_num;
> > > > > > >         u32 vq_align;
> > > > > > >         u32 ngroups;
> > > > > > > -       struct vduse_umem *umem;
> > > > > > > +       u32 nas;
> > > > > > >         struct vduse_vq_group *groups;
> > > > > > > -       struct mutex mem_lock;
> > > > > > >         unsigned int bounce_size;
> > > > > > >         struct mutex domain_lock;
> > > > > > >  };
> > > > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > >  }
> > > > > > >
> > > > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > > > >                                   u64 start, u64 last)
> > > > > > >  {
> > > > > > >         struct vduse_dev_msg msg = { 0 };
> > > > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > >                 return -EINVAL;
> > > > > > >
> > > > > > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > > > -       msg.req.iova.start = start;
> > > > > > > -       msg.req.iova.last = last;
> > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > > > +               msg.req.iova.start = start;
> > > > > > > +               msg.req.iova.last = last;
> > > > > > > +       } else {
> > > > > > > +               msg.req.iova_v2.start = start;
> > > > > > > +               msg.req.iova_v2.last = last;
> > > > > > > +               msg.req.iova_v2.asid = asid;
> > > > > > > +       }
> > > > > > >
> > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > >  }
> > > > > > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > > > >         return mask;
> > > > > > >  }
> > > > > > >
> > > > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > > > +                                      unsigned int group, unsigned int asid)
> > > > > > > +{
> > > > > > > +       write_lock(&dev->groups[group].as_lock);
> > > > > > > +       dev->groups[group].as = &dev->as[asid];
> > > > > > > +       write_unlock(&dev->groups[group].as_lock);
> > > > > > > +}
> > > > > > > +
> > > > > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > > > > >  {
> > > > > > >         int i;
> > > > > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > > > > >
> > > > > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > > > -       if (domain && domain->bounce_map)
> > > > > > > -               vduse_domain_reset_bounce_map(domain);
> > > > > > > +       for (i = 0; i < dev->nas; i++) {
> > > > > > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > > > +
> > > > > > > +               if (domain && domain->bounce_map)
> > > > > > > +                       vduse_domain_reset_bounce_map(domain);
> > > > > >
> > > > > > Btw, I see this:
> > > > > >
> > > > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > > > {
> > > > > >         if (!domain->bounce_map)
> > > > > >                 return;
> > > > > >
> > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > >         if (!domain->bounce_map)
> > > > > >                 goto unlock;
> > > > > >
> > > > > >
> > > > > > The bounce_map is checked twice, let's fix that.
> > > > > >
> > > > >
> > > > > Double checked locking to avoid taking the lock?
> > > >
> > > > I don't know, but I think we don't care too much about the performance
> > > > of vduse_domain_reset_bounce_map().
> > > >
> > > > > I don't think it is
> > > > > worth keeping, as it is not in the hot path anyway. But that
> > > > > would also be another patch, independent of this series, wouldn't it?
> > > >
> > > > Yes, it's another independent issue I just found when reviewing this patch.
> > > >
> > > > >
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       for (i = 0; i < dev->ngroups; i++)
> > > > > > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> > > > > >
> > > > > > Note that this function still does:
> > > > > >
> > > > > >                 vq->vq_group = 0;
> > > > > >
> > > > > > Which is wrong.
> > > > > >
> > > > >
> > > > > Right, removing it for the next version. Thanks for the catch!
> > > > >
> > > > > > >
> > > > > > >         down_write(&dev->rwsem);
> > > > > > >
> > > > > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > >
> > > > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > > > +                               unsigned int asid)
> > > > > > > +{
> > > > > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > > +       struct vduse_dev_msg msg = { 0 };
> > > > > > > +       int r;
> > > > > > > +
> > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > > > > +               return -EINVAL;
> > > > > >
> > > > > > If we forbid setting the group asid after DRIVER_OK, why do we still need a
> > > > > > rwlock?
> > > > >
> > > > > virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> > > > > vrings in the bounce buffer, for example. If you're ok with that, I'm
> > > > > ok with removing the lock, as all the calls are issued by the driver
> > > > > setup process anyway. Or should we just keep it for alloc?
> > > >
> > > > I see, then I think we need to keep it. The reason is that there's
> > > > no guarantee that alloc() is called before DRIVER_OK.
> > > >
> > > > >
> > > > > Anyway, I think I misunderstood your comment from [1] then.
> > > > >
> > > > > > All we need to do is to synchronize set_group_asid() with
> > > > > > set_status()/reset()?
> > > > > >
> > > > >
> > > > > That's also a good one. There is no synchronization if one thread calls
> > > > > reset and then the device is set up from another thread. As all this
> > > > > situation is still hypothetical, because virtio_vdpa does not support
> > > > > set_group_asid,
> > > >
> > > > Right.
> > > >
> > > > > and the vhost one is already protected by the vhost lock, do
> > > > > we need it?
> > > >
> > > > Let's add a TODO in the code.
> > > >
> > >
> > > Can you expand on this? I meant that there will be no synchronization
> > > if we remove the rwlock (or similar). If we keep the rwlock we don't
> > > need to add any TODO, or am I missing something?
> >
> > I meant even with the rwlock we don't synchronize set_group_asid() and
> > set_status(). Since you said vhost has been synchronized, I think we
> > should either
> >
> > 1) document the synchronization that needs to be done in the upper layer or
> > 2) add a todo to synchronize the set_status() and set_group_asid()
> >
>
> With the VDUSE messages they are synchronized by dev->msg_lock, so the
> vduse module always has a coherent status from both sides: the VDUSE
> userland device and virtio_vdpa. If you want vduse not to trust vdpa,
> we can go the extra mile and make dev->status atomic for both device
> features and group ASID. Would that work?
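>
> Something like this, for example (only a sketch of the idea; the
> pairing site in vduse_dev_set_status() is assumed):
>
> static int vduse_set_group_asid(struct vdpa_device *vdpa,
>                                 unsigned int group, unsigned int asid)
> {
>         struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>         ...
>         /* Pairs with smp_store_release(&dev->status, status) */
>         if (smp_load_acquire(&dev->status) & VIRTIO_CONFIG_S_DRIVER_OK)
>                 return -EINVAL;
>         ...
> }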

Yes, something like this. For example, msg_lock is only used to
synchronize the IPC with userspace; it doesn't synchronize
set_status() with set_group_asid().

>
> > >
> > > > >
> > > > > > Or if you want to synchronize map ops with set_status() that looks
> > > > > > like an independent thing (hardening).
> > > > > >
> > > > > > > +
> > > > > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > > > +       msg.req.vq_group_asid.group = group;
> > > > > > > +       msg.req.vq_group_asid.asid = asid;
> > > > > > > +
> > > > > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > > > > +       if (r < 0)
> > > > > > > +               return r;
> > > > > > > +
> > > > > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > > > > >
> > > > > > I'm not sure this has been discussed before, but I think it would be
> > > > > > better to introduce a new ioctl to get group -> as mapping. This helps
> > > > > > to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > > > > > require the userspace to poll vduse fd before DRIVER_OK.
> > > > > >
> > >
> > > The userspace VDUSE device must poll the vduse fd to get the DRIVER_OK
> > > anyway, so it cannot avoid polling the vduse device. What are the
> > > reasons to avoid vduse_dev_msg_sync()?
> >
> > One less place where the kernel has to synchronize with userspace.
> >
>
> But the ioctl alternative is more synchronization from my POV, not
> less. Instead of receiving notifications that we could even batch, the
> VDUSE userland device needs to issue many synchronous ioctls.

A synchronous ioctl seems better than tricks like letting the
kernel wait for userspace to respond (with a timeout).

>
> > >
> > > > >
> > > > > I'm fine with that, but how do we communicate that they have changed?
> > > >
> > > > Since we forbid changing the group->as mapping after DRIVER_OK,
> > > > userspace just needs to use that ioctl once after DRIVER_OK.
> > > >
> > >
> > > But the userland VDUSE device needs to know when ASIDs are set by the
> > > driver and will not change anymore. In this series it is solved by the
> > > order of the messages, but now we would need a way to know that point
> > > in time. One idea is to issue this new ioctl when it receives the
> > > VDUSE_SET_STATUS msg.
> >
> > Yes.
> >
> > >
> > > If we do it that way there is still a window where the hypothetical
> > > (malicious) virtio_vdpa driver can read and write vq groups from
> > > different threads.
> >
> > See above, we need to add synchronization in either vdpa or virtio_vdpa.
> >
> > > It could issue a set_group_asid after the VDUSE
> > > device returns from the ioctl but before dev->status has been
> > > updated.
> >
> > In the case of VDUSE I think neither side should trust the other
> > side. So userspace needs to be prepared for this, and so does the
> > driver. Since the memory accesses are initiated from userspace, it
> > should not perform any memory access before the ioctl that fetches the
> > group->asid mapping.
> >
>
> I don't follow this. In the case of virtio_vdpa the ASID groups and
> features are handled by the kernel entirely. And in the case of
> vhost_vdpa they are protected by the vhost_dev->mutex lock. So can we
> trust them?

The problem is there's no way for VDUSE to know whether or not the
upper layer is vhost.

>
> > >
> > > I don't think we should protect that, but if we want to do it we
> > > should protect that part either by acquiring the rwlock and trusting
> > > the vduse_dev_msg_sync timeout, or by proceeding with atomics,
> > > smp_store_release / smp_load_acquire, read and write barriers...
> > >
> > > Note that I still think this is overthinking it. We have the same
> > > problems with driver features, where changing bits like the packed vq
> > > one changes the behavior of the vDPA callbacks and could
> > > desynchronize the vduse kernel module and the userland device.
> >
> > Probably, but for driver features, it's too late to do the change.
> >
>
> Why is it too late? We just need to add the same synchronization to both
> driver features and ASID groups. The change is not visible to
> virtio_vdpa, vhost_vdpa, or the VDUSE userland device at all.

Ok, I think I get you. I thought we were talking about reducing the
vduse_dev_msg_sync() calls when doing SET_FEATURES. But you meant the
synchronization with set_status(). Yes, we can do that.

>
> > > But since
> > > the virtio_vdpa and vduse kernel modules run in the same kernel, they
> > > should be able to trust each other.
> >
> > Better not, usually the lower layer (vdpa/vduse) should not trust the
> > upper layer (virtio-vdpa) in this case.
> >
>
> Ok, adding the hardening in the next version!
>
> > But if you stick to the method with vduse_dev_msg_sync, I'm also fine with that.
> >
> > >
> > > > > Or how to communicate to the driver that the device does not accept
> > > > > the assignment of the ASID to the group?
> > > >
> > > > See above.
> > > >
> > >
> > > I didn't find the answer :(.
> > >
> > > Let me put an example with QEMU and vhost_vdpa:
> > > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and asid.
> > > - vduse cannot send this information to the device as it must wait for
> > > the ioctl.
> >
> > Which ioctl did you mean in this case?
> >
>
> The new ioctl from the userland VDUSE device that you're proposing to
> accept or reject the group ASID assignment. Let's call it
> VDUSE_GET_GROUP_ASID, and let me rewrite the flow from that moment:
>
> - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group (in
> vdpa->ngroups range) and a valid asid (in vdpa->nas range)
> - The vduse kernel module cannot send the information to the VDUSE
> userland device; it must wait until the VDUSE userland device calls
> the new VDUSE_GET_GROUP_ASID. That will not happen until QEMU calls
> VHOST_VDPA_SET_STATUS with DRIVER_OK, but QEMU will not call
> VHOST_VDPA_SET_STATUS until the kernel returns from
> VHOST_VDPA_SET_GROUP_ASID. The vduse kernel module needs to return
> success, without knowing if the device will accept it in the future or
> not.
> - Now QEMU sends DRIVER_OK, so the vduse kernel module forwards it to
> the VDUSE userland device. The VDUSE userland device then calls
> VDUSE_GET_GROUP_ASID, and the flow continues.
> - The vduse userland device doesn't accept the vq group ASID map for
> whatever reason, so it returns an error through another ioctl
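>
> (For clarity, a rough sketch of what I understand you're proposing.
> The struct is the one this series already adds for the request
> payload; the ioctl request code is made up:
>
> /* in: group; out: the asid the driver assigned to it */
> struct vduse_vq_group_asid {
>         __u32 group;
>         __u32 asid;
> };
>
> #define VDUSE_GET_GROUP_ASID \
>         _IOWR(VDUSE_BASE, 0x1f, struct vduse_vq_group_asid)
>
> The device would call it once per vq group right after it sees
> DRIVER_OK in the VDUSE_SET_STATUS message.)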

If the asid is less than the total number of address spaces, is there
any reason for the userspace to reject such a configuration?

Thanks


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-25  2:23               ` Jason Wang
@ 2025-12-26 11:38                 ` Eugenio Perez Martin
  2025-12-29  2:56                   ` Jason Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Eugenio Perez Martin @ 2025-12-26 11:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Thu, Dec 25, 2025 at 3:23 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 24, 2025 at 3:39 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Dec 24, 2025 at 1:20 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, Dec 23, 2025 at 9:16 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > > > > > group.  This enables mapping each group into a distinct memory space.
> > > > > > > >
> > > > > > > > The vq group to ASID association is now protected by a rwlock.  But the
> > > > > > > > mutex domain_lock keeps protecting the domains of all ASIDs, as some
> > > > > > > > operations, like the ones related to the bounce buffer size, still
> > > > > > > > require locking all the ASIDs.
> > > > > > > >
> > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > >
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > index 767abcb7e375..786ab2378825 100644
> > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > @@ -41,6 +41,7 @@
> > > > > > > >
> > > > > > > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > > > > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > > > > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > > > > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > > > > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > > > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > > > > > > >         struct mm_struct *mm;
> > > > > > > >  };
> > > > > > > >
> > > > > > > > +struct vduse_as {
> > > > > > > > +       struct vduse_iova_domain *domain;
> > > > > > > > +       struct vduse_umem *umem;
> > > > > > > > +       struct mutex mem_lock;
> > > > > > >
> > > > > > > Not related to this patch, but if I'm not wrong we have a 1:1 mapping
> > > > > > > between domain and as. If this is true, can we use bounce_lock instead
> > > > > > > of a new mem_lock? I see mem_lock is only used for synchronizing
> > > > > > > umem reg/dereg, which is already synchronized with the domain rwlock.
> > > > > > >
> > > > > >
> > > > > > I think you're right, but they work at different levels at the moment.
> > > > > > The mem_lock is at the vduse_dev level and also protects the umem
> > > > > > pointer, while bounce_lock lives in iova_domain.c.
> > > > > >
> > > > > > Maybe the right thing to do is to move umem into iova_domain. Yongji
> > > > > > Xie, what do you think?
> > > > > >
> > > > > > > > +};
> > > > > > > > +
> > > > > > > >  struct vduse_vq_group {
> > > > > > > > +       rwlock_t as_lock;
> > > > > > > > +       struct vduse_as *as; /* Protected by as_lock */
> > > > > > > >         struct vduse_dev *dev;
> > > > > > > >  };
> > > > > > > >
> > > > > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > > > > >         struct vduse_vdpa *vdev;
> > > > > > > >         struct device *dev;
> > > > > > > >         struct vduse_virtqueue **vqs;
> > > > > > > > -       struct vduse_iova_domain *domain;
> > > > > > > > +       struct vduse_as *as;
> > > > > > > >         char *name;
> > > > > > > >         struct mutex lock;
> > > > > > > >         spinlock_t msg_lock;
> > > > > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > > > > >         u32 vq_num;
> > > > > > > >         u32 vq_align;
> > > > > > > >         u32 ngroups;
> > > > > > > > -       struct vduse_umem *umem;
> > > > > > > > +       u32 nas;
> > > > > > > >         struct vduse_vq_group *groups;
> > > > > > > > -       struct mutex mem_lock;
> > > > > > > >         unsigned int bounce_size;
> > > > > > > >         struct mutex domain_lock;
> > > > > > > >  };
> > > > > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > > > > >                                   u64 start, u64 last)
> > > > > > > >  {
> > > > > > > >         struct vduse_dev_msg msg = { 0 };
> > > > > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > >                 return -EINVAL;
> > > > > > > >
> > > > > > > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > > > > -       msg.req.iova.start = start;
> > > > > > > > -       msg.req.iova.last = last;
> > > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > > > > +               msg.req.iova.start = start;
> > > > > > > > +               msg.req.iova.last = last;
> > > > > > > > +       } else {
> > > > > > > > +               msg.req.iova_v2.start = start;
> > > > > > > > +               msg.req.iova_v2.last = last;
> > > > > > > > +               msg.req.iova_v2.asid = asid;
> > > > > > > > +       }
> > > > > > > >
> > > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > > >  }
> > > > > > > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > > > > >         return mask;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > > > > +                                      unsigned int group, unsigned int asid)
> > > > > > > > +{
> > > > > > > > +       write_lock(&dev->groups[group].as_lock);
> > > > > > > > +       dev->groups[group].as = &dev->as[asid];
> > > > > > > > +       write_unlock(&dev->groups[group].as_lock);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > > > > > >  {
> > > > > > > >         int i;
> > > > > > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > > > > > >
> > > > > > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > > > > -       if (domain && domain->bounce_map)
> > > > > > > > -               vduse_domain_reset_bounce_map(domain);
> > > > > > > > +       for (i = 0; i < dev->nas; i++) {
> > > > > > > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > > > > +
> > > > > > > > +               if (domain && domain->bounce_map)
> > > > > > > > +                       vduse_domain_reset_bounce_map(domain);
> > > > > > >
> > > > > > > Btw, I see this:
> > > > > > >
> > > > > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > > > > {
> > > > > > >         if (!domain->bounce_map)
> > > > > > >                 return;
> > > > > > >
> > > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > > >         if (!domain->bounce_map)
> > > > > > >                 goto unlock;
> > > > > > >
> > > > > > >
> > > > > > > The bounce_map is checked twice, let's fix that.
> > > > > > >
> > > > > >
> > > > > > Double checked locking to avoid taking the lock?
> > > > >
> > > > > I don't know, but I think we don't care too much about the performance
> > > > > of vduse_domain_reset_bounce_map().
> > > > >
> > > > > > I don't think it is
> > > > > > worth keeping, as it is not in the hot path anyway. But that
> > > > > > would also be another patch, independent of this series, wouldn't it?
> > > > >
> > > > > Yes, it's another independent issue I just found when reviewing this patch.
> > > > >
> > > > > >
> > > > > > > > +       }
> > > > > > > > +
> > > > > > > > +       for (i = 0; i < dev->ngroups; i++)
> > > > > > > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> > > > > > >
> > > > > > > Note that this function still does:
> > > > > > >
> > > > > > >                 vq->vq_group = 0;
> > > > > > >
> > > > > > > Which is wrong.
> > > > > > >
> > > > > >
> > > > > > Right, removing it for the next version. Thanks for the catch!
> > > > > >
> > > > > > > >
> > > > > > > >         down_write(&dev->rwsem);
> > > > > > > >
> > > > > > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > > > > >         return ret;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > > > > +                               unsigned int asid)
> > > > > > > > +{
> > > > > > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > > > +       struct vduse_dev_msg msg = { 0 };
> > > > > > > > +       int r;
> > > > > > > > +
> > > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > > > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > > > > > +               return -EINVAL;
> > > > > > >
> > > > > > > If we forbid setting the group asid after DRIVER_OK, why do we still need a
> > > > > > > rwlock?
> > > > > >
> > > > > > virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> > > > > > vrings in the bounce buffer, for example. If you're ok with that, I'm
> > > > > > ok with removing the lock, as all the calls are issued by the driver
> > > > > > setup process anyway. Or should we just keep it for alloc?
> > > > >
> > > > > I see, then I think we need to keep it. The reason is that there's
> > > > > no guarantee that alloc() is called before DRIVER_OK.
> > > > >
> > > > > >
> > > > > > Anyway, I think I misunderstood your comment from [1] then.
> > > > > >
> > > > > > > All we need to do is to synchronize set_group_asid() with
> > > > > > > set_status()/reset()?
> > > > > > >
> > > > > >
> > > > > > That's also a good one. There is no synchronization if one thread calls
> > > > > > reset and then the device is set up from another thread. As all this
> > > > > > situation is still hypothetical, because virtio_vdpa does not support
> > > > > > set_group_asid,
> > > > >
> > > > > Right.
> > > > >
> > > > > > and the vhost one is already protected by the vhost lock, do
> > > > > > we need it?
> > > > >
> > > > > Let's add a TODO in the code.
> > > > >
> > > >
> > > > Can you expand on this? I meant that there will be no synchronization
> > > > if we remove the rwlock (or similar). If we keep the rwlock we don't
> > > > need to add any TODO, or am I missing something?
> > >
> > > I meant even with the rwlock we don't synchronize set_group_asid() and
> > > set_status(). Since you said vhost has been synchronized, I think we
> > > should either
> > >
> > > 1) document the synchronization that needs to be done in the upper layer or
> > > 2) add a todo to synchronize the set_status() and set_group_asid()
> > >
> >
> > With the VDUSE messages they are synchronized by dev->msg_lock, so the
> > vduse module always has a coherent status from both sides: the VDUSE
> > userland device and virtio_vdpa. If you want vduse not to trust vdpa,
> > we can go the extra mile and make dev->status atomic for both device
> > features and group ASID. Would that work?
>
> Yes, something like this. For example, msg_lock is only used to
> synchronize the IPC with userspace; it doesn't synchronize
> set_status() with set_group_asid().
>
> >
> > > >
> > > > > >
> > > > > > > Or if you want to synchronize map ops with set_status() that looks
> > > > > > > like an independent thing (hardening).
> > > > > > >
> > > > > > > > +
> > > > > > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > > > > +       msg.req.vq_group_asid.group = group;
> > > > > > > > +       msg.req.vq_group_asid.asid = asid;
> > > > > > > > +
> > > > > > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > > > > > +       if (r < 0)
> > > > > > > > +               return r;
> > > > > > > > +
> > > > > > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > > > > > >
> > > > > > > I'm not sure this has been discussed before, but I think it would be
> > > > > > > better to introduce a new ioctl to get group -> as mapping. This helps
> > > > > > > to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > > > > > > require the userspace to poll vduse fd before DRIVER_OK.
> > > > > > >
> > > >
> > > > The userspace VDUSE device must poll the vduse fd to get the DRIVER_OK
> > > > anyway, so it cannot avoid polling the vduse device. What are the
> > > > reasons to avoid vduse_dev_msg_sync()?
> > >
> > > One less place where the kernel has to synchronize with userspace.
> > >
> >
> > But the ioctl alternative is more synchronization from my POV, not
> > less. Instead of receiving notifications that we could even batch, the
> > VDUSE userland device needs to issue many synchronous ioctls.
>
> A synchronous ioctl seems better than tricks like letting the
> kernel wait for userspace to respond (with a timeout).
>

While I don't love the timeout part, I don't see it that way. Adding a
"waiting for userspace device ioctl" state to the vduse device
complicates the code in many different places, and it is hard to
return an error from that point. But it is interesting to have it on
the table for sure.

> >
> > > >
> > > > > >
> > > > > > I'm fine with that, but how do we communicate that they have changed?
> > > > >
> > > > > Since we forbid changing the group->as mapping after DRIVER_OK,
> > > > > userspace just needs to use that ioctl once after DRIVER_OK.
> > > > >
> > > >
> > > > But the userland VDUSE device needs to know when ASIDs are set by the
> > > > driver and will not change anymore. In this series it is solved by the
> > > > order of the messages, but now we would need a way to know that point
> > > > in time. One idea is to issue this new ioctl when it receives the
> > > > VDUSE_SET_STATUS msg.
> > >
> > > Yes.
> > >
> > > >
> > > > If we do it that way there is still a window where the hypothetical
> > > > (malicious) virtio_vdpa driver can read and write vq groups from
> > > > different threads.
> > >
> > > See above, we need to add synchronization in either vdpa or virtio_vdpa.
> > >
> > > > It could issue a set_group_asid after the VDUSE
> > > > device returns from the ioctl but before dev->status has been
> > > > updated.
> > >
> > > In the case of VDUSE I think neither side should trust the other
> > > side. So userspace needs to be prepared for this, and so does the
> > > driver. Since the memory accesses are initiated from userspace, it
> > > should not perform any memory access before the ioctl that fetches the
> > > group->asid mapping.
> > >
> >
> > I don't follow this. In the case of virtio_vdpa the ASID groups and
> > features are handled by the kernel entirely. And in the case of
> > vhost_vdpa they are protected by the vhost_dev->mutex lock. So can we
> > trust them?
>
> The problem is there's no way for VDUSE to know whether or not the
> upper layer is vhost.
>

My point is not that VDUSE can trust vhost_vdpa but not virtio_vdpa,
or the reverse, as if we could set a conditional flag in VDUSE and
then modify the behavior based on it.

My point is that the code needs to be consistent in how the VDUSE
kernel module trusts the vdpa driver. It does not make sense to trust
that it will never issue a state change (like a reset) concurrently
with a DMA operation [1], but then not trust it about changing the
feature flags.

At this moment VDUSE trusts that the vDPA driver does not launch these
operations concurrently, and each vDPA driver has its own way to
synchronize them. In the case of vhost it is the vhost_dev mutex, and
virtio_vdpa has other ways. I'll add the hardening here, but this can
of worms extends through the whole chain. How does virtio_vdpa trust
that virtio_net will never call vdpa_set_features after DRIVER_OK? And
virtio_net with virtio_ring? Should we harden all of that?

virtio_net does protect against this, as it is the one interacting with
userland. Same with vhost_vdpa.

> >
> > > >
> > > > I don't think we should protect that, but if we want to do it we
> > > > should protect that part either by acquiring the rwlock and trusting
> > > > the vduse_dev_msg_sync timeout, or by proceeding with atomics,
> > > > smp_store_release / smp_load_acquire, read and write barriers...
> > > >
> > > > Note that I still think this is overthinking it. We have the same
> > > > problems with driver features, where changing bits like the packed vq
> > > > one changes the behavior of the vDPA callbacks and could
> > > > desynchronize the vduse kernel module and the userland device.
> > >
> > > Probably, but for driver features, it's too late to do the change.
> > >
> >
> > Why is it too late? We just need to add the same synchronization to both
> > driver features and ASID groups. The change is not visible to
> > virtio_vdpa, vhost_vdpa, or the VDUSE userland device at all.
>
> Ok, I think I get you. I thought we were talking about reducing the
> vduse_dev_msg_sync() calls when doing SET_FEATURES. But you meant the
> synchronization with set_status(). Yes, we can do that.
>
> >
> > > > But since
> > > > the virtio_vdpa and vduse kernel modules run in the same kernel, they
> > > > should be able to trust each other.
> > >
> > > Better not, usually the lower layer (vdpa/vduse) should not trust the
> > > upper layer (virtio-vdpa) in this case.
> > >
> >
> > Ok, adding the hardening in the next version!
> >
> > > But if you stick to the method with vduse_dev_msg_sync, I'm also fine with that.
> > >
> > > >
> > > > > > Or how to communicate to the driver that the device does not accept
> > > > > > the assignment of the ASID to the group?
> > > > >
> > > > > See above.
> > > > >
> > > >
> > > > I didn't find the answer :(.
> > > >
> > > > Let me put an example with QEMU and vhost_vdpa:
> > > > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and asid.
> > > > - vduse cannot send this information to the device as it must wait for
> > > > the ioctl.
> > >
> > > Which ioctl did you mean in this case?
> > >
> >
> > The new ioctl from the userland VDUSE device that you're proposing to
> > accept or reject the group ASID assignment. Let's call it
> > VDUSE_GET_GROUP_ASID, and let me rewrite the flow from that moment:
> >
> > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group (in
> > vdpa->ngroups range) and a valid asid (in vdpa->nas range)
> > - The vduse kernel module cannot send the information to the VDUSE
> > userland device; it must wait until the VDUSE userland device calls
> > the new VDUSE_GET_GROUP_ASID. That will not happen until QEMU calls
> > VHOST_VDPA_SET_STATUS with DRIVER_OK, but QEMU will not call
> > VHOST_VDPA_SET_STATUS until the kernel returns from
> > VHOST_VDPA_SET_GROUP_ASID. The vduse kernel module needs to return
> > success, without knowing if the device will accept it in the future or
> > not.
> > - Now QEMU sends DRIVER_OK, so the vduse kernel module forwards it to
> > the VDUSE userland device. The VDUSE userland device then calls
> > VDUSE_GET_GROUP_ASID, and the flow continues.
> > - The vduse userland device doesn't accept the vq group ASID map for
> > whatever reason, so it returns an error through another ioctl
>
> If the asid is less than the total number of address spaces, is there
> any reason for the userspace to reject such a configuration?
>

That's a good question I can only answer with "allowing userspace to
communicate back an error here will save us having to expand the
communication protocol when we find the reason".

If we follow the logic that userspace (or vDPA devices) cannot fail
as long as the ASID is in range and the status is !DRIVER_OK, should
we just move all the checks for a valid ASID range and valid state to
the vdpa core and make the .set_group_asid operation return void? That
would be a good change indeed.
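
A minimal sketch of that idea, assuming only vdpa->ngroups, vdpa->nas
and the existing get_status/set_group_asid config ops (with the latter
changed to return void):

static int vdpa_set_group_asid(struct vdpa_device *vdev,
                               unsigned int group, unsigned int asid)
{
        if (group >= vdev->ngroups || asid >= vdev->nas)
                return -EINVAL;

        /* Changing the group -> asid mapping after DRIVER_OK is forbidden */
        if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_DRIVER_OK)
                return -EBUSY;

        vdev->config->set_group_asid(vdev, group, asid);
        return 0;
}

That way every parent driver, VDUSE included, could assume valid
arguments.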

What should VDUSE do if the device replies VDUSE_REQ_RESULT_FAILED?
(Or, in your case, if the device never knows the assigned ASID because
it never calls the ioctl?)

(Actually, vhost_vdpa already checks for asid < vdpa->nas, so that part
is duplicated. It will be simpler in the next version, thanks!)

[1] https://patchew.org/linux/20251217112414.2374672-1-eperezma@redhat.com/20251217112414.2374672-8-eperezma@redhat.com/#CACGkMEs6-j8h7X8navZGC0wraan4WSLO3NMFx++Fe0uaExWZmA@mail.gmail.com


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v10 7/8] vduse: add vq group asid support
  2025-12-26 11:38                 ` Eugenio Perez Martin
@ 2025-12-29  2:56                   ` Jason Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Jason Wang @ 2025-12-29  2:56 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S . Tsirkin, Maxime Coquelin, Laurent Vivier,
	virtualization, linux-kernel, Stefano Garzarella, Yongji Xie,
	Xuan Zhuo, Cindy Lu

On Fri, Dec 26, 2025 at 7:39 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Dec 25, 2025 at 3:23 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Dec 24, 2025 at 3:39 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Dec 24, 2025 at 1:20 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Tue, Dec 23, 2025 at 9:16 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Tue, Dec 23, 2025 at 2:11 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Dec 18, 2025 at 9:11 PM Eugenio Perez Martin
> > > > > > <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, Dec 18, 2025 at 7:45 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Dec 17, 2025 at 7:24 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > > > > > > group.  This enables mapping each group into a distinct memory space.
> > > > > > > > >
> > > > > > > > > The vq group to ASID association is now protected by a rwlock.  But the
> > > > > > > > > mutex domain_lock keeps protecting the domains of all ASIDs, as some
> > > > > > > > > operations, like the ones related to the bounce buffer size, still
> > > > > > > > > require locking all the ASIDs.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > index 767abcb7e375..786ab2378825 100644
> > > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > @@ -41,6 +41,7 @@
> > > > > > > > >
> > > > > > > > >  #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > > > > > >  #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > > > > > >  #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > > > > > >  #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > > > > > >  #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > > > > > @@ -86,7 +87,15 @@ struct vduse_umem {
> > > > > > > > >         struct mm_struct *mm;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > > +struct vduse_as {
> > > > > > > > > +       struct vduse_iova_domain *domain;
> > > > > > > > > +       struct vduse_umem *umem;
> > > > > > > > > +       struct mutex mem_lock;
> > > > > > > >
> > > > > > > > Not related to this patch, but if I'm not wrong we have a 1:1 mapping
> > > > > > > > between domain and as. If this is true, can we use bounce_lock instead
> > > > > > > > of a new mem_lock? As far as I can see, mem_lock is only used to
> > > > > > > > synchronize umem reg/dereg, which is already synchronized with the
> > > > > > > > domain rwlock.
> > > > > > > >
> > > > > > >
> > > > > > > I think you're right, but they work at different levels at the moment.
> > > > > > > The mem_lock lives in vduse_dev and also protects the umem pointer,
> > > > > > > while bounce_lock lives in iova_domain.c.
> > > > > > >
> > > > > > > Maybe the right thing to do is to move umem into iova_domain. Yongji
> > > > > > > Xie, what do you think?
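> > > > > > >
> > > > > > > Just to illustrate, a minimal sketch of that idea (assuming we keep
> > > > > > > the current member names; not part of this series):
> > > > > > >
> > > > > > > /* drivers/vdpa/vdpa_user/iova_domain.h */
> > > > > > > struct vduse_iova_domain {
> > > > > > >         /* ... existing members ... */
> > > > > > >         /* Moved here from the device; now protected by the domain's
> > > > > > >          * bounce_lock instead of a separate mem_lock.
> > > > > > >          */
> > > > > > >         struct vduse_umem *umem;
> > > > > > > };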
> > > > > > >
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > >  struct vduse_vq_group {
> > > > > > > > > +       rwlock_t as_lock;
> > > > > > > > > +       struct vduse_as *as; /* Protected by as_lock */
> > > > > > > > >         struct vduse_dev *dev;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > > > > > >         struct vduse_vdpa *vdev;
> > > > > > > > >         struct device *dev;
> > > > > > > > >         struct vduse_virtqueue **vqs;
> > > > > > > > > -       struct vduse_iova_domain *domain;
> > > > > > > > > +       struct vduse_as *as;
> > > > > > > > >         char *name;
> > > > > > > > >         struct mutex lock;
> > > > > > > > >         spinlock_t msg_lock;
> > > > > > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > > > > > >         u32 vq_num;
> > > > > > > > >         u32 vq_align;
> > > > > > > > >         u32 ngroups;
> > > > > > > > > -       struct vduse_umem *umem;
> > > > > > > > > +       u32 nas;
> > > > > > > > >         struct vduse_vq_group *groups;
> > > > > > > > > -       struct mutex mem_lock;
> > > > > > > > >         unsigned int bounce_size;
> > > > > > > > >         struct mutex domain_lock;
> > > > > > > > >  };
> > > > > > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > > > > > >                                   u64 start, u64 last)
> > > > > > > > >  {
> > > > > > > > >         struct vduse_dev_msg msg = { 0 };
> > > > > > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > > > >                 return -EINVAL;
> > > > > > > > >
> > > > > > > > >         msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > > > > > -       msg.req.iova.start = start;
> > > > > > > > > -       msg.req.iova.last = last;
> > > > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > > > > > +               msg.req.iova.start = start;
> > > > > > > > > +               msg.req.iova.last = last;
> > > > > > > > > +       } else {
> > > > > > > > > +               msg.req.iova_v2.start = start;
> > > > > > > > > +               msg.req.iova_v2.last = last;
> > > > > > > > > +               msg.req.iova_v2.asid = asid;
> > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > >         return vduse_dev_msg_sync(dev, &msg);
> > > > > > > > >  }
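> > > > > > > >
> > > > > > > > So with V1 the update message carries the ASID next to the range. A
> > > > > > > > plausible shape for the v2 range struct, assuming the elided
> > > > > > > > include/uapi/linux/vduse.h hunk matches the accesses above (the
> > > > > > > > struct name here is a guess):
> > > > > > > >
> > > > > > > > struct vduse_iova_range_v2 {
> > > > > > > >         __u64 start;
> > > > > > > >         __u64 last;
> > > > > > > >         __u32 asid;
> > > > > > > > };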
> > > > > > > > > @@ -436,14 +450,29 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > > > > > >         return mask;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > > > > > +                                      unsigned int group, unsigned int asid)
> > > > > > > > > +{
> > > > > > > > > +       write_lock(&dev->groups[group].as_lock);
> > > > > > > > > +       dev->groups[group].as = &dev->as[asid];
> > > > > > > > > +       write_unlock(&dev->groups[group].as_lock);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > >  static void vduse_dev_reset(struct vduse_dev *dev)
> > > > > > > > >  {
> > > > > > > > >         int i;
> > > > > > > > > -       struct vduse_iova_domain *domain = dev->domain;
> > > > > > > > >
> > > > > > > > >         /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > > > > > -       if (domain && domain->bounce_map)
> > > > > > > > > -               vduse_domain_reset_bounce_map(domain);
> > > > > > > > > +       for (i = 0; i < dev->nas; i++) {
> > > > > > > > > +               struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > > > > > +
> > > > > > > > > +               if (domain && domain->bounce_map)
> > > > > > > > > +                       vduse_domain_reset_bounce_map(domain);
> > > > > > > >
> > > > > > > > Btw, I see this:
> > > > > > > >
> > > > > > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > > > > > {
> > > > > > > >         if (!domain->bounce_map)
> > > > > > > >                 return;
> > > > > > > >
> > > > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > > > >         if (!domain->bounce_map)
> > > > > > > >                 goto unlock;
> > > > > > > >
> > > > > > > >
> > > > > > > > The bounce_map is checked twice, let's fix that.
> > > > > > > >
> > > > > > >
> > > > > > > Double-checked locking to avoid taking the lock?
> > > > > >
> > > > > > I don't know, but I think we don't care too much about the performance
> > > > > > of vduse_domain_reset_bounce_map().
> > > > > >
> > > > > > > I don't think it is
> > > > > > > worth keeping, as it is not in the hot path anyway. But that
> > > > > > > would also be another patch, independent of this series, wouldn't it?
> > > > > >
> > > > > > Yes, it's another independent issue I just found when reviewing this patch.
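> > > > > >
> > > > > > Something like this should do (untested sketch; the rest of the
> > > > > > function stays as is):
> > > > > >
> > > > > > void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> > > > > > {
> > > > > >         spin_lock(&domain->iotlb_lock);
> > > > > >         if (!domain->bounce_map)
> > > > > >                 goto unlock;
> > > > > >
> > > > > >         /* ... the actual bounce map reset, unchanged ... */
> > > > > > unlock:
> > > > > >         spin_unlock(&domain->iotlb_lock);
> > > > > > }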
> > > > > >
> > > > > > >
> > > > > > > > > +       }
> > > > > > > > > +
> > > > > > > > > +       for (i = 0; i < dev->ngroups; i++)
> > > > > > > > > +               vduse_set_group_asid_nomsg(dev, i, 0);
> > > > > > > >
> > > > > > > > Note that this function still does:
> > > > > > > >
> > > > > > > >                 vq->vq_group = 0;
> > > > > > > >
> > > > > > > > Which is wrong.
> > > > > > > >
> > > > > > >
> > > > > > > Right, removing it for the next version. Thanks for the catch!
> > > > > > >
> > > > > > > > >
> > > > > > > > >         down_write(&dev->rwsem);
> > > > > > > > >
> > > > > > > > > @@ -623,6 +652,30 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > > > > > >         return ret;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > > > > > +                               unsigned int asid)
> > > > > > > > > +{
> > > > > > > > > +       struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > > > > +       struct vduse_dev_msg msg = { 0 };
> > > > > > > > > +       int r;
> > > > > > > > > +
> > > > > > > > > +       if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > > > > > +           group >= dev->ngroups || asid >= dev->nas ||
> > > > > > > > > +           dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > > > > > > +               return -EINVAL;
> > > > > > > >
> > > > > > > > If we forbid setting group asid for !DRIVER_OK, why do we still need a
> > > > > > > > rwlock?
> > > > > > >
> > > > > > > virtio_map_ops->alloc is still called before DRIVER_OK to allocate the
> > > > > > > vrings in the bounce buffer, for example. If you're ok with that, I'm
> > > > > > > ok with removing the lock, as all the calls are issued by the driver
> > > > > > > setup process anyway. Or should we just keep it for alloc?
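> > > > > > >
> > > > > > > For context, the read side in the map ops presumably pins the
> > > > > > > group's current AS around the domain access, roughly like this
> > > > > > > (sketch, details elided):
> > > > > > >
> > > > > > > read_lock(&group->as_lock);
> > > > > > > domain = group->as->domain;
> > > > > > > /* ... bounce map / alloc through the domain ... */
> > > > > > > read_unlock(&group->as_lock);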
> > > > > >
> > > > > > I see, then I think we need to keep that. The reason is that there's
> > > > > > no guarantee that the alloc() must be called before DRIVER_OK.
> > > > > >
> > > > > > >
> > > > > > > Anyway, I think I misunderstood your comment from [1] then.
> > > > > > >
> > > > > > > > All we need to do is to synchronize set_group_asid() with
> > > > > > > > set_status()/reset()?
> > > > > > > >
> > > > > > >
> > > > > > > That's also a good one. There is no synchronization if one thread calls
> > > > > > > reset and then the device is set up from another thread. As this
> > > > > > > situation is still hypothetical, because virtio_vdpa does not support
> > > > > > > set_group_asid,
> > > > > >
> > > > > > Right.
> > > > > >
> > > > > > > and the vhost one is already protected by the vhost lock, do
> > > > > > > we need it?
> > > > > >
> > > > > > Let's add a TODO in the code.
> > > > > >
> > > > >
> > > > > Can you expand on this? I meant that there will be no synchronization
> > > > > if we remove the rwlock (or similar). If we keep the rwlock we don't
> > > > > need to add any TODO, or am I missing something?
> > > >
> > > > I meant that even with the rwlock we don't synchronize
> > > > set_group_asid() and set_status(). Since you said vhost is already
> > > > synchronized, I think we should either
> > > >
> > > > 1) document the synchronization that needs to be done in the upper layer or
> > > > 2) add a todo to synchronize the set_status() and set_group_asid()
> > > >
> > >
> > > With the VDUSE messages they are synchronized by dev->msg_lock, so the
> > > vduse module always has a coherent status from both sides: the VDUSE
> > > userland device and virtio_vdpa. If you want vduse not to trust vdpa,
> > > we can go the extra mile and make dev->status atomic for both device
> > > features and group ASID. Would that work?
> >
> > Yes, something like this. For example, msg_lock is only used to
> > synchronize the IPC with userspace; it doesn't synchronize
> > set_status() with set_group_asid().
> >
> > >
> > > > >
> > > > > > >
> > > > > > > > Or if you want to synchronize map ops with set_status() that looks
> > > > > > > > like an independent thing (hardening).
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +       msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > > > > > +       msg.req.vq_group_asid.group = group;
> > > > > > > > > +       msg.req.vq_group_asid.asid = asid;
> > > > > > > > > +
> > > > > > > > > +       r = vduse_dev_msg_sync(dev, &msg);
> > > > > > > > > +       if (r < 0)
> > > > > > > > > +               return r;
> > > > > > > > > +
> > > > > > > > > +       vduse_set_group_asid_nomsg(dev, group, asid);
> > > > > > > >
> > > > > > > > I'm not sure this has been discussed before, but I think it would be
> > > > > > > > better to introduce a new ioctl to get the group -> as mapping. This
> > > > > > > > helps to avoid vduse_dev_msg_sync() as much as possible. And it doesn't
> > > > > > > > require userspace to poll the vduse fd before DRIVER_OK.
> > > > > > > >
> > > > >
> > > > > The userspace VDUSE device must poll the vduse fd to get the DRIVER_OK
> > > > > anyway, so it cannot avoid polling the vduse device. What are the
> > > > > reasons to avoid vduse_dev_msg_sync?
> > > >
> > > > One less place where the kernel has to synchronize with userspace.
> > > >
> > >
> > > But the ioctl alternative is more synchronization from my POV, not
> > > less. Instead of receiving notifications that we could even batch, the
> > > VDUSE userland device needs to issue many synchronous ioctls.
> >
> > A synchronous ioctl seems better than tricks like letting the kernel
> > wait for userspace to respond (with a timeout).
> >
>
> While I don't love the timeout part, I don't see it that way. Adding a
> "waiting for userspace device ioctl" state to the vduse device
> complicates the code in many different places, and it is hard to
> return an error from that point. But it is interesting to have on the
> table, for sure.

Ok, considering we already use vduse_dev_msg_sync() in many places,
I'm fine if you stick to that.

>
> > >
> > > > >
> > > > > > >
> > > > > > > I'm fine with that, but how do we communicate that they have changed?
> > > > > >
> > > > > > Since we forbid changing the group->as mapping after DRIVER_OK,
> > > > > > userspace just needs to use that ioctl once after DRIVER_OK.
> > > > > >
> > > > >
> > > > > But the userland VDUSE device needs to know when the ASIDs are set by
> > > > > the driver and will not change anymore. In this series that is solved
> > > > > by the order of the messages, but now we would need a way to know that
> > > > > point in time. One idea is to issue this new ioctl when it receives
> > > > > the VDUSE_SET_STATUS msg.
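> > > > >
> > > > > Something like this hypothetical uapi, reusing the request struct from
> > > > > this series (the ioctl name and number are placeholders from this
> > > > > thread, nothing merged):
> > > > >
> > > > > struct vduse_vq_group_asid {
> > > > >         __u32 group;
> > > > >         __u32 asid;
> > > > > };
> > > > >
> > > > > /* Device -> kernel: fetch the ASID currently assigned to a vq group */
> > > > > #define VDUSE_GET_GROUP_ASID \
> > > > >         _IOWR(VDUSE_BASE, 0x2f /* placeholder */, struct vduse_vq_group_asid)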
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > If we do it that way there is still a window where the hypothetical
> > > > > (malicious) virtio_vdpa driver can read and write vq groups from
> > > > > different threads.
> > > >
> > > > See above, we need to add synchronization in either vdpa or virtio_vdpa.
> > > >
> > > > > It could issue a set_group_asid after the VDUSE
> > > > > device returns from the ioctl but before dev->status has been
> > > > > updated.
> > > >
> > > > In the case of VDUSE I think neither side should trust the other.
> > > > So userspace needs to be prepared for this, and so does the driver.
> > > > Since the memory access is initiated from userspace, it
> > > > should not perform any memory access before the ioctl that fetches
> > > > the group->asid mapping.
> > > >
> > >
> > > I don't follow this. In the case of virtio_vdpa the ASID groups and
> > > features are handled entirely by the kernel. And in the case of
> > > vhost_vdpa they are serialized by the vhost_dev->mutex lock. So can we
> > > trust them?
> >
> > The problem is that there's no way for VDUSE to know whether or not
> > the upper layer is vhost.
> >
>
> My point is not that VDUSE can trust vhost_vdpa but not
> virtio_vdpa, or the reverse, as if we could set a conditional flag in
> VDUSE and then modify the behavior based on that.
>
> My point is that the code needs to be consistent in how the VDUSE
> kernel module trusts the vdpa driver. It does not make sense to trust
> that it will never issue a state change (like a reset) concurrently
> with a DMA operation [1], but then not trust it to leave the
> feature flags alone.
>
> At this moment VDUSE trusts that the vDPA driver does not launch these
> operations concurrently,

I think at least we should document this somewhere, since right now it
can only be deduced from code review.

> and each vDPA driver has its own way to
> synchronize them. In the case of vhost it is the vhost_dev mutex, and
> virtio_vdpa has other ways. I'll add the hardening here, but this can
> of worms extends through the whole chain. How is it that
> virtio_vdpa trusts virtio_net will never call vdpa_set_features after
> DRIVER_OK? And virtio_net with virtio_ring? Should we harden all of
> that?

That's why I think we can defer this to a future investigation by leaving a TODO.

>
> virtio_net does protect against this, as it is the one interacting with
> userland. Same with vhost_vdpa.
>
> > >
> > > > >
> > > > > I don't think we should protect that, but if we want to do it we
> > > > > should protect that part either by acquiring the rwlock and trusting
> > > > > the vduse_dev_msg_sync timeout, or by using atomics, smp_store_release
> > > > > / smp_load_acquire, read and write barriers...
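> > > > >
> > > > > For example, a minimal sketch of the acquire/release option (just an
> > > > > illustration on top of this series, not a patch):
> > > > >
> > > > > /* vduse_vdpa_set_status(): stores before this cannot be reordered
> > > > >  * past the publication of the new status.
> > > > >  */
> > > > > smp_store_release(&dev->status, status);
> > > > >
> > > > > /* vduse_set_group_asid(): accesses after this (the group->as update)
> > > > >  * cannot be reordered before the load.
> > > > >  */
> > > > > if (smp_load_acquire(&dev->status) & VIRTIO_CONFIG_S_DRIVER_OK)
> > > > >         return -EINVAL;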
> > > > >
> > > > > Note that I still think this is overthinking it. We have the same
> > > > > problem with driver features, where changing bits like the packed vq
> > > > > feature changes the behavior of the vDPA callbacks and could
> > > > > desynchronize the vduse kernel module and the userland device.
> > > >
> > > > Probably, but for driver features, it's too late to do the change.
> > > >
> > >
> > > Why is it too late? We just need to add the same synchronization to
> > > both driver features and ASID groups. The change is not visible to
> > > virtio_vdpa, vhost_vdpa, or the VDUSE userland device at all.
> >
> > Ok, I think I get you. I thought we were talking about reducing the
> > vduse_dev_msg_sync() calls when doing SET_FEATURES. But you meant the
> > synchronization with set_status(). Yes, we can do that.
> >
> > >
> > > > > But since the
> > > > > virtio_vdpa and vduse kernel modules run in the same kernel, they should
> > > > > be able to trust each other.
> > > >
> > > > Better not, usually the lower layer (vdpa/vduse) should not trust the
> > > > upper layer (virtio-vdpa) in this case.
> > > >
> > >
> > > Ok, adding the hardening in the next version!
> > >
> > > > But if you stick to the method with vduse_dev_msg_sync, I'm also fine with that.
> > > >
> > > > >
> > > > > > > Or how to communicate to the driver that the device does not accept
> > > > > > > the assignment of the ASID to the group?
> > > > > >
> > > > > > See above.
> > > > > >
> > > > >
> > > > > I didn't find the answer :(.
> > > > >
> > > > > Let me put an example with QEMU and vhost_vdpa:
> > > > > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group and asid.
> > > > > - vduse cannot send this information to the device as it must wait for
> > > > > the ioctl.
> > > >
> > > > Which ioctl did you mean in this case?
> > > >
> > >
> > > The new ioctl from the userland VDUSE device that you're proposing, to
> > > accept or reject the ASID group assignment. Let's call it
> > > VDUSE_GET_GROUP_ASID, and let me rewrite the flow from that moment:
> > >
> > > - QEMU calls VHOST_VDPA_SET_GROUP_ASID with a valid vq group (in
> > > vdpa->ngroups range) and a valid asid (in vdpa->nas range)
> > > - The vduse kernel module cannot send the information to the VDUSE
> > > userland device, it must wait until the VDUSE userland device calls
> > > the new VDUSE_GET_GROUP_ASID. It will not happen until QEMU calls
> > > VHOST_VDPA_SET_STATUS with DRIVER_OK, but QEMU will not call
> > > VHOST_VDPA_SET_STATUS until the kernel returns from
> > > VHOST_VDPA_SET_GROUP_ASID. The vduse kernel module needs to return
> > > success, without knowing if the device will accept it in the future or
> > > not.
> > > - Now QEMU sends DRIVER_OK, so the vduse kernel module forwards it to
> > > the VDUSE userland device. The VDUSE userland device then calls
> > > VDUSE_GET_GROUP_ASID, and the flow continues.
> > > - The vduse userland device doesn't accept the vq group ASID map for
> > > whatever reason, so it returns an error through another ioctl
> >
> > If the asid is less than the total number of address spaces, is there
> > any reason for userspace to reject such a configuration?
> >
>
> That's a good question, which I can only answer with "allowing userspace
> to communicate an error back here will save us from having to expand the
> communication protocol when we find the reason".
>
> If we continue the logic that userspace (or vDPA devices) cannot fail
> as long as the ASID is in range and the status is !DRIVER_OK, should we
> just move all the checks for a valid ASID range and valid state into the
> vdpa core and make the .set_group_asid operation return void? That would
> be a good change indeed.
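>
> A rough sketch of that hypothetical refactor (names follow the existing
> vdpa core where possible; the void .set_group_asid is the proposed
> change, nothing merged):
>
> int vdpa_set_group_asid(struct vdpa_device *vdev,
>                         unsigned int group, unsigned int asid)
> {
>         /* Centralized validation: drivers would no longer need to fail */
>         if (group >= vdev->ngroups || asid >= vdev->nas)
>                 return -EINVAL;
>         if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_DRIVER_OK)
>                 return -EBUSY;
>
>         vdev->config->set_group_asid(vdev, group, asid); /* now void */
>         return 0;
> }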

I'd try to limit the changeset of this series. I think I'm fine with
keeping the use of vduse_dev_msg_sync().

>
> What should VDUSE do if the device replies VDUSE_REQ_RESULT_FAILED?
> (or, in your case, if the device never knows the assigned ASID as it
> never calls the ioctl).
>
> (Actually, vhost_vdpa already checks for asid < vdpa->nas, so that part
> is duplicated. It will be simpler in the next version, thanks!).
>
> [1] https://patchew.org/linux/20251217112414.2374672-1-eperezma@redhat.com/20251217112414.2374672-8-eperezma@redhat.com/#CACGkMEs6-j8h7X8navZGC0wraan4WSLO3NMFx++Fe0uaExWZmA@mail.gmail.com
>

Thanks


Thread overview: 22+ messages
2025-12-17 11:24 [PATCH v10 0/8] Add multiple address spaces support to VDUSE Eugenio Pérez
2025-12-17 11:24 ` [PATCH v10 1/8] vduse: add v1 API definition Eugenio Pérez
2025-12-17 11:24 ` [PATCH v10 2/8] vduse: add vq group support Eugenio Pérez
2025-12-18  6:46   ` Jason Wang
2025-12-18 10:06     ` Eugenio Perez Martin
2025-12-17 11:24 ` [PATCH v10 3/8] vduse: return internal vq group struct as map token Eugenio Pérez
2025-12-17 11:24 ` [PATCH v10 4/8] vduse: refactor vdpa_dev_add for goto err handling Eugenio Pérez
2025-12-17 11:24 ` [PATCH v10 5/8] vduse: remove unused vaddr parameter of vduse_domain_free_coherent Eugenio Pérez
2025-12-17 11:24 ` [PATCH v10 6/8] vduse: take out allocations from vduse_dev_alloc_coherent Eugenio Pérez
2025-12-18  5:45   ` Jason Wang
2025-12-18  8:40     ` Eugenio Perez Martin
2025-12-17 11:24 ` [PATCH v10 7/8] vduse: add vq group asid support Eugenio Pérez
2025-12-18  6:44   ` Jason Wang
2025-12-18 13:10     ` Eugenio Perez Martin
2025-12-23  1:11       ` Jason Wang
2025-12-23 13:15         ` Eugenio Perez Martin
2025-12-24  0:20           ` Jason Wang
2025-12-24  7:38             ` Eugenio Perez Martin
2025-12-25  2:23               ` Jason Wang
2025-12-26 11:38                 ` Eugenio Perez Martin
2025-12-29  2:56                   ` Jason Wang
2025-12-17 11:24 ` [PATCH v10 8/8] vduse: bump version number Eugenio Pérez
