* [RFC v2 0/7] Add multiple address spaces support to VDUSE
@ 2025-08-07 11:57 Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 1/7] vduse: add v1 API definition Eugenio Pérez
` (6 more replies)
0 siblings, 7 replies; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
When used by vhost-vDPA bus driver for VM, the control virtqueue
should be shadowed via userspace VMM (QEMU) instead of being assigned
directly to Guest. This is because QEMU needs to know the device state
in order to start and stop device correctly (e.g for Live Migration).
This requies to isolate the memory mapping for control virtqueue
presented by vhost-vDPA to prevent guest from accessing it directly.
This series add support to multiple address spaces in VDUSE device
allowing selective virtqueue isolation through address space IDs (ASID).
The VDUSE device needs to report:
* Number of virtqueue groups
* Association of each vq group with each virtqueue
* Number of address spaces supported.
Then, the vDPA driver can modify the ASID assigned to each VQ group to
isolate the memory AS. This aligns VDUSE with gq}vdpa_sim and nvidia
mlx5 devices which already support ASID.
This helps to isolate the environments for the virtqueues that will not
be assigned directly. E.g in the case of virtio-net, the control
virtqueue will not be assigned directly to guest.
This is still a RFC as it depends on the series that reworks the virtio mapping
API:
https://lore.kernel.org/all/20250718091616.6140-1-jasowang@redhat.com/
As some changes has been requested to it, these will need to be applied here
too.
Also, to be able to test this patch, the user needs to manually revert
56e71885b034 ("vduse: Temporarily fail if control queue feature requested").
v2:
* Cache group information in kernel, as we need to provide the vq map
tokens properly.
* Add descs vq group to optimize SVQ forwarding and support indirect
descriptors out of the box.
* Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
part of the struct is the same.
* Fixes detected testing with OVS+VDUSE.
Eugenio Pérez (7):
vduse: add v1 API definition
vduse: add vq group support
vdpa: change get_vq_map_token type to void *(*cb)
vduse: return internal vq group struct as map token
vduse: add vq group asid support
vduse: send update_iotlb_v2 message
vduse: bump version number
drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 389 +++++++++++++++++++++++------
include/linux/vdpa.h | 2 +-
include/uapi/linux/vduse.h | 64 ++++-
4 files changed, 371 insertions(+), 86 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC v2 1/7] vduse: add v1 API definition
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 2/7] vduse: add vq group support Eugenio Pérez
` (5 subsequent siblings)
6 siblings, 0 replies; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
This allows the kernel to detect whether the userspace VDUSE device
supports the VQ group and ASID features. VDUSE devices that don't set
the V1 API will not receive the new messages, and vdpa device will be
created with only one vq group and asid.
The next patches implement the new feature incrementally, only enabling
the VDUSE device to set the V1 API version by the end of the series.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
include/uapi/linux/vduse.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 68a627d04afa..9a56d0416bfe 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -10,6 +10,10 @@
#define VDUSE_API_VERSION 0
+/* VQ groups and ASID support */
+
+#define VDUSE_API_VERSION_1 1
+
/*
* Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
* This is used for future extension.
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v2 2/7] vduse: add vq group support
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 1/7] vduse: add v1 API definition Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
2025-08-08 7:46 ` Maxime Coquelin
2025-08-11 3:09 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 3/7] vdpa: change get_vq_map_token type to void *(*cb) Eugenio Pérez
` (4 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
This allows sepparate the different virtqueues in groups that shares the
same address space. Asking the VDUSE device for the groups of the vq at
the beginning as they're needed for the DMA API.
Allocating 3 vq groups as net is the device that need the most groups:
* Dataplane (guest passthrough)
* CVQ
* Shadowed vrings.
Future versions of the series can include dynamic allocation of the
groups array so VDUSE can declare more groups.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
v2:
* Cache group information in kernel, as we need to provide the vq map
tokens properly.
* Add descs vq group to optimize SVQ forwarding and support indirect
descriptors out of the box.
---
drivers/vdpa/vdpa_user/vduse_dev.c | 71 +++++++++++++++++++++++++++++-
include/uapi/linux/vduse.h | 19 +++++++-
2 files changed, 88 insertions(+), 2 deletions(-)
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index d858c4389cc1..d1f6d00a9c71 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -46,6 +46,11 @@
#define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
#define VDUSE_MSG_DEFAULT_TIMEOUT 30
+/*
+ * Let's make it 3 for simplicity.
+ */
+#define VDUSE_MAX_VQ_GROUPS 3
+
#define IRQ_UNBOUND -1
struct vduse_virtqueue {
@@ -58,6 +63,8 @@ struct vduse_virtqueue {
struct vdpa_vq_state state;
bool ready;
bool kicked;
+ u32 vq_group;
+ u32 vq_desc_group;
spinlock_t kick_lock;
spinlock_t irq_lock;
struct eventfd_ctx *kickfd;
@@ -114,6 +121,7 @@ struct vduse_dev {
u8 status;
u32 vq_num;
u32 vq_align;
+ u32 ngroups;
struct vduse_umem *umem;
struct mutex mem_lock;
unsigned int bounce_size;
@@ -592,6 +600,20 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
return 0;
}
+static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+ return dev->vqs[idx]->vq_group;
+}
+
+static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+ return dev->vqs[idx]->vq_desc_group;
+}
+
static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
struct vdpa_vq_state *state)
{
@@ -678,13 +700,48 @@ static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
return dev->status;
}
+static int vduse_fill_vq_groups(struct vduse_dev *dev)
+{
+ if (dev->api_version < VDUSE_API_VERSION_1)
+ return 0;
+
+ for (int i = 0; i < dev->vdev->vdpa.nvqs; ++i) {
+ struct vduse_dev_msg msg = { 0 };
+ int ret;
+
+ msg.req.type = VDUSE_GET_VQ_GROUP;
+ msg.req.vq_group.index = i;
+ ret = vduse_dev_msg_sync(dev, &msg);
+ if (ret)
+ return ret;
+
+ dev->vqs[i]->vq_group = msg.resp.vq_group.num;
+
+ msg.req.type = VDUSE_GET_VRING_DESC_GROUP;
+ ret = vduse_dev_msg_sync(dev, &msg);
+ if (ret)
+ return ret;
+
+ dev->vqs[i]->vq_desc_group = msg.resp.vq_group.num;
+ }
+
+ return 0;
+}
+
static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
{
struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+ u8 previous_status = dev->status;
if (vduse_dev_set_status(dev, status))
return;
+ if ((dev->status ^ previous_status) &
+ BIT_ULL(VIRTIO_CONFIG_S_FEATURES_OK) &&
+ status & (1ULL << VIRTIO_CONFIG_S_FEATURES_OK))
+ if (vduse_fill_vq_groups(dev))
+ return;
+
dev->status = status;
}
@@ -789,6 +846,8 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
.set_vq_cb = vduse_vdpa_set_vq_cb,
.set_vq_num = vduse_vdpa_set_vq_num,
.get_vq_size = vduse_vdpa_get_vq_size,
+ .get_vq_group = vduse_get_vq_group,
+ .get_vq_desc_group = vduse_get_vq_desc_group,
.set_vq_ready = vduse_vdpa_set_vq_ready,
.get_vq_ready = vduse_vdpa_get_vq_ready,
.set_vq_state = vduse_vdpa_set_vq_state,
@@ -1856,6 +1915,16 @@ static int vduse_create_dev(struct vduse_dev_config *config,
dev->device_features = config->features;
dev->device_id = config->device_id;
dev->vendor_id = config->vendor_id;
+ if (dev->api_version >= 1) {
+ if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
+ pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
+ config->ngroups, VDUSE_MAX_VQ_GROUPS);
+ goto err;
+ }
+ dev->ngroups = config->ngroups ?: 1;
+ } else {
+ dev->ngroups = 1;
+ }
dev->name = kstrdup(config->name, GFP_KERNEL);
if (!dev->name)
goto err_str;
@@ -2016,7 +2085,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
&vduse_vdpa_config_ops, &vduse_map_ops,
- 1, 1, name, true);
+ dev->ngroups, 1, name, true);
if (IS_ERR(vdev))
return PTR_ERR(vdev);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 9a56d0416bfe..b4b139dc76bb 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -31,6 +31,7 @@
* @features: virtio features
* @vq_num: the number of virtqueues
* @vq_align: the allocation alignment of virtqueue's metadata
+ * @ngroups: number of vq groups that VDUSE device declares
* @reserved: for future use, needs to be initialized to zero
* @config_size: the size of the configuration space
* @config: the buffer of the configuration space
@@ -45,7 +46,8 @@ struct vduse_dev_config {
__u64 features;
__u32 vq_num;
__u32 vq_align;
- __u32 reserved[13];
+ __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
+ __u32 reserved[12];
__u32 config_size;
__u8 config[];
};
@@ -160,6 +162,16 @@ struct vduse_vq_state_packed {
__u16 last_used_idx;
};
+/**
+ * struct vduse_vq_group - virtqueue group
+ * @num: Index of the virtqueue group
+ * @num: Group
+ */
+struct vduse_vq_group {
+ __u32 index;
+ __u32 num;
+};
+
/**
* struct vduse_vq_info - information of a virtqueue
* @index: virtqueue index
@@ -182,6 +194,7 @@ struct vduse_vq_info {
union {
struct vduse_vq_state_split split;
struct vduse_vq_state_packed packed;
+ struct vduse_vq_group group;
};
__u8 ready;
};
@@ -274,6 +287,8 @@ enum vduse_req_type {
VDUSE_GET_VQ_STATE,
VDUSE_SET_STATUS,
VDUSE_UPDATE_IOTLB,
+ VDUSE_GET_VQ_GROUP,
+ VDUSE_GET_VRING_DESC_GROUP,
};
/**
@@ -328,6 +343,7 @@ struct vduse_dev_request {
struct vduse_vq_state vq_state;
struct vduse_dev_status s;
struct vduse_iova_range iova;
+ struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
__u32 padding[32];
};
};
@@ -350,6 +366,7 @@ struct vduse_dev_response {
__u32 reserved[4];
union {
struct vduse_vq_state vq_state;
+ struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
__u32 padding[32];
};
};
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v2 3/7] vdpa: change get_vq_map_token type to void *(*cb)
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 1/7] vduse: add v1 API definition Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 2/7] vduse: add vq group support Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 4/7] vduse: return internal vq group struct as map token Eugenio Pérez
` (3 subsequent siblings)
6 siblings, 0 replies; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
Mail list requested a better (safer) return type in the DMA rework
series so this is just temporal workaround to develop this RFC.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
include/linux/vdpa.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 78421fc709c6..ef39841e6e96 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3395,7 +3395,7 @@ static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
return err;
}
-static struct device *mlx5_get_vq_map_token(struct vdpa_device *vdev, u16 idx)
+static void *mlx5_get_vq_map_token(struct vdpa_device *vdev, u16 idx)
{
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index cb51b7e2e569..3846791f6868 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -438,7 +438,7 @@ struct vdpa_config_ops {
int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
unsigned int asid);
- struct device *(*get_vq_map_token)(struct vdpa_device *vdev, u16 idx);
+ void *(*get_vq_map_token)(struct vdpa_device *vdev, u16 idx);
int (*bind_mm)(struct vdpa_device *vdev, struct mm_struct *mm);
void (*unbind_mm)(struct vdpa_device *vdev);
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v2 4/7] vduse: return internal vq group struct as map token
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
` (2 preceding siblings ...)
2025-08-07 11:57 ` [RFC v2 3/7] vdpa: change get_vq_map_token type to void *(*cb) Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
2025-08-11 3:11 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 5/7] vduse: add vq group asid support Eugenio Pérez
` (2 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
Return the internal struct that represents the vq group as virtqueue map
token, instead of the device. This allows the DMA functions to access
the information per group.
At this moment all the virtqueues share the same vq group, that only
can point to ASID 0. This change prepares the infrastructure for actual
per-group address space handling
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
drivers/vdpa/vdpa_user/vduse_dev.c | 44 ++++++++++++++++++++----------
1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index d1f6d00a9c71..a7a2749f5818 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -90,6 +90,10 @@ struct vduse_umem {
struct mm_struct *mm;
};
+struct vduse_vq_group_int {
+ struct vduse_dev *dev;
+};
+
struct vduse_dev {
struct vduse_vdpa *vdev;
struct device *dev;
@@ -123,6 +127,7 @@ struct vduse_dev {
u32 vq_align;
u32 ngroups;
struct vduse_umem *umem;
+ struct vduse_vq_group_int groups[VDUSE_MAX_VQ_GROUPS];
struct mutex mem_lock;
unsigned int bounce_size;
struct mutex domain_lock;
@@ -614,6 +619,14 @@ static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
return dev->vqs[idx]->vq_desc_group;
}
+static void *vduse_get_vq_map_token(struct vdpa_device *vdpa, u16 idx)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+ u32 vq_group = dev->vqs[idx]->vq_group;
+
+ return &dev->groups[vq_group];
+}
+
static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
struct vdpa_vq_state *state)
{
@@ -870,6 +883,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
.get_vq_affinity = vduse_vdpa_get_vq_affinity,
.reset = vduse_vdpa_reset,
.set_map = vduse_vdpa_set_map,
+ .get_vq_map_token = vduse_get_vq_map_token,
.free = vduse_vdpa_free,
};
@@ -877,8 +891,8 @@ static void vduse_dev_sync_single_for_device(void *token,
dma_addr_t dma_addr, size_t size,
enum dma_data_direction dir)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
@@ -888,8 +902,8 @@ static void vduse_dev_sync_single_for_cpu(void *token,
dma_addr_t dma_addr, size_t size,
enum dma_data_direction dir)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
@@ -900,8 +914,8 @@ static dma_addr_t vduse_dev_map_page(void *token, struct page *page,
enum dma_data_direction dir,
unsigned long attrs)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
@@ -911,8 +925,8 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
@@ -921,8 +935,8 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
static void *vduse_dev_alloc_coherent(void *token, size_t size,
dma_addr_t *dma_addr, gfp_t flag)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
unsigned long iova;
void *addr;
@@ -942,8 +956,8 @@ static void vduse_dev_free_coherent(void *token, size_t size,
void *vaddr, dma_addr_t dma_addr,
unsigned long attrs)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
@@ -951,8 +965,8 @@ static void vduse_dev_free_coherent(void *token, size_t size,
static size_t vduse_dev_max_mapping_size(void *token)
{
- struct device *dev = token;
- struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_vq_group_int *group = token;
+ struct vduse_dev *vdev = group->dev;
struct vduse_iova_domain *domain = vdev->domain;
return domain->bounce_size;
@@ -1925,6 +1939,8 @@ static int vduse_create_dev(struct vduse_dev_config *config,
} else {
dev->ngroups = 1;
}
+ for (u32 i = 0; i < dev->ngroups; ++i)
+ dev->groups[i].dev = dev;
dev->name = kstrdup(config->name, GFP_KERNEL);
if (!dev->name)
goto err_str;
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v2 5/7] vduse: add vq group asid support
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
` (3 preceding siblings ...)
2025-08-07 11:57 ` [RFC v2 4/7] vduse: return internal vq group struct as map token Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
2025-08-08 7:45 ` Maxime Coquelin
2025-08-11 3:21 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 6/7] vduse: send update_iotlb_v2 message Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 7/7] vduse: bump version number Eugenio Pérez
6 siblings, 2 replies; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
Add support for assigning Address Space Identifiers (ASIDs) to each VQ
group. This enables mapping each group into a distinct memory space.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
v2: Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
part of the struct is the same.
---
drivers/vdpa/vdpa_user/vduse_dev.c | 259 ++++++++++++++++++++++-------
include/uapi/linux/vduse.h | 36 +++-
2 files changed, 230 insertions(+), 65 deletions(-)
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index a7a2749f5818..145147c49930 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -51,6 +51,12 @@
*/
#define VDUSE_MAX_VQ_GROUPS 3
+/*
+ * We can do ASID thanks to (ab)using the vduse device and the vdpa one. So no
+ * more than 2.
+ */
+#define VDUSE_MAX_ASID 2
+
#define IRQ_UNBOUND -1
struct vduse_virtqueue {
@@ -91,6 +97,7 @@ struct vduse_umem {
};
struct vduse_vq_group_int {
+ struct vduse_iova_domain *domain;
struct vduse_dev *dev;
};
@@ -98,7 +105,7 @@ struct vduse_dev {
struct vduse_vdpa *vdev;
struct device *dev;
struct vduse_virtqueue **vqs;
- struct vduse_iova_domain *domain;
+ struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
char *name;
struct mutex lock;
spinlock_t msg_lock;
@@ -126,7 +133,8 @@ struct vduse_dev {
u32 vq_num;
u32 vq_align;
u32 ngroups;
- struct vduse_umem *umem;
+ u32 nas;
+ struct vduse_umem *umem[VDUSE_MAX_ASID];
struct vduse_vq_group_int groups[VDUSE_MAX_VQ_GROUPS];
struct mutex mem_lock;
unsigned int bounce_size;
@@ -440,14 +448,28 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
return mask;
}
+/* Force set the asid to a vq group without a message to the VDUSE device */
+static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
+ unsigned int group, unsigned int asid)
+{
+ guard(mutex)(&dev->domain_lock);
+ dev->groups[group].domain = dev->domain[asid];
+}
+
static void vduse_dev_reset(struct vduse_dev *dev)
{
int i;
- struct vduse_iova_domain *domain = dev->domain;
/* The coherent mappings are handled in vduse_dev_free_coherent() */
- if (domain && domain->bounce_map)
- vduse_domain_reset_bounce_map(domain);
+ for (i = 0; i < dev->nas; i++) {
+ struct vduse_iova_domain *domain = dev->domain[i];
+
+ if (domain && domain->bounce_map)
+ vduse_domain_reset_bounce_map(domain);
+ }
+
+ for (i = 0; i < dev->ngroups; i++)
+ vduse_set_group_asid_nomsg(dev, i, 0);
down_write(&dev->rwsem);
@@ -619,6 +641,29 @@ static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
return dev->vqs[idx]->vq_desc_group;
}
+static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
+ unsigned int asid)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+ struct vduse_dev_msg msg = { 0 };
+ int r;
+
+ if (dev->api_version < VDUSE_API_VERSION_1 ||
+ group > dev->ngroups || asid > dev->nas)
+ return -EINVAL;
+
+ msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
+ msg.req.vq_group_asid.group = group;
+ msg.req.vq_group_asid.asid = asid;
+
+ r = vduse_dev_msg_sync(dev, &msg);
+ if (r < 0)
+ return r;
+
+ vduse_set_group_asid_nomsg(dev, group, asid);
+ return 0;
+}
+
static void *vduse_get_vq_map_token(struct vdpa_device *vdpa, u16 idx)
{
struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -833,13 +878,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
struct vduse_dev *dev = vdpa_to_vduse(vdpa);
int ret;
- ret = vduse_domain_set_map(dev->domain, iotlb);
+ ret = vduse_domain_set_map(dev->domain[asid], iotlb);
if (ret)
return ret;
ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
if (ret) {
- vduse_domain_clear_map(dev->domain, iotlb);
+ vduse_domain_clear_map(dev->domain[asid], iotlb);
return ret;
}
@@ -883,6 +928,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
.get_vq_affinity = vduse_vdpa_get_vq_affinity,
.reset = vduse_vdpa_reset,
.set_map = vduse_vdpa_set_map,
+ .set_group_asid = vduse_set_group_asid,
.get_vq_map_token = vduse_get_vq_map_token,
.free = vduse_vdpa_free,
};
@@ -893,8 +939,10 @@ static void vduse_dev_sync_single_for_device(void *token,
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
}
@@ -904,8 +952,10 @@ static void vduse_dev_sync_single_for_cpu(void *token,
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
}
@@ -916,8 +966,10 @@ static dma_addr_t vduse_dev_map_page(void *token, struct page *page,
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
}
@@ -927,8 +979,10 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
}
@@ -937,11 +991,13 @@ static void *vduse_dev_alloc_coherent(void *token, size_t size,
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
unsigned long iova;
void *addr;
*dma_addr = DMA_MAPPING_ERROR;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
addr = vduse_domain_alloc_coherent(domain, size,
(dma_addr_t *)&iova, flag);
if (!addr)
@@ -958,8 +1014,10 @@ static void vduse_dev_free_coherent(void *token, size_t size,
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
}
@@ -967,8 +1025,10 @@ static size_t vduse_dev_max_mapping_size(void *token)
{
struct vduse_vq_group_int *group = token;
struct vduse_dev *vdev = group->dev;
- struct vduse_iova_domain *domain = vdev->domain;
+ struct vduse_iova_domain *domain;
+ guard(mutex)(&vdev->domain_lock);
+ domain = group->domain;
return domain->bounce_size;
}
@@ -1107,31 +1167,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
return ret;
}
-static int vduse_dev_dereg_umem(struct vduse_dev *dev,
+static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
u64 iova, u64 size)
{
int ret;
mutex_lock(&dev->mem_lock);
ret = -ENOENT;
- if (!dev->umem)
+ if (!dev->umem[asid])
goto unlock;
ret = -EINVAL;
- if (!dev->domain)
+ if (!dev->domain[asid])
goto unlock;
- if (dev->umem->iova != iova || size != dev->domain->bounce_size)
+ if (dev->umem[asid]->iova != iova ||
+ size != dev->domain[asid]->bounce_size)
goto unlock;
- vduse_domain_remove_user_bounce_pages(dev->domain);
- unpin_user_pages_dirty_lock(dev->umem->pages,
- dev->umem->npages, true);
- atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
- mmdrop(dev->umem->mm);
- vfree(dev->umem->pages);
- kfree(dev->umem);
- dev->umem = NULL;
+ vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
+ unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
+ dev->umem[asid]->npages, true);
+ atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
+ mmdrop(dev->umem[asid]->mm);
+ vfree(dev->umem[asid]->pages);
+ kfree(dev->umem[asid]);
+ dev->umem[asid] = NULL;
ret = 0;
unlock:
mutex_unlock(&dev->mem_lock);
@@ -1139,7 +1200,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
}
static int vduse_dev_reg_umem(struct vduse_dev *dev,
- u64 iova, u64 uaddr, u64 size)
+ u32 asid, u64 iova, u64 uaddr, u64 size)
{
struct page **page_list = NULL;
struct vduse_umem *umem = NULL;
@@ -1147,14 +1208,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
unsigned long npages, lock_limit;
int ret;
- if (!dev->domain || !dev->domain->bounce_map ||
- size != dev->domain->bounce_size ||
+ if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
+ size != dev->domain[asid]->bounce_size ||
iova != 0 || uaddr & ~PAGE_MASK)
return -EINVAL;
mutex_lock(&dev->mem_lock);
ret = -EEXIST;
- if (dev->umem)
+ if (dev->umem[asid])
goto unlock;
ret = -ENOMEM;
@@ -1178,7 +1239,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
goto out;
}
- ret = vduse_domain_add_user_bounce_pages(dev->domain,
+ ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
page_list, pinned);
if (ret)
goto out;
@@ -1191,7 +1252,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
umem->mm = current->mm;
mmgrab(current->mm);
- dev->umem = umem;
+ dev->umem[asid] = umem;
out:
if (ret && pinned > 0)
unpin_user_pages(page_list, pinned);
@@ -1234,26 +1295,43 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
switch (cmd) {
case VDUSE_IOTLB_GET_FD: {
- struct vduse_iotlb_entry entry;
+ struct vduse_iotlb_entry_v2 entry;
+ struct vduse_iotlb_entry entry_old;
struct vhost_iotlb_map *map;
struct vdpa_map_file *map_file;
struct file *f = NULL;
ret = -EFAULT;
- if (copy_from_user(&entry, argp, sizeof(entry)))
- break;
+ if (dev->api_version >= VDUSE_API_VERSION_1) {
+ if (copy_from_user(&entry, argp, sizeof(entry)))
+ break;
+ } else {
+ if (copy_from_user(&entry_old, argp,
+ sizeof(entry_old)))
+ break;
+
+ entry.offset = entry_old.offset;
+ entry.start = entry_old.start;
+ entry.last = entry_old.last;
+ entry.asid = 0;
+ entry.perm = entry_old.perm;
+ }
ret = -EINVAL;
if (entry.start > entry.last)
break;
+ if (entry.asid > dev->nas)
+ break;
+
mutex_lock(&dev->domain_lock);
- if (!dev->domain) {
+ /* TODO accessing an array with idx from userspace, mitigations? */
+ if (!dev->domain[entry.asid]) {
mutex_unlock(&dev->domain_lock);
break;
}
- spin_lock(&dev->domain->iotlb_lock);
- map = vhost_iotlb_itree_first(dev->domain->iotlb,
+ spin_lock(&dev->domain[entry.asid]->iotlb_lock);
+ map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
entry.start, entry.last);
if (map) {
map_file = (struct vdpa_map_file *)map->opaque;
@@ -1263,7 +1341,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
entry.last = map->last;
entry.perm = map->perm;
}
- spin_unlock(&dev->domain->iotlb_lock);
+ spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
mutex_unlock(&dev->domain_lock);
ret = -EINVAL;
if (!f)
@@ -1413,12 +1491,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
break;
ret = -EINVAL;
+ /* TODO: Using asid from userspace, need to mitigate? */
if (!is_mem_zero((const char *)umem.reserved,
- sizeof(umem.reserved)))
+ sizeof(umem.reserved)) ||
+ (dev->api_version < VDUSE_API_VERSION_1 &&
+ umem.asid != 0) ||
+ umem.asid > dev->nas)
break;
mutex_lock(&dev->domain_lock);
- ret = vduse_dev_reg_umem(dev, umem.iova,
+ ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
umem.uaddr, umem.size);
mutex_unlock(&dev->domain_lock);
break;
@@ -1431,15 +1513,24 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
break;
ret = -EINVAL;
+ /* TODO: Using asid from userspace, need to mitigate? */
if (!is_mem_zero((const char *)umem.reserved,
- sizeof(umem.reserved)))
+ sizeof(umem.reserved)) ||
+ (dev->api_version < VDUSE_API_VERSION_1 &&
+ umem.asid != 0) ||
+ umem.asid > dev->nas)
+ break;
+
+
+ if (umem.asid > dev->nas)
break;
mutex_lock(&dev->domain_lock);
- ret = vduse_dev_dereg_umem(dev, umem.iova,
+ ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
umem.size);
mutex_unlock(&dev->domain_lock);
break;
}
+ /* TODO can we merge this with GET_FD? */
case VDUSE_IOTLB_GET_INFO: {
struct vduse_iova_info info;
struct vhost_iotlb_map *map;
@@ -1452,27 +1543,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
if (info.start > info.last)
break;
+ if (info.asid > dev->nas)
+ break;
+
if (!is_mem_zero((const char *)info.reserved,
sizeof(info.reserved)))
break;
mutex_lock(&dev->domain_lock);
- if (!dev->domain) {
+ /* TODO asid comes from userspace. mitigations? */
+ if (!dev->domain[info.asid]) {
mutex_unlock(&dev->domain_lock);
break;
}
- spin_lock(&dev->domain->iotlb_lock);
- map = vhost_iotlb_itree_first(dev->domain->iotlb,
+ spin_lock(&dev->domain[info.asid]->iotlb_lock);
+ map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
info.start, info.last);
if (map) {
info.start = map->start;
info.last = map->last;
info.capability = 0;
- if (dev->domain->bounce_map && map->start == 0 &&
- map->last == dev->domain->bounce_size - 1)
+ if (dev->domain[info.asid]->bounce_map &&
+ map->start == 0 &&
+ map->last == dev->domain[info.asid]->bounce_size - 1)
info.capability |= VDUSE_IOVA_CAP_UMEM;
}
- spin_unlock(&dev->domain->iotlb_lock);
+ spin_unlock(&dev->domain[info.asid]->iotlb_lock);
mutex_unlock(&dev->domain_lock);
if (!map)
break;
@@ -1497,8 +1593,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
struct vduse_dev *dev = file->private_data;
mutex_lock(&dev->domain_lock);
- if (dev->domain)
- vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
+ for (int i = 0; i < dev->nas; i++) {
+ if (dev->domain[i]) {
+ vduse_dev_dereg_umem(dev, i, 0,
+ dev->domain[i]->bounce_size);
+ dev->domain[i] = NULL;
+ }
+ }
mutex_unlock(&dev->domain_lock);
spin_lock(&dev->msg_lock);
/* Make sure the inflight messages can processed after reconncection */
@@ -1768,9 +1869,12 @@ static int vduse_destroy_dev(char *name)
idr_remove(&vduse_idr, dev->minor);
kvfree(dev->config);
vduse_dev_deinit_vqs(dev);
- if (dev->domain)
- vduse_domain_destroy(dev->domain);
+ for (int i = 0; i < dev->nas; i++) {
+ if (dev->domain[i])
+ vduse_domain_destroy(dev->domain[i]);
+ }
kfree(dev->name);
+ kfree(dev->umem);
vduse_dev_destroy(dev);
module_put(THIS_MODULE);
@@ -1877,7 +1981,8 @@ static ssize_t bounce_size_store(struct device *device,
ret = -EPERM;
mutex_lock(&dev->domain_lock);
- if (dev->domain)
+ /* Assuming that if the first domain is allocated, all are allocated */
+ if (dev->domain[0])
goto unlock;
ret = kstrtouint(buf, 10, &bounce_size);
@@ -1933,11 +2038,20 @@ static int vduse_create_dev(struct vduse_dev_config *config,
if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
config->ngroups, VDUSE_MAX_VQ_GROUPS);
+ /* TODO: Need to destroy dev here too! */
goto err;
}
+ if (config->nas > VDUSE_MAX_ASID) {
+ pr_err("Not creating a VDUSE device with %u asid. Max: %u",
+ config->nas, VDUSE_MAX_ASID);
+ goto err_nas;
+ }
+
dev->ngroups = config->ngroups ?: 1;
+ dev->nas = config->nas ?: 1;
} else {
dev->ngroups = 1;
+ dev->nas = 1;
}
for (u32 i = 0; i < dev->ngroups; ++i)
dev->groups[i].dev = dev;
@@ -1977,6 +2091,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
err_idr:
kfree(dev->name);
err_str:
+err_nas:
vduse_dev_destroy(dev);
err:
return ret;
@@ -2069,7 +2184,6 @@ static int vduse_open(struct inode *inode, struct file *file)
if (!control)
return -ENOMEM;
- control->api_version = VDUSE_API_VERSION;
file->private_data = control;
return 0;
@@ -2101,7 +2215,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
&vduse_vdpa_config_ops, &vduse_map_ops,
- dev->ngroups, 1, name, true);
+ dev->ngroups, dev->nas, name, true);
if (IS_ERR(vdev))
return PTR_ERR(vdev);
@@ -2137,11 +2251,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
return ret;
mutex_lock(&dev->domain_lock);
- if (!dev->domain)
- dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
- dev->bounce_size);
+ ret = 0;
+
+ /* TODO we could delay the creation of the domain */
+ for (int i = 0; i < dev->nas; ++i) {
+ if (!dev->domain[i])
+ dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
+ dev->bounce_size);
+ if (!dev->domain[i]) {
+ ret = -ENOMEM;
+ for (int j = 0; j < i; ++j)
+ vduse_domain_destroy(dev->domain[j]);
+ goto err_domain;
+ }
+ }
+
mutex_unlock(&dev->domain_lock);
- if (!dev->domain) {
+ if (ret == -ENOMEM) {
put_device(&dev->vdev->vdpa.dev);
return -ENOMEM;
}
@@ -2150,13 +2276,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
if (ret) {
put_device(&dev->vdev->vdpa.dev);
mutex_lock(&dev->domain_lock);
- vduse_domain_destroy(dev->domain);
- dev->domain = NULL;
+ for (int i = 0; i < dev->nas; i++) {
+ if (dev->domain[i]) {
+ vduse_domain_destroy(dev->domain[i]);
+ dev->domain[i] = NULL;
+ }
+ }
mutex_unlock(&dev->domain_lock);
return ret;
}
return 0;
+
+err_domain:
+ /* TODO do I need to call put_device? */
+ mutex_unlock(&dev->domain_lock);
+ return ret;
}
static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index b4b139dc76bb..d300fd5f867f 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -47,7 +47,8 @@ struct vduse_dev_config {
__u32 vq_num;
__u32 vq_align;
__u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
- __u32 reserved[12];
+ __u32 nas; /* if VDUSE_API_VERSION >= 1 */
+ __u32 reserved[11];
__u32 config_size;
__u8 config[];
};
@@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
__u8 perm;
};
+struct vduse_iotlb_entry_v2 {
+ __u64 offset;
+ __u64 start;
+ __u64 last;
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+ __u8 perm;
+ __u32 asid;
+};
+
/*
* Find the first IOVA region that overlaps with the range [start, last]
* and return the corresponding file descriptor. Return -EINVAL means the
@@ -172,6 +184,16 @@ struct vduse_vq_group {
__u32 num;
};
+/**
+ * struct vduse_vq_group - virtqueue group
+ @ @group: Index of the virtqueue group
+ * @asid: Address space ID of the group
+ */
+struct vduse_vq_group_asid {
+ __u32 group;
+ __u32 asid;
+};
+
/**
* struct vduse_vq_info - information of a virtqueue
* @index: virtqueue index
@@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
* @uaddr: start address of userspace memory, it must be aligned to page size
* @iova: start of the IOVA region
* @size: size of the IOVA region
+ * @asid: Address space ID of the IOVA region
* @reserved: for future use, needs to be initialized to zero
*
* Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
@@ -241,7 +264,8 @@ struct vduse_iova_umem {
__u64 uaddr;
__u64 iova;
__u64 size;
- __u64 reserved[3];
+ __u32 asid;
+ __u32 reserved[5];
};
/* Register userspace memory for IOVA regions */
@@ -265,7 +289,8 @@ struct vduse_iova_info {
__u64 last;
#define VDUSE_IOVA_CAP_UMEM (1 << 0)
__u64 capability;
- __u64 reserved[3];
+ __u32 asid; /* Only if device API version >= 1 */
+ __u32 reserved[5];
};
/*
@@ -289,6 +314,7 @@ enum vduse_req_type {
VDUSE_UPDATE_IOTLB,
VDUSE_GET_VQ_GROUP,
VDUSE_GET_VRING_DESC_GROUP,
+ VDUSE_SET_VQ_GROUP_ASID,
};
/**
@@ -344,6 +370,8 @@ struct vduse_dev_request {
struct vduse_dev_status s;
struct vduse_iova_range iova;
struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
+ /* Only if vduse api version >= 1 */
+ struct vduse_vq_group_asid vq_group_asid;
__u32 padding[32];
};
};
@@ -367,6 +395,8 @@ struct vduse_dev_response {
union {
struct vduse_vq_state vq_state;
struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
+ /* Only if vduse api version >= 1 */
+ struct vduse_vq_group_asid vq_group_asid;
__u32 padding[32];
};
};
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v2 6/7] vduse: send update_iotlb_v2 message
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
` (4 preceding siblings ...)
2025-08-07 11:57 ` [RFC v2 5/7] vduse: add vq group asid support Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
2025-08-11 3:24 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 7/7] vduse: bump version number Eugenio Pérez
6 siblings, 1 reply; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
This message lets the kernel notify userspace VDUSE backends about
updated IOTLB mappings for a specific ASID.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
drivers/vdpa/vdpa_user/vduse_dev.c | 14 ++++++++++----
include/uapi/linux/vduse.h | 7 +++++++
2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 145147c49930..ac7897068222 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -326,7 +326,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
return vduse_dev_msg_sync(dev, &msg);
}
-static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
u64 start, u64 last)
{
struct vduse_dev_msg msg = { 0 };
@@ -335,8 +335,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
return -EINVAL;
msg.req.type = VDUSE_UPDATE_IOTLB;
- msg.req.iova.start = start;
- msg.req.iova.last = last;
+ if (dev->api_version < VDUSE_API_VERSION_1) {
+ msg.req.iova.start = start;
+ msg.req.iova.last = last;
+ } else {
+ msg.req.iova_v2.start = start;
+ msg.req.iova_v2.last = last;
+ msg.req.iova_v2.asid = asid;
+ }
return vduse_dev_msg_sync(dev, &msg);
}
@@ -882,7 +888,7 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
if (ret)
return ret;
- ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+ ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
if (ret) {
vduse_domain_clear_map(dev->domain[asid], iotlb);
return ret;
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index d300fd5f867f..a5f7a5edb8e0 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -349,6 +349,12 @@ struct vduse_iova_range {
__u64 last;
};
+struct vduse_iova_range_v2 {
+ __u64 start;
+ __u64 last;
+ __u32 asid;
+};
+
/**
* struct vduse_dev_request - control request
* @type: request type
@@ -369,6 +375,7 @@ struct vduse_dev_request {
struct vduse_vq_state vq_state;
struct vduse_dev_status s;
struct vduse_iova_range iova;
+ struct vduse_iova_range_v2 iova_v2;
struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
/* Only if vduse api version >= 1 */
struct vduse_vq_group_asid vq_group_asid;
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v2 7/7] vduse: bump version number
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
` (5 preceding siblings ...)
2025-08-07 11:57 ` [RFC v2 6/7] vduse: send update_iotlb_v2 message Eugenio Pérez
@ 2025-08-07 11:57 ` Eugenio Pérez
6 siblings, 0 replies; 24+ messages in thread
From: Eugenio Pérez @ 2025-08-07 11:57 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Cindy Lu, Eugenio Pérez, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
Finalize the series by advertising VDUSE API v1 support to userspace.
Now that all required infrastructure for v1 (ASIDs, VQ groups,
update_iotlb_v2) is in place, VDUSE devices can opt in to the new
features.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index ac7897068222..cbbbaafa79a1 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -2123,7 +2123,7 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
break;
ret = -EINVAL;
- if (api_version > VDUSE_API_VERSION)
+ if (api_version > VDUSE_API_VERSION_1)
break;
ret = 0;
@@ -2190,6 +2190,7 @@ static int vduse_open(struct inode *inode, struct file *file)
if (!control)
return -ENOMEM;
+ control->api_version = VDUSE_API_VERSION_1;
file->private_data = control;
return 0;
--
2.50.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [RFC v2 5/7] vduse: add vq group asid support
2025-08-07 11:57 ` [RFC v2 5/7] vduse: add vq group asid support Eugenio Pérez
@ 2025-08-08 7:45 ` Maxime Coquelin
2025-08-10 9:45 ` Eugenio Perez Martin
2025-08-11 3:21 ` Jason Wang
1 sibling, 1 reply; 24+ messages in thread
From: Maxime Coquelin @ 2025-08-08 7:45 UTC (permalink / raw)
To: Eugenio Pérez, Michael S . Tsirkin
Cc: Cindy Lu, Yongji Xie, Stefano Garzarella, virtualization,
Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
On 8/7/25 1:57 PM, Eugenio Pérez wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index b4b139dc76bb..d300fd5f867f 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -47,7 +47,8 @@ struct vduse_dev_config {
> __u32 vq_num;
> __u32 vq_align;
> __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> - __u32 reserved[12];
> + __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> + __u32 reserved[11];
> __u32 config_size;
> __u8 config[];
> };
> @@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
> __u8 perm;
> };
>
> +struct vduse_iotlb_entry_v2 {
> + __u64 offset;
> + __u64 start;
> + __u64 last;
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> + __u8 perm;
> + __u32 asid;
> +};
> +
> /*
> * Find the first IOVA region that overlaps with the range [start, last]
> * and return the corresponding file descriptor. Return -EINVAL means the
> @@ -172,6 +184,16 @@ struct vduse_vq_group {
> __u32 num;
> };
>
> +/**
> + * struct vduse_vq_group - virtqueue group
> + @ @group: Index of the virtqueue group
> + * @asid: Address space ID of the group
> + */
> +struct vduse_vq_group_asid {
> + __u32 group;
> + __u32 asid;
> +};
> +
> /**
> * struct vduse_vq_info - information of a virtqueue
> * @index: virtqueue index
> @@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
> * @uaddr: start address of userspace memory, it must be aligned to page size
> * @iova: start of the IOVA region
> * @size: size of the IOVA region
> + * @asid: Address space ID of the IOVA region
> * @reserved: for future use, needs to be initialized to zero
> *
> * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> @@ -241,7 +264,8 @@ struct vduse_iova_umem {
> __u64 uaddr;
> __u64 iova;
> __u64 size;
> - __u64 reserved[3];
> + __u32 asid;
> + __u32 reserved[5];
> };
>
> /* Register userspace memory for IOVA regions */
> @@ -265,7 +289,8 @@ struct vduse_iova_info {
> __u64 last;
> #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> __u64 capability;
> - __u64 reserved[3];
> + __u32 asid; /* Only if device API version >= 1 */
> + __u32 reserved[5];
> };
>
> /*
> @@ -289,6 +314,7 @@ enum vduse_req_type {
> VDUSE_UPDATE_IOTLB,
> VDUSE_GET_VQ_GROUP,
> VDUSE_GET_VRING_DESC_GROUP,
> + VDUSE_SET_VQ_GROUP_ASID,
> };
>
> /**
> @@ -344,6 +370,8 @@ struct vduse_dev_request {
> struct vduse_dev_status s;
> struct vduse_iova_range iova;
> struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> + /* Only if vduse api version >= 1 */
> + struct vduse_vq_group_asid vq_group_asid;
> __u32 padding[32];
I think you need to update padding size here.
> };
> };
> @@ -367,6 +395,8 @@ struct vduse_dev_response {
> union {
> struct vduse_vq_state vq_state;
> struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> + /* Only if vduse api version >= 1 */
> + struct vduse_vq_group_asid vq_group_asid;
> __u32 padding[32];
> };
> };
Same
Regards,
Maxime
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 2/7] vduse: add vq group support
2025-08-07 11:57 ` [RFC v2 2/7] vduse: add vq group support Eugenio Pérez
@ 2025-08-08 7:46 ` Maxime Coquelin
2025-08-11 3:09 ` Jason Wang
1 sibling, 0 replies; 24+ messages in thread
From: Maxime Coquelin @ 2025-08-08 7:46 UTC (permalink / raw)
To: Eugenio Pérez, Michael S . Tsirkin
Cc: Cindy Lu, Yongji Xie, Stefano Garzarella, virtualization,
Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
On 8/7/25 1:57 PM, Eugenio Pérez wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 9a56d0416bfe..b4b139dc76bb 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -31,6 +31,7 @@
> * @features: virtio features
> * @vq_num: the number of virtqueues
> * @vq_align: the allocation alignment of virtqueue's metadata
> + * @ngroups: number of vq groups that VDUSE device declares
> * @reserved: for future use, needs to be initialized to zero
> * @config_size: the size of the configuration space
> * @config: the buffer of the configuration space
> @@ -45,7 +46,8 @@ struct vduse_dev_config {
> __u64 features;
> __u32 vq_num;
> __u32 vq_align;
> - __u32 reserved[13];
> + __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> + __u32 reserved[12];
> __u32 config_size;
> __u8 config[];
> };
> @@ -160,6 +162,16 @@ struct vduse_vq_state_packed {
> __u16 last_used_idx;
> };
>
> +/**
> + * struct vduse_vq_group - virtqueue group
> + * @num: Index of the virtqueue group
> + * @num: Group
> + */
> +struct vduse_vq_group {
> + __u32 index;
> + __u32 num;
> +};
> +
> /**
> * struct vduse_vq_info - information of a virtqueue
> * @index: virtqueue index
> @@ -182,6 +194,7 @@ struct vduse_vq_info {
> union {
> struct vduse_vq_state_split split;
> struct vduse_vq_state_packed packed;
> + struct vduse_vq_group group;
> };
> __u8 ready;
> };
> @@ -274,6 +287,8 @@ enum vduse_req_type {
> VDUSE_GET_VQ_STATE,
> VDUSE_SET_STATUS,
> VDUSE_UPDATE_IOTLB,
> + VDUSE_GET_VQ_GROUP,
> + VDUSE_GET_VRING_DESC_GROUP,
> };
>
> /**
> @@ -328,6 +343,7 @@ struct vduse_dev_request {
> struct vduse_vq_state vq_state;
> struct vduse_dev_status s;
> struct vduse_iova_range iova;
> + struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> __u32 padding[32];
> };
> };
> @@ -350,6 +366,7 @@ struct vduse_dev_response {
> __u32 reserved[4];
> union {
> struct vduse_vq_state vq_state;
> + struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> __u32 padding[32];
> };
> };
Same comment as for patch 5, padding should be updated.
Regards,
Maxime
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 5/7] vduse: add vq group asid support
2025-08-08 7:45 ` Maxime Coquelin
@ 2025-08-10 9:45 ` Eugenio Perez Martin
0 siblings, 0 replies; 24+ messages in thread
From: Eugenio Perez Martin @ 2025-08-10 9:45 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo, jasowang,
Maxime Coquelin
On Fri, Aug 8, 2025 at 9:45 AM Maxime Coquelin
<maxime.coquelin@redhat.com> wrote:
>
>
>
> On 8/7/25 1:57 PM, Eugenio Pérez wrote:
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index b4b139dc76bb..d300fd5f867f 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -47,7 +47,8 @@ struct vduse_dev_config {
> > __u32 vq_num;
> > __u32 vq_align;
> > __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > - __u32 reserved[12];
> > + __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > + __u32 reserved[11];
> > __u32 config_size;
> > __u8 config[];
> > };
> > @@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
> > __u8 perm;
> > };
> >
> > +struct vduse_iotlb_entry_v2 {
> > + __u64 offset;
> > + __u64 start;
> > + __u64 last;
> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > + __u8 perm;
> > + __u32 asid;
> > +};
> > +
> > /*
> > * Find the first IOVA region that overlaps with the range [start, last]
> > * and return the corresponding file descriptor. Return -EINVAL means the
> > @@ -172,6 +184,16 @@ struct vduse_vq_group {
> > __u32 num;
> > };
> >
> > +/**
> > + * struct vduse_vq_group - virtqueue group
> > + @ @group: Index of the virtqueue group
> > + * @asid: Address space ID of the group
> > + */
> > +struct vduse_vq_group_asid {
> > + __u32 group;
> > + __u32 asid;
> > +};
> > +
> > /**
> > * struct vduse_vq_info - information of a virtqueue
> > * @index: virtqueue index
> > @@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
> > * @uaddr: start address of userspace memory, it must be aligned to page size
> > * @iova: start of the IOVA region
> > * @size: size of the IOVA region
> > + * @asid: Address space ID of the IOVA region
> > * @reserved: for future use, needs to be initialized to zero
> > *
> > * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > @@ -241,7 +264,8 @@ struct vduse_iova_umem {
> > __u64 uaddr;
> > __u64 iova;
> > __u64 size;
> > - __u64 reserved[3];
> > + __u32 asid;
> > + __u32 reserved[5];
> > };
> >
> > /* Register userspace memory for IOVA regions */
> > @@ -265,7 +289,8 @@ struct vduse_iova_info {
> > __u64 last;
> > #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> > __u64 capability;
> > - __u64 reserved[3];
> > + __u32 asid; /* Only if device API version >= 1 */
> > + __u32 reserved[5];
> > };
> >
> > /*
> > @@ -289,6 +314,7 @@ enum vduse_req_type {
> > VDUSE_UPDATE_IOTLB,
> > VDUSE_GET_VQ_GROUP,
> > VDUSE_GET_VRING_DESC_GROUP,
> > + VDUSE_SET_VQ_GROUP_ASID,
> > };
> >
> > /**
> > @@ -344,6 +370,8 @@ struct vduse_dev_request {
> > struct vduse_dev_status s;
> > struct vduse_iova_range iova;
> > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > + /* Only if vduse api version >= 1 */
> > + struct vduse_vq_group_asid vq_group_asid;
> > __u32 padding[32];
>
> I think you need to update padding size here.
>
Nope, that's an union so u32[32] is the total size. Modifying it would
change uapi :).
> > };
> > };
> > @@ -367,6 +395,8 @@ struct vduse_dev_response {
> > union {
> > struct vduse_vq_state vq_state;
> > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > + /* Only if vduse api version >= 1 */
> > + struct vduse_vq_group_asid vq_group_asid;
> > __u32 padding[32];
> > };
> > };
>
> Same
>
> Regards,
> Maxime
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 2/7] vduse: add vq group support
2025-08-07 11:57 ` [RFC v2 2/7] vduse: add vq group support Eugenio Pérez
2025-08-08 7:46 ` Maxime Coquelin
@ 2025-08-11 3:09 ` Jason Wang
2025-08-11 9:51 ` Eugenio Perez Martin
1 sibling, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-08-11 3:09 UTC (permalink / raw)
To: Eugenio Pérez
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This allows sepparate the different virtqueues in groups that shares the
> same address space. Asking the VDUSE device for the groups of the vq at
> the beginning as they're needed for the DMA API.
>
> Allocating 3 vq groups as net is the device that need the most groups:
> * Dataplane (guest passthrough)
> * CVQ
> * Shadowed vrings.
>
> Future versions of the series can include dynamic allocation of the
> groups array so VDUSE can declare more groups.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> v2:
> * Cache group information in kernel, as we need to provide the vq map
> tokens properly.
> * Add descs vq group to optimize SVQ forwarding and support indirect
> descriptors out of the box.
> ---
> drivers/vdpa/vdpa_user/vduse_dev.c | 71 +++++++++++++++++++++++++++++-
> include/uapi/linux/vduse.h | 19 +++++++-
> 2 files changed, 88 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index d858c4389cc1..d1f6d00a9c71 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -46,6 +46,11 @@
> #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> #define VDUSE_MSG_DEFAULT_TIMEOUT 30
>
> +/*
> + * Let's make it 3 for simplicity.
> + */
> +#define VDUSE_MAX_VQ_GROUPS 3
I think we can release this to something like 64. Otherwise we might
bump the version again just to increase the limitation? Or having a
sysfs entry like bounce_size?
> +
> #define IRQ_UNBOUND -1
>
> struct vduse_virtqueue {
> @@ -58,6 +63,8 @@ struct vduse_virtqueue {
> struct vdpa_vq_state state;
> bool ready;
> bool kicked;
> + u32 vq_group;
> + u32 vq_desc_group;
> spinlock_t kick_lock;
> spinlock_t irq_lock;
> struct eventfd_ctx *kickfd;
> @@ -114,6 +121,7 @@ struct vduse_dev {
> u8 status;
> u32 vq_num;
> u32 vq_align;
> + u32 ngroups;
> struct vduse_umem *umem;
> struct mutex mem_lock;
> unsigned int bounce_size;
> @@ -592,6 +600,20 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> return 0;
> }
>
> +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> +{
> + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> + return dev->vqs[idx]->vq_group;
> +}
> +
> +static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> +{
> + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> + return dev->vqs[idx]->vq_desc_group;
> +}
> +
> static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> struct vdpa_vq_state *state)
> {
> @@ -678,13 +700,48 @@ static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> return dev->status;
> }
>
> +static int vduse_fill_vq_groups(struct vduse_dev *dev)
> +{
> + if (dev->api_version < VDUSE_API_VERSION_1)
> + return 0;
> +
> + for (int i = 0; i < dev->vdev->vdpa.nvqs; ++i) {
> + struct vduse_dev_msg msg = { 0 };
> + int ret;
> +
> + msg.req.type = VDUSE_GET_VQ_GROUP;
> + msg.req.vq_group.index = i;
> + ret = vduse_dev_msg_sync(dev, &msg);
I fail to understand why the default group mapping is not done during
device creation.
> + if (ret)
> + return ret;
> +
> + dev->vqs[i]->vq_group = msg.resp.vq_group.num;
> +
> + msg.req.type = VDUSE_GET_VRING_DESC_GROUP;
> + ret = vduse_dev_msg_sync(dev, &msg);
> + if (ret)
> + return ret;
> +
> + dev->vqs[i]->vq_desc_group = msg.resp.vq_group.num;
> + }
> +
> + return 0;
> +}
> +
> static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> {
> struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> + u8 previous_status = dev->status;
>
> if (vduse_dev_set_status(dev, status))
> return;
>
> + if ((dev->status ^ previous_status) &
> + BIT_ULL(VIRTIO_CONFIG_S_FEATURES_OK) &&
> + status & (1ULL << VIRTIO_CONFIG_S_FEATURES_OK))
> + if (vduse_fill_vq_groups(dev))
Can we merge the two messages into a single one? Or can we use a
shared memory for storing such mapping?
For example, if we have 256 queues it would be very slow.
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 4/7] vduse: return internal vq group struct as map token
2025-08-07 11:57 ` [RFC v2 4/7] vduse: return internal vq group struct as map token Eugenio Pérez
@ 2025-08-11 3:11 ` Jason Wang
2025-08-11 11:03 ` Eugenio Perez Martin
0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-08-11 3:11 UTC (permalink / raw)
To: Eugenio Pérez
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Return the internal struct that represents the vq group as virtqueue map
> token, instead of the device.
Note that Michael prefers to use the iova domain. This indeed seems to
be better.
> This allows the DMA functions to access
s/DMA/map/
> the information per group.
>
> At this moment all the virtqueues share the same vq group, that only
> can point to ASID 0. This change prepares the infrastructure for actual
> per-group address space handling
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 5/7] vduse: add vq group asid support
2025-08-07 11:57 ` [RFC v2 5/7] vduse: add vq group asid support Eugenio Pérez
2025-08-08 7:45 ` Maxime Coquelin
@ 2025-08-11 3:21 ` Jason Wang
2025-08-11 11:19 ` Eugenio Perez Martin
1 sibling, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-08-11 3:21 UTC (permalink / raw)
To: Eugenio Pérez
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> group. This enables mapping each group into a distinct memory space.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> v2: Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> part of the struct is the same.
> ---
> drivers/vdpa/vdpa_user/vduse_dev.c | 259 ++++++++++++++++++++++-------
> include/uapi/linux/vduse.h | 36 +++-
> 2 files changed, 230 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index a7a2749f5818..145147c49930 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -51,6 +51,12 @@
> */
> #define VDUSE_MAX_VQ_GROUPS 3
>
> +/*
> + * We can do ASID thanks to (ab)using the vduse device and the vdpa one. So no
> + * more than 2.
> + */
> +#define VDUSE_MAX_ASID 2
Similar to group limitation, I'd like to either make it larger to 64
or make it changable via sysfs.
> +
> #define IRQ_UNBOUND -1
>
> struct vduse_virtqueue {
> @@ -91,6 +97,7 @@ struct vduse_umem {
> };
>
> struct vduse_vq_group_int {
> + struct vduse_iova_domain *domain;
> struct vduse_dev *dev;
> };
>
> @@ -98,7 +105,7 @@ struct vduse_dev {
> struct vduse_vdpa *vdev;
> struct device *dev;
> struct vduse_virtqueue **vqs;
> - struct vduse_iova_domain *domain;
> + struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> char *name;
> struct mutex lock;
> spinlock_t msg_lock;
> @@ -126,7 +133,8 @@ struct vduse_dev {
> u32 vq_num;
> u32 vq_align;
> u32 ngroups;
> - struct vduse_umem *umem;
> + u32 nas;
> + struct vduse_umem *umem[VDUSE_MAX_ASID];
> struct vduse_vq_group_int groups[VDUSE_MAX_VQ_GROUPS];
> struct mutex mem_lock;
> unsigned int bounce_size;
> @@ -440,14 +448,28 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> return mask;
> }
>
> +/* Force set the asid to a vq group without a message to the VDUSE device */
> +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> + unsigned int group, unsigned int asid)
> +{
> + guard(mutex)(&dev->domain_lock);
> + dev->groups[group].domain = dev->domain[asid];
> +}
> +
> static void vduse_dev_reset(struct vduse_dev *dev)
> {
> int i;
> - struct vduse_iova_domain *domain = dev->domain;
>
> /* The coherent mappings are handled in vduse_dev_free_coherent() */
> - if (domain && domain->bounce_map)
> - vduse_domain_reset_bounce_map(domain);
> + for (i = 0; i < dev->nas; i++) {
> + struct vduse_iova_domain *domain = dev->domain[i];
> +
> + if (domain && domain->bounce_map)
> + vduse_domain_reset_bounce_map(domain);
> + }
> +
> + for (i = 0; i < dev->ngroups; i++)
> + vduse_set_group_asid_nomsg(dev, i, 0);
I vaguely remember Siwei has invented a feature that forbids the reset
of the mapping during virtio reset. If it's true, let's check that
feature flag.
Btw even if we device to reset the mapping, should it be implied via
reset message itself instead of having more operations here?
>
> down_write(&dev->rwsem);
>
> @@ -619,6 +641,29 @@ static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> return dev->vqs[idx]->vq_desc_group;
> }
>
> +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> + unsigned int asid)
> +{
> + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> + struct vduse_dev_msg msg = { 0 };
> + int r;
> +
> + if (dev->api_version < VDUSE_API_VERSION_1 ||
> + group > dev->ngroups || asid > dev->nas)
> + return -EINVAL;
> +
> + msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> + msg.req.vq_group_asid.group = group;
> + msg.req.vq_group_asid.asid = asid;
I wonder if this could be part of the shared memory?
> +
> + r = vduse_dev_msg_sync(dev, &msg);
> + if (r < 0)
> + return r;
> +
> + vduse_set_group_asid_nomsg(dev, group, asid);
> + return 0;
> +}
> +
> static void *vduse_get_vq_map_token(struct vdpa_device *vdpa, u16 idx)
> {
> struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> @@ -833,13 +878,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> int ret;
>
> - ret = vduse_domain_set_map(dev->domain, iotlb);
> + ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> if (ret)
> return ret;
>
> ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> if (ret) {
> - vduse_domain_clear_map(dev->domain, iotlb);
> + vduse_domain_clear_map(dev->domain[asid], iotlb);
> return ret;
> }
>
> @@ -883,6 +928,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> .get_vq_affinity = vduse_vdpa_get_vq_affinity,
> .reset = vduse_vdpa_reset,
> .set_map = vduse_vdpa_set_map,
> + .set_group_asid = vduse_set_group_asid,
> .get_vq_map_token = vduse_get_vq_map_token,
> .free = vduse_vdpa_free,
> };
> @@ -893,8 +939,10 @@ static void vduse_dev_sync_single_for_device(void *token,
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
>
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> }
>
> @@ -904,8 +952,10 @@ static void vduse_dev_sync_single_for_cpu(void *token,
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
>
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> }
>
> @@ -916,8 +966,10 @@ static dma_addr_t vduse_dev_map_page(void *token, struct page *page,
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
>
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> }
>
> @@ -927,8 +979,10 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
>
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> }
>
> @@ -937,11 +991,13 @@ static void *vduse_dev_alloc_coherent(void *token, size_t size,
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
> unsigned long iova;
> void *addr;
>
> *dma_addr = DMA_MAPPING_ERROR;
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> addr = vduse_domain_alloc_coherent(domain, size,
> (dma_addr_t *)&iova, flag);
> if (!addr)
> @@ -958,8 +1014,10 @@ static void vduse_dev_free_coherent(void *token, size_t size,
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
>
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> }
>
> @@ -967,8 +1025,10 @@ static size_t vduse_dev_max_mapping_size(void *token)
> {
> struct vduse_vq_group_int *group = token;
> struct vduse_dev *vdev = group->dev;
> - struct vduse_iova_domain *domain = vdev->domain;
> + struct vduse_iova_domain *domain;
>
> + guard(mutex)(&vdev->domain_lock);
> + domain = group->domain;
> return domain->bounce_size;
> }
>
> @@ -1107,31 +1167,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> return ret;
> }
>
> -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> u64 iova, u64 size)
> {
> int ret;
>
> mutex_lock(&dev->mem_lock);
> ret = -ENOENT;
> - if (!dev->umem)
> + if (!dev->umem[asid])
> goto unlock;
>
> ret = -EINVAL;
> - if (!dev->domain)
> + if (!dev->domain[asid])
> goto unlock;
>
> - if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> + if (dev->umem[asid]->iova != iova ||
> + size != dev->domain[asid]->bounce_size)
> goto unlock;
>
> - vduse_domain_remove_user_bounce_pages(dev->domain);
> - unpin_user_pages_dirty_lock(dev->umem->pages,
> - dev->umem->npages, true);
> - atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> - mmdrop(dev->umem->mm);
> - vfree(dev->umem->pages);
> - kfree(dev->umem);
> - dev->umem = NULL;
> + vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> + unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> + dev->umem[asid]->npages, true);
> + atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> + mmdrop(dev->umem[asid]->mm);
> + vfree(dev->umem[asid]->pages);
> + kfree(dev->umem[asid]);
> + dev->umem[asid] = NULL;
> ret = 0;
> unlock:
> mutex_unlock(&dev->mem_lock);
> @@ -1139,7 +1200,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> }
>
> static int vduse_dev_reg_umem(struct vduse_dev *dev,
> - u64 iova, u64 uaddr, u64 size)
> + u32 asid, u64 iova, u64 uaddr, u64 size)
> {
> struct page **page_list = NULL;
> struct vduse_umem *umem = NULL;
> @@ -1147,14 +1208,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> unsigned long npages, lock_limit;
> int ret;
>
> - if (!dev->domain || !dev->domain->bounce_map ||
> - size != dev->domain->bounce_size ||
> + if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> + size != dev->domain[asid]->bounce_size ||
> iova != 0 || uaddr & ~PAGE_MASK)
> return -EINVAL;
>
> mutex_lock(&dev->mem_lock);
> ret = -EEXIST;
> - if (dev->umem)
> + if (dev->umem[asid])
> goto unlock;
>
> ret = -ENOMEM;
> @@ -1178,7 +1239,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> goto out;
> }
>
> - ret = vduse_domain_add_user_bounce_pages(dev->domain,
> + ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> page_list, pinned);
> if (ret)
> goto out;
> @@ -1191,7 +1252,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> umem->mm = current->mm;
> mmgrab(current->mm);
>
> - dev->umem = umem;
> + dev->umem[asid] = umem;
> out:
> if (ret && pinned > 0)
> unpin_user_pages(page_list, pinned);
> @@ -1234,26 +1295,43 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>
> switch (cmd) {
> case VDUSE_IOTLB_GET_FD: {
> - struct vduse_iotlb_entry entry;
> + struct vduse_iotlb_entry_v2 entry;
> + struct vduse_iotlb_entry entry_old;
> struct vhost_iotlb_map *map;
> struct vdpa_map_file *map_file;
> struct file *f = NULL;
>
> ret = -EFAULT;
> - if (copy_from_user(&entry, argp, sizeof(entry)))
> - break;
> + if (dev->api_version >= VDUSE_API_VERSION_1) {
> + if (copy_from_user(&entry, argp, sizeof(entry)))
> + break;
> + } else {
> + if (copy_from_user(&entry_old, argp,
> + sizeof(entry_old)))
> + break;
> +
> + entry.offset = entry_old.offset;
> + entry.start = entry_old.start;
> + entry.last = entry_old.last;
> + entry.asid = 0;
> + entry.perm = entry_old.perm;
> + }
>
> ret = -EINVAL;
> if (entry.start > entry.last)
> break;
>
> + if (entry.asid > dev->nas)
> + break;
> +
> mutex_lock(&dev->domain_lock);
> - if (!dev->domain) {
> + /* TODO accessing an array with idx from userspace, mitigations? */
> + if (!dev->domain[entry.asid]) {
> mutex_unlock(&dev->domain_lock);
> break;
> }
> - spin_lock(&dev->domain->iotlb_lock);
> - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> + spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> + map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> entry.start, entry.last);
> if (map) {
> map_file = (struct vdpa_map_file *)map->opaque;
> @@ -1263,7 +1341,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> entry.last = map->last;
> entry.perm = map->perm;
> }
> - spin_unlock(&dev->domain->iotlb_lock);
> + spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> mutex_unlock(&dev->domain_lock);
> ret = -EINVAL;
> if (!f)
> @@ -1413,12 +1491,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> break;
>
> ret = -EINVAL;
> + /* TODO: Using asid from userspace, need to mitigate? */
> if (!is_mem_zero((const char *)umem.reserved,
> - sizeof(umem.reserved)))
> + sizeof(umem.reserved)) ||
> + (dev->api_version < VDUSE_API_VERSION_1 &&
> + umem.asid != 0) ||
> + umem.asid > dev->nas)
> break;
>
> mutex_lock(&dev->domain_lock);
> - ret = vduse_dev_reg_umem(dev, umem.iova,
> + ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> umem.uaddr, umem.size);
> mutex_unlock(&dev->domain_lock);
> break;
> @@ -1431,15 +1513,24 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> break;
>
> ret = -EINVAL;
> + /* TODO: Using asid from userspace, need to mitigate? */
> if (!is_mem_zero((const char *)umem.reserved,
> - sizeof(umem.reserved)))
> + sizeof(umem.reserved)) ||
> + (dev->api_version < VDUSE_API_VERSION_1 &&
> + umem.asid != 0) ||
> + umem.asid > dev->nas)
> + break;
> +
> +
> + if (umem.asid > dev->nas)
> break;
> mutex_lock(&dev->domain_lock);
> - ret = vduse_dev_dereg_umem(dev, umem.iova,
> + ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> umem.size);
> mutex_unlock(&dev->domain_lock);
> break;
> }
> + /* TODO can we merge this with GET_FD? */
Probably but might be too late.
> case VDUSE_IOTLB_GET_INFO: {
> struct vduse_iova_info info;
> struct vhost_iotlb_map *map;
> @@ -1452,27 +1543,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> if (info.start > info.last)
> break;
>
> + if (info.asid > dev->nas)
> + break;
> +
> if (!is_mem_zero((const char *)info.reserved,
> sizeof(info.reserved)))
> break;
>
> mutex_lock(&dev->domain_lock);
> - if (!dev->domain) {
> + /* TODO asid comes from userspace. mitigations? */
We should do sanity check to avoid OOB at least?
> + if (!dev->domain[info.asid]) {
> mutex_unlock(&dev->domain_lock);
> break;
> }
> - spin_lock(&dev->domain->iotlb_lock);
> - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> + spin_lock(&dev->domain[info.asid]->iotlb_lock);
> + map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> info.start, info.last);
> if (map) {
> info.start = map->start;
> info.last = map->last;
> info.capability = 0;
> - if (dev->domain->bounce_map && map->start == 0 &&
> - map->last == dev->domain->bounce_size - 1)
> + if (dev->domain[info.asid]->bounce_map &&
> + map->start == 0 &&
> + map->last == dev->domain[info.asid]->bounce_size - 1)
> info.capability |= VDUSE_IOVA_CAP_UMEM;
> }
> - spin_unlock(&dev->domain->iotlb_lock);
> + spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> mutex_unlock(&dev->domain_lock);
> if (!map)
> break;
> @@ -1497,8 +1593,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> struct vduse_dev *dev = file->private_data;
>
> mutex_lock(&dev->domain_lock);
> - if (dev->domain)
> - vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> + for (int i = 0; i < dev->nas; i++) {
> + if (dev->domain[i]) {
> + vduse_dev_dereg_umem(dev, i, 0,
> + dev->domain[i]->bounce_size);
> + dev->domain[i] = NULL;
> + }
> + }
> mutex_unlock(&dev->domain_lock);
> spin_lock(&dev->msg_lock);
> /* Make sure the inflight messages can processed after reconncection */
> @@ -1768,9 +1869,12 @@ static int vduse_destroy_dev(char *name)
> idr_remove(&vduse_idr, dev->minor);
> kvfree(dev->config);
> vduse_dev_deinit_vqs(dev);
> - if (dev->domain)
> - vduse_domain_destroy(dev->domain);
> + for (int i = 0; i < dev->nas; i++) {
> + if (dev->domain[i])
> + vduse_domain_destroy(dev->domain[i]);
> + }
> kfree(dev->name);
> + kfree(dev->umem);
> vduse_dev_destroy(dev);
> module_put(THIS_MODULE);
>
> @@ -1877,7 +1981,8 @@ static ssize_t bounce_size_store(struct device *device,
>
> ret = -EPERM;
> mutex_lock(&dev->domain_lock);
> - if (dev->domain)
> + /* Assuming that if the first domain is allocated, all are allocated */
> + if (dev->domain[0])
> goto unlock;
>
> ret = kstrtouint(buf, 10, &bounce_size);
> @@ -1933,11 +2038,20 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
> pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
> config->ngroups, VDUSE_MAX_VQ_GROUPS);
> + /* TODO: Need to destroy dev here too! */
> goto err;
> }
> + if (config->nas > VDUSE_MAX_ASID) {
> + pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> + config->nas, VDUSE_MAX_ASID);
> + goto err_nas;
> + }
> +
> dev->ngroups = config->ngroups ?: 1;
> + dev->nas = config->nas ?: 1;
> } else {
> dev->ngroups = 1;
> + dev->nas = 1;
> }
> for (u32 i = 0; i < dev->ngroups; ++i)
> dev->groups[i].dev = dev;
> @@ -1977,6 +2091,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> err_idr:
> kfree(dev->name);
> err_str:
> +err_nas:
> vduse_dev_destroy(dev);
> err:
> return ret;
> @@ -2069,7 +2184,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> if (!control)
> return -ENOMEM;
>
> - control->api_version = VDUSE_API_VERSION;
> file->private_data = control;
>
> return 0;
> @@ -2101,7 +2215,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>
> vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> &vduse_vdpa_config_ops, &vduse_map_ops,
> - dev->ngroups, 1, name, true);
> + dev->ngroups, dev->nas, name, true);
> if (IS_ERR(vdev))
> return PTR_ERR(vdev);
>
> @@ -2137,11 +2251,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> return ret;
>
> mutex_lock(&dev->domain_lock);
> - if (!dev->domain)
> - dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> - dev->bounce_size);
> + ret = 0;
> +
> + /* TODO we could delay the creation of the domain */
I don't get what this means?
> + for (int i = 0; i < dev->nas; ++i) {
> + if (!dev->domain[i])
> + dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> + dev->bounce_size);
> + if (!dev->domain[i]) {
> + ret = -ENOMEM;
> + for (int j = 0; j < i; ++j)
> + vduse_domain_destroy(dev->domain[j]);
Why not move the above error handling to the err label?
> + goto err_domain;
> + }
> + }
> +
> mutex_unlock(&dev->domain_lock);
> - if (!dev->domain) {
> + if (ret == -ENOMEM) {
> put_device(&dev->vdev->vdpa.dev);
> return -ENOMEM;
> }
> @@ -2150,13 +2276,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> if (ret) {
> put_device(&dev->vdev->vdpa.dev);
> mutex_lock(&dev->domain_lock);
> - vduse_domain_destroy(dev->domain);
> - dev->domain = NULL;
> + for (int i = 0; i < dev->nas; i++) {
> + if (dev->domain[i]) {
> + vduse_domain_destroy(dev->domain[i]);
> + dev->domain[i] = NULL;
> + }
> + }
> mutex_unlock(&dev->domain_lock);
> return ret;
> }
>
> return 0;
> +
> +err_domain:
> + /* TODO do I need to call put_device? */
If it comes after the success of vduse_dev_init_vdpa().
> + mutex_unlock(&dev->domain_lock);
> + return ret;
> }
>
> static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index b4b139dc76bb..d300fd5f867f 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -47,7 +47,8 @@ struct vduse_dev_config {
> __u32 vq_num;
> __u32 vq_align;
> __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> - __u32 reserved[12];
> + __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> + __u32 reserved[11];
> __u32 config_size;
> __u8 config[];
> };
> @@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
> __u8 perm;
> };
>
> +struct vduse_iotlb_entry_v2 {
> + __u64 offset;
> + __u64 start;
> + __u64 last;
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> + __u8 perm;
We can embed vduse_iotlb_entry here. This will ease the userspace development.
> + __u32 asid;
> +};
> +
> /*
> * Find the first IOVA region that overlaps with the range [start, last]
> * and return the corresponding file descriptor. Return -EINVAL means the
> @@ -172,6 +184,16 @@ struct vduse_vq_group {
> __u32 num;
> };
>
> +/**
> + * struct vduse_vq_group - virtqueue group
> + @ @group: Index of the virtqueue group
> + * @asid: Address space ID of the group
> + */
> +struct vduse_vq_group_asid {
> + __u32 group;
> + __u32 asid;
> +};
> +
> /**
> * struct vduse_vq_info - information of a virtqueue
> * @index: virtqueue index
> @@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
> * @uaddr: start address of userspace memory, it must be aligned to page size
> * @iova: start of the IOVA region
> * @size: size of the IOVA region
> + * @asid: Address space ID of the IOVA region
> * @reserved: for future use, needs to be initialized to zero
> *
> * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> @@ -241,7 +264,8 @@ struct vduse_iova_umem {
> __u64 uaddr;
> __u64 iova;
> __u64 size;
> - __u64 reserved[3];
> + __u32 asid;
> + __u32 reserved[5];
> };
>
> /* Register userspace memory for IOVA regions */
> @@ -265,7 +289,8 @@ struct vduse_iova_info {
> __u64 last;
> #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> __u64 capability;
> - __u64 reserved[3];
> + __u32 asid; /* Only if device API version >= 1 */
> + __u32 reserved[5];
> };
>
> /*
> @@ -289,6 +314,7 @@ enum vduse_req_type {
> VDUSE_UPDATE_IOTLB,
> VDUSE_GET_VQ_GROUP,
> VDUSE_GET_VRING_DESC_GROUP,
> + VDUSE_SET_VQ_GROUP_ASID,
> };
>
> /**
> @@ -344,6 +370,8 @@ struct vduse_dev_request {
> struct vduse_dev_status s;
> struct vduse_iova_range iova;
> struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> + /* Only if vduse api version >= 1 */
We can use:
/* Only if vduse api version >= 1 */
struct vduse_vq_group vq_group;
struct vduse_vq_group_asid vq_group_asid;
> + struct vduse_vq_group_asid vq_group_asid;
> __u32 padding[32];
> };
> };
> @@ -367,6 +395,8 @@ struct vduse_dev_response {
> union {
> struct vduse_vq_state vq_state;
> struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> + /* Only if vduse api version >= 1 */
> + struct vduse_vq_group_asid vq_group_asid;
> __u32 padding[32];
> };
> };
> --
> 2.50.1
>
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 6/7] vduse: send update_iotlb_v2 message
2025-08-07 11:57 ` [RFC v2 6/7] vduse: send update_iotlb_v2 message Eugenio Pérez
@ 2025-08-11 3:24 ` Jason Wang
0 siblings, 0 replies; 24+ messages in thread
From: Jason Wang @ 2025-08-11 3:24 UTC (permalink / raw)
To: Eugenio Pérez
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This message lets the kernel notify userspace VDUSE backends about
> updated IOTLB mappings for a specific ASID.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
I guess this could be squashed to the previous patch for logic completeness.
Thanks
> ---
> drivers/vdpa/vdpa_user/vduse_dev.c | 14 ++++++++++----
> include/uapi/linux/vduse.h | 7 +++++++
> 2 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 145147c49930..ac7897068222 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -326,7 +326,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> return vduse_dev_msg_sync(dev, &msg);
> }
>
> -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> u64 start, u64 last)
> {
> struct vduse_dev_msg msg = { 0 };
> @@ -335,8 +335,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> return -EINVAL;
>
> msg.req.type = VDUSE_UPDATE_IOTLB;
> - msg.req.iova.start = start;
> - msg.req.iova.last = last;
> + if (dev->api_version < VDUSE_API_VERSION_1) {
> + msg.req.iova.start = start;
> + msg.req.iova.last = last;
> + } else {
> + msg.req.iova_v2.start = start;
> + msg.req.iova_v2.last = last;
> + msg.req.iova_v2.asid = asid;
> + }
>
> return vduse_dev_msg_sync(dev, &msg);
> }
> @@ -882,7 +888,7 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> if (ret)
> return ret;
>
> - ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> + ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> if (ret) {
> vduse_domain_clear_map(dev->domain[asid], iotlb);
> return ret;
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index d300fd5f867f..a5f7a5edb8e0 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -349,6 +349,12 @@ struct vduse_iova_range {
> __u64 last;
> };
>
> +struct vduse_iova_range_v2 {
> + __u64 start;
> + __u64 last;
> + __u32 asid;
> +};
> +
> /**
> * struct vduse_dev_request - control request
> * @type: request type
> @@ -369,6 +375,7 @@ struct vduse_dev_request {
> struct vduse_vq_state vq_state;
> struct vduse_dev_status s;
> struct vduse_iova_range iova;
> + struct vduse_iova_range_v2 iova_v2;
> struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> /* Only if vduse api version >= 1 */
> struct vduse_vq_group_asid vq_group_asid;
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 2/7] vduse: add vq group support
2025-08-11 3:09 ` Jason Wang
@ 2025-08-11 9:51 ` Eugenio Perez Martin
2025-08-12 3:01 ` Jason Wang
0 siblings, 1 reply; 24+ messages in thread
From: Eugenio Perez Martin @ 2025-08-11 9:51 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This allows sepparate the different virtqueues in groups that shares the
> > same address space. Asking the VDUSE device for the groups of the vq at
> > the beginning as they're needed for the DMA API.
> >
> > Allocating 3 vq groups as net is the device that need the most groups:
> > * Dataplane (guest passthrough)
> > * CVQ
> > * Shadowed vrings.
> >
> > Future versions of the series can include dynamic allocation of the
> > groups array so VDUSE can declare more groups.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> > v2:
> > * Cache group information in kernel, as we need to provide the vq map
> > tokens properly.
> > * Add descs vq group to optimize SVQ forwarding and support indirect
> > descriptors out of the box.
> > ---
> > drivers/vdpa/vdpa_user/vduse_dev.c | 71 +++++++++++++++++++++++++++++-
> > include/uapi/linux/vduse.h | 19 +++++++-
> > 2 files changed, 88 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index d858c4389cc1..d1f6d00a9c71 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -46,6 +46,11 @@
> > #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> > #define VDUSE_MSG_DEFAULT_TIMEOUT 30
> >
> > +/*
> > + * Let's make it 3 for simplicity.
> > + */
> > +#define VDUSE_MAX_VQ_GROUPS 3
>
> I think we can release this to something like 64. Otherwise we might
> bump the version again just to increase the limitation? Or having a
> sysfs entry like bounce_size?
>
I think it should not be linked to the version, but it is true there
is no way for VDUSE devices to check the maximum VQ groups / ASID that
the kernel supports.
To handle as bounce_size seems the best option, good point. I'll send
a new version with that!
> > +
> > #define IRQ_UNBOUND -1
> >
> > struct vduse_virtqueue {
> > @@ -58,6 +63,8 @@ struct vduse_virtqueue {
> > struct vdpa_vq_state state;
> > bool ready;
> > bool kicked;
> > + u32 vq_group;
> > + u32 vq_desc_group;
> > spinlock_t kick_lock;
> > spinlock_t irq_lock;
> > struct eventfd_ctx *kickfd;
> > @@ -114,6 +121,7 @@ struct vduse_dev {
> > u8 status;
> > u32 vq_num;
> > u32 vq_align;
> > + u32 ngroups;
> > struct vduse_umem *umem;
> > struct mutex mem_lock;
> > unsigned int bounce_size;
> > @@ -592,6 +600,20 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > return 0;
> > }
> >
> > +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > +{
> > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > + return dev->vqs[idx]->vq_group;
> > +}
> > +
> > +static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > +{
> > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > + return dev->vqs[idx]->vq_desc_group;
> > +}
> > +
> > static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > struct vdpa_vq_state *state)
> > {
> > @@ -678,13 +700,48 @@ static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > return dev->status;
> > }
> >
> > +static int vduse_fill_vq_groups(struct vduse_dev *dev)
> > +{
> > + if (dev->api_version < VDUSE_API_VERSION_1)
> > + return 0;
> > +
> > + for (int i = 0; i < dev->vdev->vdpa.nvqs; ++i) {
> > + struct vduse_dev_msg msg = { 0 };
> > + int ret;
> > +
> > + msg.req.type = VDUSE_GET_VQ_GROUP;
> > + msg.req.vq_group.index = i;
> > + ret = vduse_dev_msg_sync(dev, &msg);
>
> I fail to understand why the default group mapping is not done during
> device creation.
>
Because it changes depending on the features.
If a new device has 5 virtqueues and the device wants to isolate the
CVQ, the CVQ position depends on the features that the guest's acks:
* If MQ is acked the isolated vq is #5
* If MQ is not acked the isolated vq is #3.
> > + if (ret)
> > + return ret;
> > +
> > + dev->vqs[i]->vq_group = msg.resp.vq_group.num;
> > +
> > + msg.req.type = VDUSE_GET_VRING_DESC_GROUP;
> > + ret = vduse_dev_msg_sync(dev, &msg);
> > + if (ret)
> > + return ret;
> > +
> > + dev->vqs[i]->vq_desc_group = msg.resp.vq_group.num;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > {
> > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > + u8 previous_status = dev->status;
> >
> > if (vduse_dev_set_status(dev, status))
> > return;
> >
> > + if ((dev->status ^ previous_status) &
> > + BIT_ULL(VIRTIO_CONFIG_S_FEATURES_OK) &&
> > + status & (1ULL << VIRTIO_CONFIG_S_FEATURES_OK))
> > + if (vduse_fill_vq_groups(dev))
>
> Can we merge the two messages into a single one? Or can we use a
> shared memory for storing such mapping?
>
> For example, if we have 256 queues it would be very slow.
>
To merge it in the same message is good to me, sure. To make it a
table in shm seems more complicated, unless we accept a void * in the
reply and VDUSE uses copy_from_user. If that is ok here, then sure.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 4/7] vduse: return internal vq group struct as map token
2025-08-11 3:11 ` Jason Wang
@ 2025-08-11 11:03 ` Eugenio Perez Martin
2025-08-12 3:02 ` Jason Wang
0 siblings, 1 reply; 24+ messages in thread
From: Eugenio Perez Martin @ 2025-08-11 11:03 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 5:11 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Return the internal struct that represents the vq group as virtqueue map
> > token, instead of the device.
>
> Note that Michael prefers to use the iova domain. This indeed seems to
> be better.
>
Well iova domain would delete an indirection in the pointer chase, but
it would be problematic to store the token in the caller.
And we need to add some way to protect that the ASID of a vq group is
not changed in the middle of the operation by an ioctl. IOW, the
vq_group_internal struct pointer is constant for all the lifetime of
the device, while iova_domain is not.
> > This allows the DMA functions to access
>
> s/DMA/map/
>
Ouch, thanks for the catch!
> > the information per group.
> >
> > At this moment all the virtqueues share the same vq group, that only
> > can point to ASID 0. This change prepares the infrastructure for actual
> > per-group address space handling
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> Thanks
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 5/7] vduse: add vq group asid support
2025-08-11 3:21 ` Jason Wang
@ 2025-08-11 11:19 ` Eugenio Perez Martin
2025-08-12 3:05 ` Jason Wang
2025-08-13 11:02 ` Eugenio Perez Martin
0 siblings, 2 replies; 24+ messages in thread
From: Eugenio Perez Martin @ 2025-08-11 11:19 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 5:21 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > group. This enables mapping each group into a distinct memory space.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> > v2: Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > part of the struct is the same.
> > ---
> > drivers/vdpa/vdpa_user/vduse_dev.c | 259 ++++++++++++++++++++++-------
> > include/uapi/linux/vduse.h | 36 +++-
> > 2 files changed, 230 insertions(+), 65 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index a7a2749f5818..145147c49930 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -51,6 +51,12 @@
> > */
> > #define VDUSE_MAX_VQ_GROUPS 3
> >
> > +/*
> > + * We can do ASID thanks to (ab)using the vduse device and the vdpa one. So no
> > + * more than 2.
> > + */
> > +#define VDUSE_MAX_ASID 2
>
> Similar to group limitation, I'd like to either make it larger to 64
> or make it changable via sysfs.
>
> > +
> > #define IRQ_UNBOUND -1
> >
> > struct vduse_virtqueue {
> > @@ -91,6 +97,7 @@ struct vduse_umem {
> > };
> >
> > struct vduse_vq_group_int {
> > + struct vduse_iova_domain *domain;
> > struct vduse_dev *dev;
> > };
> >
> > @@ -98,7 +105,7 @@ struct vduse_dev {
> > struct vduse_vdpa *vdev;
> > struct device *dev;
> > struct vduse_virtqueue **vqs;
> > - struct vduse_iova_domain *domain;
> > + struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> > char *name;
> > struct mutex lock;
> > spinlock_t msg_lock;
> > @@ -126,7 +133,8 @@ struct vduse_dev {
> > u32 vq_num;
> > u32 vq_align;
> > u32 ngroups;
> > - struct vduse_umem *umem;
> > + u32 nas;
> > + struct vduse_umem *umem[VDUSE_MAX_ASID];
> > struct vduse_vq_group_int groups[VDUSE_MAX_VQ_GROUPS];
> > struct mutex mem_lock;
> > unsigned int bounce_size;
> > @@ -440,14 +448,28 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > return mask;
> > }
> >
> > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > + unsigned int group, unsigned int asid)
> > +{
> > + guard(mutex)(&dev->domain_lock);
> > + dev->groups[group].domain = dev->domain[asid];
> > +}
> > +
> > static void vduse_dev_reset(struct vduse_dev *dev)
> > {
> > int i;
> > - struct vduse_iova_domain *domain = dev->domain;
> >
> > /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > - if (domain && domain->bounce_map)
> > - vduse_domain_reset_bounce_map(domain);
> > + for (i = 0; i < dev->nas; i++) {
> > + struct vduse_iova_domain *domain = dev->domain[i];
> > +
> > + if (domain && domain->bounce_map)
> > + vduse_domain_reset_bounce_map(domain);
> > + }
> > +
> > + for (i = 0; i < dev->ngroups; i++)
> > + vduse_set_group_asid_nomsg(dev, i, 0);
>
> I vaguely remember Siwei has invented a feature that forbids the reset
> of the mapping during virtio reset. If it's true, let's check that
> feature flag.
>
Ouch, very good catch!
> Btw even if we device to reset the mapping, should it be implied via
> reset message itself instead of having more operations here?
>
This was resetting all to valid values by the start of the device
operation, so no need for a new message at least for now.
> >
> > down_write(&dev->rwsem);
> >
> > @@ -619,6 +641,29 @@ static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > return dev->vqs[idx]->vq_desc_group;
> > }
> >
> > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > + unsigned int asid)
> > +{
> > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > + struct vduse_dev_msg msg = { 0 };
> > + int r;
> > +
> > + if (dev->api_version < VDUSE_API_VERSION_1 ||
> > + group > dev->ngroups || asid > dev->nas)
> > + return -EINVAL;
> > +
> > + msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > + msg.req.vq_group_asid.group = group;
> > + msg.req.vq_group_asid.asid = asid;
>
> I wonder if this could be part of the shared memory?
>
I think it complicates the solution a lot, isn't it? How to notify the
VDUSE device that these properties have changed, what vq group
changed, etc? What do we want to achieve with the shm solution?
> > +
> > + r = vduse_dev_msg_sync(dev, &msg);
> > + if (r < 0)
> > + return r;
> > +
> > + vduse_set_group_asid_nomsg(dev, group, asid);
> > + return 0;
> > +}
> > +
> > static void *vduse_get_vq_map_token(struct vdpa_device *vdpa, u16 idx)
> > {
> > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > @@ -833,13 +878,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > int ret;
> >
> > - ret = vduse_domain_set_map(dev->domain, iotlb);
> > + ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> > if (ret)
> > return ret;
> >
> > ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > if (ret) {
> > - vduse_domain_clear_map(dev->domain, iotlb);
> > + vduse_domain_clear_map(dev->domain[asid], iotlb);
> > return ret;
> > }
> >
> > @@ -883,6 +928,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > .get_vq_affinity = vduse_vdpa_get_vq_affinity,
> > .reset = vduse_vdpa_reset,
> > .set_map = vduse_vdpa_set_map,
> > + .set_group_asid = vduse_set_group_asid,
> > .get_vq_map_token = vduse_get_vq_map_token,
> > .free = vduse_vdpa_free,
> > };
> > @@ -893,8 +939,10 @@ static void vduse_dev_sync_single_for_device(void *token,
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> >
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > }
> >
> > @@ -904,8 +952,10 @@ static void vduse_dev_sync_single_for_cpu(void *token,
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> >
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > }
> >
> > @@ -916,8 +966,10 @@ static dma_addr_t vduse_dev_map_page(void *token, struct page *page,
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> >
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > }
> >
> > @@ -927,8 +979,10 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> >
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > }
> >
> > @@ -937,11 +991,13 @@ static void *vduse_dev_alloc_coherent(void *token, size_t size,
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> > unsigned long iova;
> > void *addr;
> >
> > *dma_addr = DMA_MAPPING_ERROR;
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > addr = vduse_domain_alloc_coherent(domain, size,
> > (dma_addr_t *)&iova, flag);
> > if (!addr)
> > @@ -958,8 +1014,10 @@ static void vduse_dev_free_coherent(void *token, size_t size,
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> >
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > }
> >
> > @@ -967,8 +1025,10 @@ static size_t vduse_dev_max_mapping_size(void *token)
> > {
> > struct vduse_vq_group_int *group = token;
> > struct vduse_dev *vdev = group->dev;
> > - struct vduse_iova_domain *domain = vdev->domain;
> > + struct vduse_iova_domain *domain;
> >
> > + guard(mutex)(&vdev->domain_lock);
> > + domain = group->domain;
> > return domain->bounce_size;
> > }
> >
> > @@ -1107,31 +1167,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > return ret;
> > }
> >
> > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > u64 iova, u64 size)
> > {
> > int ret;
> >
> > mutex_lock(&dev->mem_lock);
> > ret = -ENOENT;
> > - if (!dev->umem)
> > + if (!dev->umem[asid])
> > goto unlock;
> >
> > ret = -EINVAL;
> > - if (!dev->domain)
> > + if (!dev->domain[asid])
> > goto unlock;
> >
> > - if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > + if (dev->umem[asid]->iova != iova ||
> > + size != dev->domain[asid]->bounce_size)
> > goto unlock;
> >
> > - vduse_domain_remove_user_bounce_pages(dev->domain);
> > - unpin_user_pages_dirty_lock(dev->umem->pages,
> > - dev->umem->npages, true);
> > - atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > - mmdrop(dev->umem->mm);
> > - vfree(dev->umem->pages);
> > - kfree(dev->umem);
> > - dev->umem = NULL;
> > + vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> > + unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> > + dev->umem[asid]->npages, true);
> > + atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> > + mmdrop(dev->umem[asid]->mm);
> > + vfree(dev->umem[asid]->pages);
> > + kfree(dev->umem[asid]);
> > + dev->umem[asid] = NULL;
> > ret = 0;
> > unlock:
> > mutex_unlock(&dev->mem_lock);
> > @@ -1139,7 +1200,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > }
> >
> > static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > - u64 iova, u64 uaddr, u64 size)
> > + u32 asid, u64 iova, u64 uaddr, u64 size)
> > {
> > struct page **page_list = NULL;
> > struct vduse_umem *umem = NULL;
> > @@ -1147,14 +1208,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > unsigned long npages, lock_limit;
> > int ret;
> >
> > - if (!dev->domain || !dev->domain->bounce_map ||
> > - size != dev->domain->bounce_size ||
> > + if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> > + size != dev->domain[asid]->bounce_size ||
> > iova != 0 || uaddr & ~PAGE_MASK)
> > return -EINVAL;
> >
> > mutex_lock(&dev->mem_lock);
> > ret = -EEXIST;
> > - if (dev->umem)
> > + if (dev->umem[asid])
> > goto unlock;
> >
> > ret = -ENOMEM;
> > @@ -1178,7 +1239,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > goto out;
> > }
> >
> > - ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > + ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> > page_list, pinned);
> > if (ret)
> > goto out;
> > @@ -1191,7 +1252,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > umem->mm = current->mm;
> > mmgrab(current->mm);
> >
> > - dev->umem = umem;
> > + dev->umem[asid] = umem;
> > out:
> > if (ret && pinned > 0)
> > unpin_user_pages(page_list, pinned);
> > @@ -1234,26 +1295,43 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >
> > switch (cmd) {
> > case VDUSE_IOTLB_GET_FD: {
> > - struct vduse_iotlb_entry entry;
> > + struct vduse_iotlb_entry_v2 entry;
> > + struct vduse_iotlb_entry entry_old;
> > struct vhost_iotlb_map *map;
> > struct vdpa_map_file *map_file;
> > struct file *f = NULL;
> >
> > ret = -EFAULT;
> > - if (copy_from_user(&entry, argp, sizeof(entry)))
> > - break;
> > + if (dev->api_version >= VDUSE_API_VERSION_1) {
> > + if (copy_from_user(&entry, argp, sizeof(entry)))
> > + break;
> > + } else {
> > + if (copy_from_user(&entry_old, argp,
> > + sizeof(entry_old)))
> > + break;
> > +
> > + entry.offset = entry_old.offset;
> > + entry.start = entry_old.start;
> > + entry.last = entry_old.last;
> > + entry.asid = 0;
> > + entry.perm = entry_old.perm;
> > + }
> >
> > ret = -EINVAL;
> > if (entry.start > entry.last)
> > break;
> >
> > + if (entry.asid > dev->nas)
> > + break;
> > +
> > mutex_lock(&dev->domain_lock);
> > - if (!dev->domain) {
> > + /* TODO accessing an array with idx from userspace, mitigations? */
> > + if (!dev->domain[entry.asid]) {
> > mutex_unlock(&dev->domain_lock);
> > break;
> > }
> > - spin_lock(&dev->domain->iotlb_lock);
> > - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > + spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> > + map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> > entry.start, entry.last);
> > if (map) {
> > map_file = (struct vdpa_map_file *)map->opaque;
> > @@ -1263,7 +1341,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > entry.last = map->last;
> > entry.perm = map->perm;
> > }
> > - spin_unlock(&dev->domain->iotlb_lock);
> > + spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> > mutex_unlock(&dev->domain_lock);
> > ret = -EINVAL;
> > if (!f)
> > @@ -1413,12 +1491,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > break;
> >
> > ret = -EINVAL;
> > + /* TODO: Using asid from userspace, need to mitigate? */
> > if (!is_mem_zero((const char *)umem.reserved,
> > - sizeof(umem.reserved)))
> > + sizeof(umem.reserved)) ||
> > + (dev->api_version < VDUSE_API_VERSION_1 &&
> > + umem.asid != 0) ||
> > + umem.asid > dev->nas)
> > break;
> >
> > mutex_lock(&dev->domain_lock);
> > - ret = vduse_dev_reg_umem(dev, umem.iova,
> > + ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> > umem.uaddr, umem.size);
> > mutex_unlock(&dev->domain_lock);
> > break;
> > @@ -1431,15 +1513,24 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > break;
> >
> > ret = -EINVAL;
> > + /* TODO: Using asid from userspace, need to mitigate? */
> > if (!is_mem_zero((const char *)umem.reserved,
> > - sizeof(umem.reserved)))
> > + sizeof(umem.reserved)) ||
> > + (dev->api_version < VDUSE_API_VERSION_1 &&
> > + umem.asid != 0) ||
> > + umem.asid > dev->nas)
> > + break;
> > +
> > +
> > + if (umem.asid > dev->nas)
> > break;
> > mutex_lock(&dev->domain_lock);
> > - ret = vduse_dev_dereg_umem(dev, umem.iova,
> > + ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> > umem.size);
> > mutex_unlock(&dev->domain_lock);
> > break;
> > }
> > + /* TODO can we merge this with GET_FD? */
>
> Probably but might be too late.
>
Yep I should just remove this TODO actually.
> > case VDUSE_IOTLB_GET_INFO: {
> > struct vduse_iova_info info;
> > struct vhost_iotlb_map *map;
> > @@ -1452,27 +1543,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > if (info.start > info.last)
> > break;
> >
> > + if (info.asid > dev->nas)
> > + break;
> > +
> > if (!is_mem_zero((const char *)info.reserved,
> > sizeof(info.reserved)))
> > break;
> >
> > mutex_lock(&dev->domain_lock);
> > - if (!dev->domain) {
> > + /* TODO asid comes from userspace. mitigations? */
>
> We should do sanity check to avoid OOB at least?
>
There is an if (info.asid > dev->nas) break above, what is missing?
> > + if (!dev->domain[info.asid]) {
> > mutex_unlock(&dev->domain_lock);
> > break;
> > }
> > - spin_lock(&dev->domain->iotlb_lock);
> > - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > + spin_lock(&dev->domain[info.asid]->iotlb_lock);
> > + map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> > info.start, info.last);
> > if (map) {
> > info.start = map->start;
> > info.last = map->last;
> > info.capability = 0;
> > - if (dev->domain->bounce_map && map->start == 0 &&
> > - map->last == dev->domain->bounce_size - 1)
> > + if (dev->domain[info.asid]->bounce_map &&
> > + map->start == 0 &&
> > + map->last == dev->domain[info.asid]->bounce_size - 1)
> > info.capability |= VDUSE_IOVA_CAP_UMEM;
> > }
> > - spin_unlock(&dev->domain->iotlb_lock);
> > + spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> > mutex_unlock(&dev->domain_lock);
> > if (!map)
> > break;
> > @@ -1497,8 +1593,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > struct vduse_dev *dev = file->private_data;
> >
> > mutex_lock(&dev->domain_lock);
> > - if (dev->domain)
> > - vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > + for (int i = 0; i < dev->nas; i++) {
> > + if (dev->domain[i]) {
> > + vduse_dev_dereg_umem(dev, i, 0,
> > + dev->domain[i]->bounce_size);
> > + dev->domain[i] = NULL;
> > + }
> > + }
> > mutex_unlock(&dev->domain_lock);
> > spin_lock(&dev->msg_lock);
> > /* Make sure the inflight messages can processed after reconncection */
> > @@ -1768,9 +1869,12 @@ static int vduse_destroy_dev(char *name)
> > idr_remove(&vduse_idr, dev->minor);
> > kvfree(dev->config);
> > vduse_dev_deinit_vqs(dev);
> > - if (dev->domain)
> > - vduse_domain_destroy(dev->domain);
> > + for (int i = 0; i < dev->nas; i++) {
> > + if (dev->domain[i])
> > + vduse_domain_destroy(dev->domain[i]);
> > + }
> > kfree(dev->name);
> > + kfree(dev->umem);
> > vduse_dev_destroy(dev);
> > module_put(THIS_MODULE);
> >
> > @@ -1877,7 +1981,8 @@ static ssize_t bounce_size_store(struct device *device,
> >
> > ret = -EPERM;
> > mutex_lock(&dev->domain_lock);
> > - if (dev->domain)
> > + /* Assuming that if the first domain is allocated, all are allocated */
> > + if (dev->domain[0])
> > goto unlock;
> >
> > ret = kstrtouint(buf, 10, &bounce_size);
> > @@ -1933,11 +2038,20 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
> > pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
> > config->ngroups, VDUSE_MAX_VQ_GROUPS);
> > + /* TODO: Need to destroy dev here too! */
> > goto err;
> > }
> > + if (config->nas > VDUSE_MAX_ASID) {
> > + pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> > + config->nas, VDUSE_MAX_ASID);
> > + goto err_nas;
> > + }
> > +
> > dev->ngroups = config->ngroups ?: 1;
> > + dev->nas = config->nas ?: 1;
> > } else {
> > dev->ngroups = 1;
> > + dev->nas = 1;
> > }
> > for (u32 i = 0; i < dev->ngroups; ++i)
> > dev->groups[i].dev = dev;
> > @@ -1977,6 +2091,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > err_idr:
> > kfree(dev->name);
> > err_str:
> > +err_nas:
> > vduse_dev_destroy(dev);
> > err:
> > return ret;
> > @@ -2069,7 +2184,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> > if (!control)
> > return -ENOMEM;
> >
> > - control->api_version = VDUSE_API_VERSION;
> > file->private_data = control;
> >
> > return 0;
> > @@ -2101,7 +2215,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> >
> > vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > &vduse_vdpa_config_ops, &vduse_map_ops,
> > - dev->ngroups, 1, name, true);
> > + dev->ngroups, dev->nas, name, true);
> > if (IS_ERR(vdev))
> > return PTR_ERR(vdev);
> >
> > @@ -2137,11 +2251,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > return ret;
> >
> > mutex_lock(&dev->domain_lock);
> > - if (!dev->domain)
> > - dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > - dev->bounce_size);
> > + ret = 0;
> > +
> > + /* TODO we could delay the creation of the domain */
>
> I don't get what this means?
>
If the device support all 64 asid but it never uses them we're
consuming the memory for all the struct vduse_iova_domain, struct
vduse_umem, etc. We could delay its creating until we start using it,
but it complicates the code.
> > + for (int i = 0; i < dev->nas; ++i) {
> > + if (!dev->domain[i])
> > + dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > + dev->bounce_size);
> > + if (!dev->domain[i]) {
> > + ret = -ENOMEM;
> > + for (int j = 0; j < i; ++j)
> > + vduse_domain_destroy(dev->domain[j]);
>
> Why not move the above error handling to the err label?
>
That implies I need to signal the index of successfully created
domains. We only need to handle that in this part of the code. To move
it is not impossible of course but it complicates the code IMO. I'd be
happy to move it if requested though.
> > + goto err_domain;
> > + }
> > + }
> > +
> > mutex_unlock(&dev->domain_lock);
> > - if (!dev->domain) {
> > + if (ret == -ENOMEM) {
> > put_device(&dev->vdev->vdpa.dev);
> > return -ENOMEM;
> > }
> > @@ -2150,13 +2276,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > if (ret) {
> > put_device(&dev->vdev->vdpa.dev);
> > mutex_lock(&dev->domain_lock);
> > - vduse_domain_destroy(dev->domain);
> > - dev->domain = NULL;
> > + for (int i = 0; i < dev->nas; i++) {
> > + if (dev->domain[i]) {
> > + vduse_domain_destroy(dev->domain[i]);
> > + dev->domain[i] = NULL;
> > + }
> > + }
> > mutex_unlock(&dev->domain_lock);
> > return ret;
> > }
> >
> > return 0;
> > +
> > +err_domain:
> > + /* TODO do I need to call put_device? */
>
> If it comes after the success of vduse_dev_init_vdpa().
>
Ouch, that was a TODO for me but I forgot to remove it, sorry :).
> > + mutex_unlock(&dev->domain_lock);
> > + return ret;
> > }
> >
> > static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index b4b139dc76bb..d300fd5f867f 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -47,7 +47,8 @@ struct vduse_dev_config {
> > __u32 vq_num;
> > __u32 vq_align;
> > __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > - __u32 reserved[12];
> > + __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > + __u32 reserved[11];
> > __u32 config_size;
> > __u8 config[];
> > };
> > @@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
> > __u8 perm;
> > };
> >
> > +struct vduse_iotlb_entry_v2 {
> > + __u64 offset;
> > + __u64 start;
> > + __u64 last;
> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > + __u8 perm;
>
> We can embed vduse_iotlb_entry here. This will ease the userspace development.
>
I agree, I'll do it that way for the next version.
> > + __u32 asid;
> > +};
> > +
> > /*
> > * Find the first IOVA region that overlaps with the range [start, last]
> > * and return the corresponding file descriptor. Return -EINVAL means the
> > @@ -172,6 +184,16 @@ struct vduse_vq_group {
> > __u32 num;
> > };
> >
> > +/**
> > + * struct vduse_vq_group - virtqueue group
> > + @ @group: Index of the virtqueue group
> > + * @asid: Address space ID of the group
> > + */
> > +struct vduse_vq_group_asid {
> > + __u32 group;
> > + __u32 asid;
> > +};
> > +
> > /**
> > * struct vduse_vq_info - information of a virtqueue
> > * @index: virtqueue index
> > @@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
> > * @uaddr: start address of userspace memory, it must be aligned to page size
> > * @iova: start of the IOVA region
> > * @size: size of the IOVA region
> > + * @asid: Address space ID of the IOVA region
> > * @reserved: for future use, needs to be initialized to zero
> > *
> > * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > @@ -241,7 +264,8 @@ struct vduse_iova_umem {
> > __u64 uaddr;
> > __u64 iova;
> > __u64 size;
> > - __u64 reserved[3];
> > + __u32 asid;
> > + __u32 reserved[5];
> > };
> >
> > /* Register userspace memory for IOVA regions */
> > @@ -265,7 +289,8 @@ struct vduse_iova_info {
> > __u64 last;
> > #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> > __u64 capability;
> > - __u64 reserved[3];
> > + __u32 asid; /* Only if device API version >= 1 */
> > + __u32 reserved[5];
> > };
> >
> > /*
> > @@ -289,6 +314,7 @@ enum vduse_req_type {
> > VDUSE_UPDATE_IOTLB,
> > VDUSE_GET_VQ_GROUP,
> > VDUSE_GET_VRING_DESC_GROUP,
> > + VDUSE_SET_VQ_GROUP_ASID,
> > };
> >
> > /**
> > @@ -344,6 +370,8 @@ struct vduse_dev_request {
> > struct vduse_dev_status s;
> > struct vduse_iova_range iova;
> > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > + /* Only if vduse api version >= 1 */
>
> We can use:
>
> /* Only if vduse api version >= 1 */
> struct vduse_vq_group vq_group;
> struct vduse_vq_group_asid vq_group_asid;
>
I'm happy to do it that way, but it is not clear to me that both
members are only if version >=1. Maybe "The following members are
valid only if..."?
> > + struct vduse_vq_group_asid vq_group_asid;
> > __u32 padding[32];
> > };
> > };
> > @@ -367,6 +395,8 @@ struct vduse_dev_response {
> > union {
> > struct vduse_vq_state vq_state;
> > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > + /* Only if vduse api version >= 1 */
> > + struct vduse_vq_group_asid vq_group_asid;
> > __u32 padding[32];
> > };
> > };
> > --
> > 2.50.1
> >
>
> Thanks
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 2/7] vduse: add vq group support
2025-08-11 9:51 ` Eugenio Perez Martin
@ 2025-08-12 3:01 ` Jason Wang
2025-08-13 10:11 ` Eugenio Perez Martin
0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-08-12 3:01 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 5:52 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > This allows sepparate the different virtqueues in groups that shares the
> > > same address space. Asking the VDUSE device for the groups of the vq at
> > > the beginning as they're needed for the DMA API.
> > >
> > > Allocating 3 vq groups as net is the device that need the most groups:
> > > * Dataplane (guest passthrough)
> > > * CVQ
> > > * Shadowed vrings.
> > >
> > > Future versions of the series can include dynamic allocation of the
> > > groups array so VDUSE can declare more groups.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > > v2:
> > > * Cache group information in kernel, as we need to provide the vq map
> > > tokens properly.
> > > * Add descs vq group to optimize SVQ forwarding and support indirect
> > > descriptors out of the box.
> > > ---
> > > drivers/vdpa/vdpa_user/vduse_dev.c | 71 +++++++++++++++++++++++++++++-
> > > include/uapi/linux/vduse.h | 19 +++++++-
> > > 2 files changed, 88 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index d858c4389cc1..d1f6d00a9c71 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -46,6 +46,11 @@
> > > #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> > > #define VDUSE_MSG_DEFAULT_TIMEOUT 30
> > >
> > > +/*
> > > + * Let's make it 3 for simplicity.
> > > + */
> > > +#define VDUSE_MAX_VQ_GROUPS 3
> >
> > I think we can release this to something like 64. Otherwise we might
> > bump the version again just to increase the limitation? Or having a
> > sysfs entry like bounce_size?
> >
>
> I think it should not be linked to the version, but it is true there
> is no way for VDUSE devices to check the maximum VQ groups / ASID that
> the kernel supports.
>
> To handle as bounce_size seems the best option, good point. I'll send
> a new version with that!
>
> > > +
> > > #define IRQ_UNBOUND -1
> > >
> > > struct vduse_virtqueue {
> > > @@ -58,6 +63,8 @@ struct vduse_virtqueue {
> > > struct vdpa_vq_state state;
> > > bool ready;
> > > bool kicked;
> > > + u32 vq_group;
> > > + u32 vq_desc_group;
> > > spinlock_t kick_lock;
> > > spinlock_t irq_lock;
> > > struct eventfd_ctx *kickfd;
> > > @@ -114,6 +121,7 @@ struct vduse_dev {
> > > u8 status;
> > > u32 vq_num;
> > > u32 vq_align;
> > > + u32 ngroups;
> > > struct vduse_umem *umem;
> > > struct mutex mem_lock;
> > > unsigned int bounce_size;
> > > @@ -592,6 +600,20 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > return 0;
> > > }
> > >
> > > +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > > +{
> > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > +
> > > + return dev->vqs[idx]->vq_group;
> > > +}
> > > +
> > > +static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > > +{
> > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > +
> > > + return dev->vqs[idx]->vq_desc_group;
> > > +}
> > > +
> > > static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > struct vdpa_vq_state *state)
> > > {
> > > @@ -678,13 +700,48 @@ static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > > return dev->status;
> > > }
> > >
> > > +static int vduse_fill_vq_groups(struct vduse_dev *dev)
> > > +{
> > > + if (dev->api_version < VDUSE_API_VERSION_1)
> > > + return 0;
> > > +
> > > + for (int i = 0; i < dev->vdev->vdpa.nvqs; ++i) {
> > > + struct vduse_dev_msg msg = { 0 };
> > > + int ret;
> > > +
> > > + msg.req.type = VDUSE_GET_VQ_GROUP;
> > > + msg.req.vq_group.index = i;
> > > + ret = vduse_dev_msg_sync(dev, &msg);
> >
> > I fail to understand why the default group mapping is not done during
> > device creation.
> >
>
> Because it changes depending on the features.
>
> If a new device has 5 virtqueues and the device wants to isolate the
> CVQ, the CVQ position depends on the features that the guest's acks:
> * If MQ is acked the isolated vq is #5
> * If MQ is not acked the isolated vq is #3.
I see we are still damaged by the design of the cvq index. But this is
just a static branch not a dynamic one. If we can find ways to make it
static it would be better.
>
> > > + if (ret)
> > > + return ret;
> > > +
> > > + dev->vqs[i]->vq_group = msg.resp.vq_group.num;
> > > +
> > > + msg.req.type = VDUSE_GET_VRING_DESC_GROUP;
> > > + ret = vduse_dev_msg_sync(dev, &msg);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + dev->vqs[i]->vq_desc_group = msg.resp.vq_group.num;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > > {
> > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > + u8 previous_status = dev->status;
> > >
> > > if (vduse_dev_set_status(dev, status))
> > > return;
> > >
> > > + if ((dev->status ^ previous_status) &
> > > + BIT_ULL(VIRTIO_CONFIG_S_FEATURES_OK) &&
> > > + status & (1ULL << VIRTIO_CONFIG_S_FEATURES_OK))
> > > + if (vduse_fill_vq_groups(dev))
> >
> > Can we merge the two messages into a single one? Or can we use a
> > shared memory for storing such mapping?
> >
> > For example, if we have 256 queues it would be very slow.
> >
>
> To merge it in the same message is good to me, sure.
We can start from this if we can't find a way to provision vq to group
mapping during device creation.
> To make it a
> table in shm seems more complicated, unless we accept a void * in the
> reply and VDUSE uses copy_from_user. If that is ok here, then sure.
>
This looks tricky indeed.
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 4/7] vduse: return internal vq group struct as map token
2025-08-11 11:03 ` Eugenio Perez Martin
@ 2025-08-12 3:02 ` Jason Wang
0 siblings, 0 replies; 24+ messages in thread
From: Jason Wang @ 2025-08-12 3:02 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 7:04 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 5:11 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Return the internal struct that represents the vq group as virtqueue map
> > > token, instead of the device.
> >
> > Note that Michael prefers to use the iova domain. This indeed seems to
> > be better.
> >
>
> Well iova domain would delete an indirection in the pointer chase, but
> it would be problematic to store the token in the caller.
>
> And we need to add some way to protect that the ASID of a vq group is
> not changed in the middle of the operation by an ioctl. IOW, the
> vq_group_internal struct pointer is constant for all the lifetime of
> the device, while iova_domain is not.
I will post a new version of DMA rework and switch to using the iova
domain there. Let's see if it works then.
>
> > > This allows the DMA functions to access
> >
> > s/DMA/map/
> >
>
> Ouch, thanks for the catch!
>
> > > the information per group.
> > >
> > > At this moment all the virtqueues share the same vq group, that only
> > > can point to ASID 0. This change prepares the infrastructure for actual
> > > per-group address space handling
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > Thanks
> >
>
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 5/7] vduse: add vq group asid support
2025-08-11 11:19 ` Eugenio Perez Martin
@ 2025-08-12 3:05 ` Jason Wang
2025-08-13 11:02 ` Eugenio Perez Martin
1 sibling, 0 replies; 24+ messages in thread
From: Jason Wang @ 2025-08-12 3:05 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 7:20 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 5:21 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > group. This enables mapping each group into a distinct memory space.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > > v2: Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > part of the struct is the same.
> > > ---
> > > drivers/vdpa/vdpa_user/vduse_dev.c | 259 ++++++++++++++++++++++-------
> > > include/uapi/linux/vduse.h | 36 +++-
> > > 2 files changed, 230 insertions(+), 65 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index a7a2749f5818..145147c49930 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -51,6 +51,12 @@
> > > */
> > > #define VDUSE_MAX_VQ_GROUPS 3
> > >
> > > +/*
> > > + * We can do ASID thanks to (ab)using the vduse device and the vdpa one. So no
> > > + * more than 2.
> > > + */
> > > +#define VDUSE_MAX_ASID 2
> >
> > Similar to group limitation, I'd like to either make it larger to 64
> > or make it changable via sysfs.
We can start from something that is simple (64).
> >
> > > +
> > > #define IRQ_UNBOUND -1
> > >
> > > struct vduse_virtqueue {
> > > @@ -91,6 +97,7 @@ struct vduse_umem {
> > > };
> > >
> > > struct vduse_vq_group_int {
> > > + struct vduse_iova_domain *domain;
> > > struct vduse_dev *dev;
> > > };
> > >
> > > @@ -98,7 +105,7 @@ struct vduse_dev {
> > > struct vduse_vdpa *vdev;
> > > struct device *dev;
> > > struct vduse_virtqueue **vqs;
> > > - struct vduse_iova_domain *domain;
> > > + struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> > > char *name;
> > > struct mutex lock;
> > > spinlock_t msg_lock;
> > > @@ -126,7 +133,8 @@ struct vduse_dev {
> > > u32 vq_num;
> > > u32 vq_align;
> > > u32 ngroups;
> > > - struct vduse_umem *umem;
> > > + u32 nas;
> > > + struct vduse_umem *umem[VDUSE_MAX_ASID];
> > > struct vduse_vq_group_int groups[VDUSE_MAX_VQ_GROUPS];
> > > struct mutex mem_lock;
> > > unsigned int bounce_size;
> > > @@ -440,14 +448,28 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > return mask;
> > > }
> > >
> > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > + unsigned int group, unsigned int asid)
> > > +{
> > > + guard(mutex)(&dev->domain_lock);
> > > + dev->groups[group].domain = dev->domain[asid];
> > > +}
> > > +
> > > static void vduse_dev_reset(struct vduse_dev *dev)
> > > {
> > > int i;
> > > - struct vduse_iova_domain *domain = dev->domain;
> > >
> > > /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > - if (domain && domain->bounce_map)
> > > - vduse_domain_reset_bounce_map(domain);
> > > + for (i = 0; i < dev->nas; i++) {
> > > + struct vduse_iova_domain *domain = dev->domain[i];
> > > +
> > > + if (domain && domain->bounce_map)
> > > + vduse_domain_reset_bounce_map(domain);
> > > + }
> > > +
> > > + for (i = 0; i < dev->ngroups; i++)
> > > + vduse_set_group_asid_nomsg(dev, i, 0);
> >
> > I vaguely remember Siwei has invented a feature that forbids the reset
> > of the mapping during virtio reset. If it's true, let's check that
> > feature flag.
> >
>
> Ouch, very good catch!
>
> > Btw even if we device to reset the mapping, should it be implied via
> > reset message itself instead of having more operations here?
> >
>
> This was resetting all to valid values by the start of the device
> operation, so no need for a new message at least for now.
>
> > >
> > > down_write(&dev->rwsem);
> > >
> > > @@ -619,6 +641,29 @@ static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > > return dev->vqs[idx]->vq_desc_group;
> > > }
> > >
> > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > + unsigned int asid)
> > > +{
> > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > + struct vduse_dev_msg msg = { 0 };
> > > + int r;
> > > +
> > > + if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > + group > dev->ngroups || asid > dev->nas)
> > > + return -EINVAL;
> > > +
> > > + msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > + msg.req.vq_group_asid.group = group;
> > > + msg.req.vq_group_asid.asid = asid;
> >
> > I wonder if this could be part of the shared memory?
> >
>
> I think it complicates the solution a lot, isn't it? How to notify the
> VDUSE device that these properties have changed, what vq group
> changed, etc? What do we want to achieve with the shm solution?
Rethink about this, I think you're right. But the point is to avoid
userspace transaction as much as possible.
>
> > > +
> > > + r = vduse_dev_msg_sync(dev, &msg);
> > > + if (r < 0)
> > > + return r;
> > > +
> > > + vduse_set_group_asid_nomsg(dev, group, asid);
> > > + return 0;
> > > +}
> > > +
> > > static void *vduse_get_vq_map_token(struct vdpa_device *vdpa, u16 idx)
> > > {
> > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > @@ -833,13 +878,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > int ret;
> > >
> > > - ret = vduse_domain_set_map(dev->domain, iotlb);
> > > + ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> > > if (ret)
> > > return ret;
> > >
> > > ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > if (ret) {
> > > - vduse_domain_clear_map(dev->domain, iotlb);
> > > + vduse_domain_clear_map(dev->domain[asid], iotlb);
> > > return ret;
> > > }
> > >
> > > @@ -883,6 +928,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > .get_vq_affinity = vduse_vdpa_get_vq_affinity,
> > > .reset = vduse_vdpa_reset,
> > > .set_map = vduse_vdpa_set_map,
> > > + .set_group_asid = vduse_set_group_asid,
> > > .get_vq_map_token = vduse_get_vq_map_token,
> > > .free = vduse_vdpa_free,
> > > };
> > > @@ -893,8 +939,10 @@ static void vduse_dev_sync_single_for_device(void *token,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > }
> > >
> > > @@ -904,8 +952,10 @@ static void vduse_dev_sync_single_for_cpu(void *token,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > }
> > >
> > > @@ -916,8 +966,10 @@ static dma_addr_t vduse_dev_map_page(void *token, struct page *page,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > }
> > >
> > > @@ -927,8 +979,10 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > }
> > >
> > > @@ -937,11 +991,13 @@ static void *vduse_dev_alloc_coherent(void *token, size_t size,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > > unsigned long iova;
> > > void *addr;
> > >
> > > *dma_addr = DMA_MAPPING_ERROR;
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > addr = vduse_domain_alloc_coherent(domain, size,
> > > (dma_addr_t *)&iova, flag);
> > > if (!addr)
> > > @@ -958,8 +1014,10 @@ static void vduse_dev_free_coherent(void *token, size_t size,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > > }
> > >
> > > @@ -967,8 +1025,10 @@ static size_t vduse_dev_max_mapping_size(void *token)
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > return domain->bounce_size;
> > > }
> > >
> > > @@ -1107,31 +1167,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > return ret;
> > > }
> > >
> > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > > u64 iova, u64 size)
> > > {
> > > int ret;
> > >
> > > mutex_lock(&dev->mem_lock);
> > > ret = -ENOENT;
> > > - if (!dev->umem)
> > > + if (!dev->umem[asid])
> > > goto unlock;
> > >
> > > ret = -EINVAL;
> > > - if (!dev->domain)
> > > + if (!dev->domain[asid])
> > > goto unlock;
> > >
> > > - if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > + if (dev->umem[asid]->iova != iova ||
> > > + size != dev->domain[asid]->bounce_size)
> > > goto unlock;
> > >
> > > - vduse_domain_remove_user_bounce_pages(dev->domain);
> > > - unpin_user_pages_dirty_lock(dev->umem->pages,
> > > - dev->umem->npages, true);
> > > - atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > - mmdrop(dev->umem->mm);
> > > - vfree(dev->umem->pages);
> > > - kfree(dev->umem);
> > > - dev->umem = NULL;
> > > + vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> > > + unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> > > + dev->umem[asid]->npages, true);
> > > + atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> > > + mmdrop(dev->umem[asid]->mm);
> > > + vfree(dev->umem[asid]->pages);
> > > + kfree(dev->umem[asid]);
> > > + dev->umem[asid] = NULL;
> > > ret = 0;
> > > unlock:
> > > mutex_unlock(&dev->mem_lock);
> > > @@ -1139,7 +1200,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > }
> > >
> > > static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > - u64 iova, u64 uaddr, u64 size)
> > > + u32 asid, u64 iova, u64 uaddr, u64 size)
> > > {
> > > struct page **page_list = NULL;
> > > struct vduse_umem *umem = NULL;
> > > @@ -1147,14 +1208,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > unsigned long npages, lock_limit;
> > > int ret;
> > >
> > > - if (!dev->domain || !dev->domain->bounce_map ||
> > > - size != dev->domain->bounce_size ||
> > > + if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> > > + size != dev->domain[asid]->bounce_size ||
> > > iova != 0 || uaddr & ~PAGE_MASK)
> > > return -EINVAL;
> > >
> > > mutex_lock(&dev->mem_lock);
> > > ret = -EEXIST;
> > > - if (dev->umem)
> > > + if (dev->umem[asid])
> > > goto unlock;
> > >
> > > ret = -ENOMEM;
> > > @@ -1178,7 +1239,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > goto out;
> > > }
> > >
> > > - ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > + ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> > > page_list, pinned);
> > > if (ret)
> > > goto out;
> > > @@ -1191,7 +1252,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > umem->mm = current->mm;
> > > mmgrab(current->mm);
> > >
> > > - dev->umem = umem;
> > > + dev->umem[asid] = umem;
> > > out:
> > > if (ret && pinned > 0)
> > > unpin_user_pages(page_list, pinned);
> > > @@ -1234,26 +1295,43 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >
> > > switch (cmd) {
> > > case VDUSE_IOTLB_GET_FD: {
> > > - struct vduse_iotlb_entry entry;
> > > + struct vduse_iotlb_entry_v2 entry;
> > > + struct vduse_iotlb_entry entry_old;
> > > struct vhost_iotlb_map *map;
> > > struct vdpa_map_file *map_file;
> > > struct file *f = NULL;
> > >
> > > ret = -EFAULT;
> > > - if (copy_from_user(&entry, argp, sizeof(entry)))
> > > - break;
> > > + if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > + if (copy_from_user(&entry, argp, sizeof(entry)))
> > > + break;
> > > + } else {
> > > + if (copy_from_user(&entry_old, argp,
> > > + sizeof(entry_old)))
> > > + break;
> > > +
> > > + entry.offset = entry_old.offset;
> > > + entry.start = entry_old.start;
> > > + entry.last = entry_old.last;
> > > + entry.asid = 0;
> > > + entry.perm = entry_old.perm;
> > > + }
> > >
> > > ret = -EINVAL;
> > > if (entry.start > entry.last)
> > > break;
> > >
> > > + if (entry.asid > dev->nas)
> > > + break;
> > > +
> > > mutex_lock(&dev->domain_lock);
> > > - if (!dev->domain) {
> > > + /* TODO accessing an array with idx from userspace, mitigations? */
> > > + if (!dev->domain[entry.asid]) {
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > }
> > > - spin_lock(&dev->domain->iotlb_lock);
> > > - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > + spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> > > + map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> > > entry.start, entry.last);
> > > if (map) {
> > > map_file = (struct vdpa_map_file *)map->opaque;
> > > @@ -1263,7 +1341,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > entry.last = map->last;
> > > entry.perm = map->perm;
> > > }
> > > - spin_unlock(&dev->domain->iotlb_lock);
> > > + spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> > > mutex_unlock(&dev->domain_lock);
> > > ret = -EINVAL;
> > > if (!f)
> > > @@ -1413,12 +1491,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > break;
> > >
> > > ret = -EINVAL;
> > > + /* TODO: Using asid from userspace, need to mitigate? */
> > > if (!is_mem_zero((const char *)umem.reserved,
> > > - sizeof(umem.reserved)))
> > > + sizeof(umem.reserved)) ||
> > > + (dev->api_version < VDUSE_API_VERSION_1 &&
> > > + umem.asid != 0) ||
> > > + umem.asid > dev->nas)
> > > break;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - ret = vduse_dev_reg_umem(dev, umem.iova,
> > > + ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> > > umem.uaddr, umem.size);
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > @@ -1431,15 +1513,24 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > break;
> > >
> > > ret = -EINVAL;
> > > + /* TODO: Using asid from userspace, need to mitigate? */
> > > if (!is_mem_zero((const char *)umem.reserved,
> > > - sizeof(umem.reserved)))
> > > + sizeof(umem.reserved)) ||
> > > + (dev->api_version < VDUSE_API_VERSION_1 &&
> > > + umem.asid != 0) ||
> > > + umem.asid > dev->nas)
> > > + break;
> > > +
> > > +
> > > + if (umem.asid > dev->nas)
> > > break;
> > > mutex_lock(&dev->domain_lock);
> > > - ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > + ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> > > umem.size);
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > }
> > > + /* TODO can we merge this with GET_FD? */
> >
> > Probably but might be too late.
> >
>
> Yep I should just remove this TODO actually.
>
> > > case VDUSE_IOTLB_GET_INFO: {
> > > struct vduse_iova_info info;
> > > struct vhost_iotlb_map *map;
> > > @@ -1452,27 +1543,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > if (info.start > info.last)
> > > break;
> > >
> > > + if (info.asid > dev->nas)
> > > + break;
> > > +
> > > if (!is_mem_zero((const char *)info.reserved,
> > > sizeof(info.reserved)))
> > > break;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - if (!dev->domain) {
> > > + /* TODO asid comes from userspace. mitigations? */
> >
> > We should do sanity check to avoid OOB at least?
> >
>
> There is an if (info.asid > dev->nas) break above, what is missing?
Ok I get, so what you want is array_index_nospec()?
>
> > > + if (!dev->domain[info.asid]) {
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > }
> > > - spin_lock(&dev->domain->iotlb_lock);
> > > - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > + spin_lock(&dev->domain[info.asid]->iotlb_lock);
> > > + map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> > > info.start, info.last);
> > > if (map) {
> > > info.start = map->start;
> > > info.last = map->last;
> > > info.capability = 0;
> > > - if (dev->domain->bounce_map && map->start == 0 &&
> > > - map->last == dev->domain->bounce_size - 1)
> > > + if (dev->domain[info.asid]->bounce_map &&
> > > + map->start == 0 &&
> > > + map->last == dev->domain[info.asid]->bounce_size - 1)
> > > info.capability |= VDUSE_IOVA_CAP_UMEM;
> > > }
> > > - spin_unlock(&dev->domain->iotlb_lock);
> > > + spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> > > mutex_unlock(&dev->domain_lock);
> > > if (!map)
> > > break;
> > > @@ -1497,8 +1593,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > > struct vduse_dev *dev = file->private_data;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - if (dev->domain)
> > > - vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > > + for (int i = 0; i < dev->nas; i++) {
> > > + if (dev->domain[i]) {
> > > + vduse_dev_dereg_umem(dev, i, 0,
> > > + dev->domain[i]->bounce_size);
> > > + dev->domain[i] = NULL;
> > > + }
> > > + }
> > > mutex_unlock(&dev->domain_lock);
> > > spin_lock(&dev->msg_lock);
> > > /* Make sure the inflight messages can processed after reconncection */
> > > @@ -1768,9 +1869,12 @@ static int vduse_destroy_dev(char *name)
> > > idr_remove(&vduse_idr, dev->minor);
> > > kvfree(dev->config);
> > > vduse_dev_deinit_vqs(dev);
> > > - if (dev->domain)
> > > - vduse_domain_destroy(dev->domain);
> > > + for (int i = 0; i < dev->nas; i++) {
> > > + if (dev->domain[i])
> > > + vduse_domain_destroy(dev->domain[i]);
> > > + }
> > > kfree(dev->name);
> > > + kfree(dev->umem);
> > > vduse_dev_destroy(dev);
> > > module_put(THIS_MODULE);
> > >
> > > @@ -1877,7 +1981,8 @@ static ssize_t bounce_size_store(struct device *device,
> > >
> > > ret = -EPERM;
> > > mutex_lock(&dev->domain_lock);
> > > - if (dev->domain)
> > > + /* Assuming that if the first domain is allocated, all are allocated */
> > > + if (dev->domain[0])
> > > goto unlock;
> > >
> > > ret = kstrtouint(buf, 10, &bounce_size);
> > > @@ -1933,11 +2038,20 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > > if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
> > > pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
> > > config->ngroups, VDUSE_MAX_VQ_GROUPS);
> > > + /* TODO: Need to destroy dev here too! */
> > > goto err;
> > > }
> > > + if (config->nas > VDUSE_MAX_ASID) {
> > > + pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> > > + config->nas, VDUSE_MAX_ASID);
> > > + goto err_nas;
> > > + }
> > > +
> > > dev->ngroups = config->ngroups ?: 1;
> > > + dev->nas = config->nas ?: 1;
> > > } else {
> > > dev->ngroups = 1;
> > > + dev->nas = 1;
> > > }
> > > for (u32 i = 0; i < dev->ngroups; ++i)
> > > dev->groups[i].dev = dev;
> > > @@ -1977,6 +2091,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > > err_idr:
> > > kfree(dev->name);
> > > err_str:
> > > +err_nas:
> > > vduse_dev_destroy(dev);
> > > err:
> > > return ret;
> > > @@ -2069,7 +2184,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> > > if (!control)
> > > return -ENOMEM;
> > >
> > > - control->api_version = VDUSE_API_VERSION;
> > > file->private_data = control;
> > >
> > > return 0;
> > > @@ -2101,7 +2215,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > >
> > > vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > > &vduse_vdpa_config_ops, &vduse_map_ops,
> > > - dev->ngroups, 1, name, true);
> > > + dev->ngroups, dev->nas, name, true);
> > > if (IS_ERR(vdev))
> > > return PTR_ERR(vdev);
> > >
> > > @@ -2137,11 +2251,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > > return ret;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - if (!dev->domain)
> > > - dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > - dev->bounce_size);
> > > + ret = 0;
> > > +
> > > + /* TODO we could delay the creation of the domain */
> >
> > I don't get what this means?
> >
>
> If the device support all 64 asid but it never uses them we're
> consuming the memory for all the struct vduse_iova_domain, struct
> vduse_umem, etc. We could delay its creating until we start using it,
> but it complicates the code.
If it doesn't takes too much memory, we can start from a static
allocation (if that's what you meant here).
>
> > > + for (int i = 0; i < dev->nas; ++i) {
> > > + if (!dev->domain[i])
> > > + dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > + dev->bounce_size);
> > > + if (!dev->domain[i]) {
> > > + ret = -ENOMEM;
> > > + for (int j = 0; j < i; ++j)
> > > + vduse_domain_destroy(dev->domain[j]);
> >
> > Why not move the above error handling to the err label?
> >
>
> That implies I need to signal the index of successfully created
> domains. We only need to handle that in this part of the code. To move
> it is not impossible of course but it complicates the code IMO. I'd be
> happy to move it if requested though.
>
> > > + goto err_domain;
> > > + }
> > > + }
> > > +
> > > mutex_unlock(&dev->domain_lock);
> > > - if (!dev->domain) {
> > > + if (ret == -ENOMEM) {
> > > put_device(&dev->vdev->vdpa.dev);
> > > return -ENOMEM;
> > > }
> > > @@ -2150,13 +2276,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > > if (ret) {
> > > put_device(&dev->vdev->vdpa.dev);
> > > mutex_lock(&dev->domain_lock);
> > > - vduse_domain_destroy(dev->domain);
> > > - dev->domain = NULL;
> > > + for (int i = 0; i < dev->nas; i++) {
> > > + if (dev->domain[i]) {
> > > + vduse_domain_destroy(dev->domain[i]);
> > > + dev->domain[i] = NULL;
> > > + }
> > > + }
> > > mutex_unlock(&dev->domain_lock);
> > > return ret;
> > > }
> > >
> > > return 0;
> > > +
> > > +err_domain:
> > > + /* TODO do I need to call put_device? */
> >
> > If it comes after the success of vduse_dev_init_vdpa().
> >
>
> Ouch, that was a TODO for me but I forgot to remove it, sorry :).
>
> > > + mutex_unlock(&dev->domain_lock);
> > > + return ret;
> > > }
> > >
> > > static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > index b4b139dc76bb..d300fd5f867f 100644
> > > --- a/include/uapi/linux/vduse.h
> > > +++ b/include/uapi/linux/vduse.h
> > > @@ -47,7 +47,8 @@ struct vduse_dev_config {
> > > __u32 vq_num;
> > > __u32 vq_align;
> > > __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > > - __u32 reserved[12];
> > > + __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > > + __u32 reserved[11];
> > > __u32 config_size;
> > > __u8 config[];
> > > };
> > > @@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
> > > __u8 perm;
> > > };
> > >
> > > +struct vduse_iotlb_entry_v2 {
> > > + __u64 offset;
> > > + __u64 start;
> > > + __u64 last;
> > > +#define VDUSE_ACCESS_RO 0x1
> > > +#define VDUSE_ACCESS_WO 0x2
> > > +#define VDUSE_ACCESS_RW 0x3
> > > + __u8 perm;
> >
> > We can embed vduse_iotlb_entry here. This will ease the userspace development.
> >
>
> I agree, I'll do it that way for the next version.
>
> > > + __u32 asid;
> > > +};
> > > +
> > > /*
> > > * Find the first IOVA region that overlaps with the range [start, last]
> > > * and return the corresponding file descriptor. Return -EINVAL means the
> > > @@ -172,6 +184,16 @@ struct vduse_vq_group {
> > > __u32 num;
> > > };
> > >
> > > +/**
> > > + * struct vduse_vq_group - virtqueue group
> > > + @ @group: Index of the virtqueue group
> > > + * @asid: Address space ID of the group
> > > + */
> > > +struct vduse_vq_group_asid {
> > > + __u32 group;
> > > + __u32 asid;
> > > +};
> > > +
> > > /**
> > > * struct vduse_vq_info - information of a virtqueue
> > > * @index: virtqueue index
> > > @@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
> > > * @uaddr: start address of userspace memory, it must be aligned to page size
> > > * @iova: start of the IOVA region
> > > * @size: size of the IOVA region
> > > + * @asid: Address space ID of the IOVA region
> > > * @reserved: for future use, needs to be initialized to zero
> > > *
> > > * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > > @@ -241,7 +264,8 @@ struct vduse_iova_umem {
> > > __u64 uaddr;
> > > __u64 iova;
> > > __u64 size;
> > > - __u64 reserved[3];
> > > + __u32 asid;
> > > + __u32 reserved[5];
> > > };
> > >
> > > /* Register userspace memory for IOVA regions */
> > > @@ -265,7 +289,8 @@ struct vduse_iova_info {
> > > __u64 last;
> > > #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> > > __u64 capability;
> > > - __u64 reserved[3];
> > > + __u32 asid; /* Only if device API version >= 1 */
> > > + __u32 reserved[5];
> > > };
> > >
> > > /*
> > > @@ -289,6 +314,7 @@ enum vduse_req_type {
> > > VDUSE_UPDATE_IOTLB,
> > > VDUSE_GET_VQ_GROUP,
> > > VDUSE_GET_VRING_DESC_GROUP,
> > > + VDUSE_SET_VQ_GROUP_ASID,
> > > };
> > >
> > > /**
> > > @@ -344,6 +370,8 @@ struct vduse_dev_request {
> > > struct vduse_dev_status s;
> > > struct vduse_iova_range iova;
> > > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > + /* Only if vduse api version >= 1 */
> >
> > We can use:
> >
> > /* Only if vduse api version >= 1 */
> > struct vduse_vq_group vq_group;
> > struct vduse_vq_group_asid vq_group_asid;
> >
>
> I'm happy to do it that way, but it is not clear to me that both
> members are only if version >=1. Maybe "The following members are
> valid only if..."?
>
> > > + struct vduse_vq_group_asid vq_group_asid;
> > > __u32 padding[32];
> > > };
> > > };
> > > @@ -367,6 +395,8 @@ struct vduse_dev_response {
> > > union {
> > > struct vduse_vq_state vq_state;
> > > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > + /* Only if vduse api version >= 1 */
> > > + struct vduse_vq_group_asid vq_group_asid;
> > > __u32 padding[32];
> > > };
> > > };
> > > --
> > > 2.50.1
> > >
> >
> > Thanks
> >
>
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 2/7] vduse: add vq group support
2025-08-12 3:01 ` Jason Wang
@ 2025-08-13 10:11 ` Eugenio Perez Martin
2025-08-14 3:42 ` Jason Wang
0 siblings, 1 reply; 24+ messages in thread
From: Eugenio Perez Martin @ 2025-08-13 10:11 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Tue, Aug 12, 2025 at 5:01 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 5:52 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > This allows sepparate the different virtqueues in groups that shares the
> > > > same address space. Asking the VDUSE device for the groups of the vq at
> > > > the beginning as they're needed for the DMA API.
> > > >
> > > > Allocating 3 vq groups as net is the device that need the most groups:
> > > > * Dataplane (guest passthrough)
> > > > * CVQ
> > > > * Shadowed vrings.
> > > >
> > > > Future versions of the series can include dynamic allocation of the
> > > > groups array so VDUSE can declare more groups.
> > > >
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > > v2:
> > > > * Cache group information in kernel, as we need to provide the vq map
> > > > tokens properly.
> > > > * Add descs vq group to optimize SVQ forwarding and support indirect
> > > > descriptors out of the box.
> > > > ---
> > > > drivers/vdpa/vdpa_user/vduse_dev.c | 71 +++++++++++++++++++++++++++++-
> > > > include/uapi/linux/vduse.h | 19 +++++++-
> > > > 2 files changed, 88 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > index d858c4389cc1..d1f6d00a9c71 100644
> > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > @@ -46,6 +46,11 @@
> > > > #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> > > > #define VDUSE_MSG_DEFAULT_TIMEOUT 30
> > > >
> > > > +/*
> > > > + * Let's make it 3 for simplicity.
> > > > + */
> > > > +#define VDUSE_MAX_VQ_GROUPS 3
> > >
> > > I think we can release this to something like 64. Otherwise we might
> > > bump the version again just to increase the limitation? Or having a
> > > sysfs entry like bounce_size?
> > >
> >
> > I think it should not be linked to the version, but it is true there
> > is no way for VDUSE devices to check the maximum VQ groups / ASID that
> > the kernel supports.
> >
> > To handle as bounce_size seems the best option, good point. I'll send
> > a new version with that!
> >
> > > > +
> > > > #define IRQ_UNBOUND -1
> > > >
> > > > struct vduse_virtqueue {
> > > > @@ -58,6 +63,8 @@ struct vduse_virtqueue {
> > > > struct vdpa_vq_state state;
> > > > bool ready;
> > > > bool kicked;
> > > > + u32 vq_group;
> > > > + u32 vq_desc_group;
> > > > spinlock_t kick_lock;
> > > > spinlock_t irq_lock;
> > > > struct eventfd_ctx *kickfd;
> > > > @@ -114,6 +121,7 @@ struct vduse_dev {
> > > > u8 status;
> > > > u32 vq_num;
> > > > u32 vq_align;
> > > > + u32 ngroups;
> > > > struct vduse_umem *umem;
> > > > struct mutex mem_lock;
> > > > unsigned int bounce_size;
> > > > @@ -592,6 +600,20 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > return 0;
> > > > }
> > > >
> > > > +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > > > +{
> > > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > +
> > > > + return dev->vqs[idx]->vq_group;
> > > > +}
> > > > +
> > > > +static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > > > +{
> > > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > +
> > > > + return dev->vqs[idx]->vq_desc_group;
> > > > +}
> > > > +
> > > > static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > struct vdpa_vq_state *state)
> > > > {
> > > > @@ -678,13 +700,48 @@ static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > > > return dev->status;
> > > > }
> > > >
> > > > +static int vduse_fill_vq_groups(struct vduse_dev *dev)
> > > > +{
> > > > + if (dev->api_version < VDUSE_API_VERSION_1)
> > > > + return 0;
> > > > +
> > > > + for (int i = 0; i < dev->vdev->vdpa.nvqs; ++i) {
> > > > + struct vduse_dev_msg msg = { 0 };
> > > > + int ret;
> > > > +
> > > > + msg.req.type = VDUSE_GET_VQ_GROUP;
> > > > + msg.req.vq_group.index = i;
> > > > + ret = vduse_dev_msg_sync(dev, &msg);
> > >
> > > I fail to understand why the default group mapping is not done during
> > > device creation.
> > >
> >
> > Because it changes depending on the features.
> >
> > If a new device has 5 virtqueues and the device wants to isolate the
> > CVQ, the CVQ position depends on the features that the guest's acks:
> > * If MQ is acked the isolated vq is #5
> > * If MQ is not acked the isolated vq is #3.
>
> I see we are still damaged by the design of the cvq index. But this is
> just a static branch not a dynamic one. If we can find ways to make it
> static it would be better.
>
> >
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + dev->vqs[i]->vq_group = msg.resp.vq_group.num;
> > > > +
> > > > + msg.req.type = VDUSE_GET_VRING_DESC_GROUP;
> > > > + ret = vduse_dev_msg_sync(dev, &msg);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + dev->vqs[i]->vq_desc_group = msg.resp.vq_group.num;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > > > {
> > > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > + u8 previous_status = dev->status;
> > > >
> > > > if (vduse_dev_set_status(dev, status))
> > > > return;
> > > >
> > > > + if ((dev->status ^ previous_status) &
> > > > + BIT_ULL(VIRTIO_CONFIG_S_FEATURES_OK) &&
> > > > + status & (1ULL << VIRTIO_CONFIG_S_FEATURES_OK))
> > > > + if (vduse_fill_vq_groups(dev))
> > >
> > > Can we merge the two messages into a single one? Or can we use a
> > > shared memory for storing such mapping?
> > >
> > > For example, if we have 256 queues it would be very slow.
> > >
> >
> > To merge it in the same message is good to me, sure.
>
> We can start from this if we can't find a way to provision vq to group
> mapping during device creation.
>
Note that I don't think it is worth implementing these in this series,
but to add them on top in another one. Because I don't think we will
find devices with a lot of virtqueues for now. But here are some ideas
to mitigate the startup time cost:
1) Support more than one virtqueue in the same vduse request / reply
Something like:
vduse_dev_response{
u32 req_id;
u32 result;
union {
...
struct vduse_vq_group {
u32 num_vqs_requested;
struct {
u32 vq_idx;
u32 vq_group;
} vqs[15];
}
u32 padding[32];
}
}
Choosing 15 to fill the current size of vduse_dev_response struct.
2) Pointer chasing in the struct written
Same as previous, but the vqs struct is actually a pointer in
userspace. This way it can be arbitrarily big.
vduse_dev_response{
u32 req_id;
u32 result;
union {
...
struct vduse_vq_group {
u32 num_vqs_requested;
struct {
u32 vq_idx;
u32 vq_group;
} *vqs;
}
u32 padding[32];
}
}
I cannot locate any use of this in write() data, but it is more or
less common in ioctl.
3) Allow VQ_GROUP_BATCH_BEGIN and _END, similar to how memory map
works in vhost_vdpa. As many vq_group response as needed in between.
+) Assume that any vq not mentioned in the reply is vq group 0.
> > To make it a
> > table in shm seems more complicated, unless we accept a void * in the
> > reply and VDUSE uses copy_from_user. If that is ok here, then sure.
> >
>
> This looks tricky indeed.
>
> Thanks
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 5/7] vduse: add vq group asid support
2025-08-11 11:19 ` Eugenio Perez Martin
2025-08-12 3:05 ` Jason Wang
@ 2025-08-13 11:02 ` Eugenio Perez Martin
1 sibling, 0 replies; 24+ messages in thread
From: Eugenio Perez Martin @ 2025-08-13 11:02 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Mon, Aug 11, 2025 at 1:19 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 5:21 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > group. This enables mapping each group into a distinct memory space.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > > v2: Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > part of the struct is the same.
> > > ---
> > > drivers/vdpa/vdpa_user/vduse_dev.c | 259 ++++++++++++++++++++++-------
> > > include/uapi/linux/vduse.h | 36 +++-
> > > 2 files changed, 230 insertions(+), 65 deletions(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index a7a2749f5818..145147c49930 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -51,6 +51,12 @@
> > > */
> > > #define VDUSE_MAX_VQ_GROUPS 3
> > >
> > > +/*
> > > + * We can do ASID thanks to (ab)using the vduse device and the vdpa one. So no
> > > + * more than 2.
> > > + */
> > > +#define VDUSE_MAX_ASID 2
> >
> > Similar to group limitation, I'd like to either make it larger to 64
> > or make it changable via sysfs.
> >
> > > +
> > > #define IRQ_UNBOUND -1
> > >
> > > struct vduse_virtqueue {
> > > @@ -91,6 +97,7 @@ struct vduse_umem {
> > > };
> > >
> > > struct vduse_vq_group_int {
> > > + struct vduse_iova_domain *domain;
> > > struct vduse_dev *dev;
> > > };
> > >
> > > @@ -98,7 +105,7 @@ struct vduse_dev {
> > > struct vduse_vdpa *vdev;
> > > struct device *dev;
> > > struct vduse_virtqueue **vqs;
> > > - struct vduse_iova_domain *domain;
> > > + struct vduse_iova_domain *domain[VDUSE_MAX_ASID];
> > > char *name;
> > > struct mutex lock;
> > > spinlock_t msg_lock;
> > > @@ -126,7 +133,8 @@ struct vduse_dev {
> > > u32 vq_num;
> > > u32 vq_align;
> > > u32 ngroups;
> > > - struct vduse_umem *umem;
> > > + u32 nas;
> > > + struct vduse_umem *umem[VDUSE_MAX_ASID];
> > > struct vduse_vq_group_int groups[VDUSE_MAX_VQ_GROUPS];
> > > struct mutex mem_lock;
> > > unsigned int bounce_size;
> > > @@ -440,14 +448,28 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > return mask;
> > > }
> > >
> > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > + unsigned int group, unsigned int asid)
> > > +{
> > > + guard(mutex)(&dev->domain_lock);
> > > + dev->groups[group].domain = dev->domain[asid];
> > > +}
> > > +
> > > static void vduse_dev_reset(struct vduse_dev *dev)
> > > {
> > > int i;
> > > - struct vduse_iova_domain *domain = dev->domain;
> > >
> > > /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > - if (domain && domain->bounce_map)
> > > - vduse_domain_reset_bounce_map(domain);
> > > + for (i = 0; i < dev->nas; i++) {
> > > + struct vduse_iova_domain *domain = dev->domain[i];
> > > +
> > > + if (domain && domain->bounce_map)
> > > + vduse_domain_reset_bounce_map(domain);
> > > + }
> > > +
> > > + for (i = 0; i < dev->ngroups; i++)
> > > + vduse_set_group_asid_nomsg(dev, i, 0);
> >
> > I vaguely remember Siwei has invented a feature that forbids the reset
> > of the mapping during virtio reset. If it's true, let's check that
> > feature flag.
> >
>
> Ouch, very good catch!
>
Ok so the behavior is following what mlx and vdpa_sim do: It reset the
vq_group -> asid maps but it does not reset the iotlb of each asid.
> > Btw even if we device to reset the mapping, should it be implied via
> > reset message itself instead of having more operations here?
> >
>
> This was resetting all to valid values by the start of the device
> operation, so no need for a new message at least for now.
>
> > >
> > > down_write(&dev->rwsem);
> > >
> > > @@ -619,6 +641,29 @@ static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > > return dev->vqs[idx]->vq_desc_group;
> > > }
> > >
> > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > + unsigned int asid)
> > > +{
> > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > + struct vduse_dev_msg msg = { 0 };
> > > + int r;
> > > +
> > > + if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > + group > dev->ngroups || asid > dev->nas)
> > > + return -EINVAL;
> > > +
> > > + msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > + msg.req.vq_group_asid.group = group;
> > > + msg.req.vq_group_asid.asid = asid;
> >
> > I wonder if this could be part of the shared memory?
> >
>
> I think it complicates the solution a lot, isn't it? How to notify the
> VDUSE device that these properties have changed, what vq group
> changed, etc? What do we want to achieve with the shm solution?
>
> > > +
> > > + r = vduse_dev_msg_sync(dev, &msg);
> > > + if (r < 0)
> > > + return r;
> > > +
> > > + vduse_set_group_asid_nomsg(dev, group, asid);
> > > + return 0;
> > > +}
> > > +
> > > static void *vduse_get_vq_map_token(struct vdpa_device *vdpa, u16 idx)
> > > {
> > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > @@ -833,13 +878,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > int ret;
> > >
> > > - ret = vduse_domain_set_map(dev->domain, iotlb);
> > > + ret = vduse_domain_set_map(dev->domain[asid], iotlb);
> > > if (ret)
> > > return ret;
> > >
> > > ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > if (ret) {
> > > - vduse_domain_clear_map(dev->domain, iotlb);
> > > + vduse_domain_clear_map(dev->domain[asid], iotlb);
> > > return ret;
> > > }
> > >
> > > @@ -883,6 +928,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > .get_vq_affinity = vduse_vdpa_get_vq_affinity,
> > > .reset = vduse_vdpa_reset,
> > > .set_map = vduse_vdpa_set_map,
> > > + .set_group_asid = vduse_set_group_asid,
> > > .get_vq_map_token = vduse_get_vq_map_token,
> > > .free = vduse_vdpa_free,
> > > };
> > > @@ -893,8 +939,10 @@ static void vduse_dev_sync_single_for_device(void *token,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > }
> > >
> > > @@ -904,8 +952,10 @@ static void vduse_dev_sync_single_for_cpu(void *token,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > }
> > >
> > > @@ -916,8 +966,10 @@ static dma_addr_t vduse_dev_map_page(void *token, struct page *page,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > > }
> > >
> > > @@ -927,8 +979,10 @@ static void vduse_dev_unmap_page(void *token, dma_addr_t dma_addr,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > > }
> > >
> > > @@ -937,11 +991,13 @@ static void *vduse_dev_alloc_coherent(void *token, size_t size,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > > unsigned long iova;
> > > void *addr;
> > >
> > > *dma_addr = DMA_MAPPING_ERROR;
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > addr = vduse_domain_alloc_coherent(domain, size,
> > > (dma_addr_t *)&iova, flag);
> > > if (!addr)
> > > @@ -958,8 +1014,10 @@ static void vduse_dev_free_coherent(void *token, size_t size,
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > > }
> > >
> > > @@ -967,8 +1025,10 @@ static size_t vduse_dev_max_mapping_size(void *token)
> > > {
> > > struct vduse_vq_group_int *group = token;
> > > struct vduse_dev *vdev = group->dev;
> > > - struct vduse_iova_domain *domain = vdev->domain;
> > > + struct vduse_iova_domain *domain;
> > >
> > > + guard(mutex)(&vdev->domain_lock);
> > > + domain = group->domain;
> > > return domain->bounce_size;
> > > }
> > >
> > > @@ -1107,31 +1167,32 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > return ret;
> > > }
> > >
> > > -static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, u32 asid,
> > > u64 iova, u64 size)
> > > {
> > > int ret;
> > >
> > > mutex_lock(&dev->mem_lock);
> > > ret = -ENOENT;
> > > - if (!dev->umem)
> > > + if (!dev->umem[asid])
> > > goto unlock;
> > >
> > > ret = -EINVAL;
> > > - if (!dev->domain)
> > > + if (!dev->domain[asid])
> > > goto unlock;
> > >
> > > - if (dev->umem->iova != iova || size != dev->domain->bounce_size)
> > > + if (dev->umem[asid]->iova != iova ||
> > > + size != dev->domain[asid]->bounce_size)
> > > goto unlock;
> > >
> > > - vduse_domain_remove_user_bounce_pages(dev->domain);
> > > - unpin_user_pages_dirty_lock(dev->umem->pages,
> > > - dev->umem->npages, true);
> > > - atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
> > > - mmdrop(dev->umem->mm);
> > > - vfree(dev->umem->pages);
> > > - kfree(dev->umem);
> > > - dev->umem = NULL;
> > > + vduse_domain_remove_user_bounce_pages(dev->domain[asid]);
> > > + unpin_user_pages_dirty_lock(dev->umem[asid]->pages,
> > > + dev->umem[asid]->npages, true);
> > > + atomic64_sub(dev->umem[asid]->npages, &dev->umem[asid]->mm->pinned_vm);
> > > + mmdrop(dev->umem[asid]->mm);
> > > + vfree(dev->umem[asid]->pages);
> > > + kfree(dev->umem[asid]);
> > > + dev->umem[asid] = NULL;
> > > ret = 0;
> > > unlock:
> > > mutex_unlock(&dev->mem_lock);
> > > @@ -1139,7 +1200,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
> > > }
> > >
> > > static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > - u64 iova, u64 uaddr, u64 size)
> > > + u32 asid, u64 iova, u64 uaddr, u64 size)
> > > {
> > > struct page **page_list = NULL;
> > > struct vduse_umem *umem = NULL;
> > > @@ -1147,14 +1208,14 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > unsigned long npages, lock_limit;
> > > int ret;
> > >
> > > - if (!dev->domain || !dev->domain->bounce_map ||
> > > - size != dev->domain->bounce_size ||
> > > + if (!dev->domain[asid] || !dev->domain[asid]->bounce_map ||
> > > + size != dev->domain[asid]->bounce_size ||
> > > iova != 0 || uaddr & ~PAGE_MASK)
> > > return -EINVAL;
> > >
> > > mutex_lock(&dev->mem_lock);
> > > ret = -EEXIST;
> > > - if (dev->umem)
> > > + if (dev->umem[asid])
> > > goto unlock;
> > >
> > > ret = -ENOMEM;
> > > @@ -1178,7 +1239,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > goto out;
> > > }
> > >
> > > - ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > + ret = vduse_domain_add_user_bounce_pages(dev->domain[asid],
> > > page_list, pinned);
> > > if (ret)
> > > goto out;
> > > @@ -1191,7 +1252,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
> > > umem->mm = current->mm;
> > > mmgrab(current->mm);
> > >
> > > - dev->umem = umem;
> > > + dev->umem[asid] = umem;
> > > out:
> > > if (ret && pinned > 0)
> > > unpin_user_pages(page_list, pinned);
> > > @@ -1234,26 +1295,43 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >
> > > switch (cmd) {
> > > case VDUSE_IOTLB_GET_FD: {
> > > - struct vduse_iotlb_entry entry;
> > > + struct vduse_iotlb_entry_v2 entry;
> > > + struct vduse_iotlb_entry entry_old;
> > > struct vhost_iotlb_map *map;
> > > struct vdpa_map_file *map_file;
> > > struct file *f = NULL;
> > >
> > > ret = -EFAULT;
> > > - if (copy_from_user(&entry, argp, sizeof(entry)))
> > > - break;
> > > + if (dev->api_version >= VDUSE_API_VERSION_1) {
> > > + if (copy_from_user(&entry, argp, sizeof(entry)))
> > > + break;
> > > + } else {
> > > + if (copy_from_user(&entry_old, argp,
> > > + sizeof(entry_old)))
> > > + break;
> > > +
> > > + entry.offset = entry_old.offset;
> > > + entry.start = entry_old.start;
> > > + entry.last = entry_old.last;
> > > + entry.asid = 0;
> > > + entry.perm = entry_old.perm;
> > > + }
> > >
> > > ret = -EINVAL;
> > > if (entry.start > entry.last)
> > > break;
> > >
> > > + if (entry.asid > dev->nas)
> > > + break;
> > > +
> > > mutex_lock(&dev->domain_lock);
> > > - if (!dev->domain) {
> > > + /* TODO accessing an array with idx from userspace, mitigations? */
> > > + if (!dev->domain[entry.asid]) {
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > }
> > > - spin_lock(&dev->domain->iotlb_lock);
> > > - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > + spin_lock(&dev->domain[entry.asid]->iotlb_lock);
> > > + map = vhost_iotlb_itree_first(dev->domain[entry.asid]->iotlb,
> > > entry.start, entry.last);
> > > if (map) {
> > > map_file = (struct vdpa_map_file *)map->opaque;
> > > @@ -1263,7 +1341,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > entry.last = map->last;
> > > entry.perm = map->perm;
> > > }
> > > - spin_unlock(&dev->domain->iotlb_lock);
> > > + spin_unlock(&dev->domain[entry.asid]->iotlb_lock);
> > > mutex_unlock(&dev->domain_lock);
> > > ret = -EINVAL;
> > > if (!f)
> > > @@ -1413,12 +1491,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > break;
> > >
> > > ret = -EINVAL;
> > > + /* TODO: Using asid from userspace, need to mitigate? */
> > > if (!is_mem_zero((const char *)umem.reserved,
> > > - sizeof(umem.reserved)))
> > > + sizeof(umem.reserved)) ||
> > > + (dev->api_version < VDUSE_API_VERSION_1 &&
> > > + umem.asid != 0) ||
> > > + umem.asid > dev->nas)
> > > break;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - ret = vduse_dev_reg_umem(dev, umem.iova,
> > > + ret = vduse_dev_reg_umem(dev, umem.asid, umem.iova,
> > > umem.uaddr, umem.size);
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > @@ -1431,15 +1513,24 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > break;
> > >
> > > ret = -EINVAL;
> > > + /* TODO: Using asid from userspace, need to mitigate? */
> > > if (!is_mem_zero((const char *)umem.reserved,
> > > - sizeof(umem.reserved)))
> > > + sizeof(umem.reserved)) ||
> > > + (dev->api_version < VDUSE_API_VERSION_1 &&
> > > + umem.asid != 0) ||
> > > + umem.asid > dev->nas)
> > > + break;
> > > +
> > > +
> > > + if (umem.asid > dev->nas)
> > > break;
> > > mutex_lock(&dev->domain_lock);
> > > - ret = vduse_dev_dereg_umem(dev, umem.iova,
> > > + ret = vduse_dev_dereg_umem(dev, umem.asid, umem.iova,
> > > umem.size);
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > }
> > > + /* TODO can we merge this with GET_FD? */
> >
> > Probably but might be too late.
> >
>
> Yep I should just remove this TODO actually.
>
> > > case VDUSE_IOTLB_GET_INFO: {
> > > struct vduse_iova_info info;
> > > struct vhost_iotlb_map *map;
> > > @@ -1452,27 +1543,32 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > if (info.start > info.last)
> > > break;
> > >
> > > + if (info.asid > dev->nas)
> > > + break;
> > > +
> > > if (!is_mem_zero((const char *)info.reserved,
> > > sizeof(info.reserved)))
> > > break;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - if (!dev->domain) {
> > > + /* TODO asid comes from userspace. mitigations? */
> >
> > We should do sanity check to avoid OOB at least?
> >
>
> There is an if (info.asid > dev->nas) break above, what is missing?
>
> > > + if (!dev->domain[info.asid]) {
> > > mutex_unlock(&dev->domain_lock);
> > > break;
> > > }
> > > - spin_lock(&dev->domain->iotlb_lock);
> > > - map = vhost_iotlb_itree_first(dev->domain->iotlb,
> > > + spin_lock(&dev->domain[info.asid]->iotlb_lock);
> > > + map = vhost_iotlb_itree_first(dev->domain[info.asid]->iotlb,
> > > info.start, info.last);
> > > if (map) {
> > > info.start = map->start;
> > > info.last = map->last;
> > > info.capability = 0;
> > > - if (dev->domain->bounce_map && map->start == 0 &&
> > > - map->last == dev->domain->bounce_size - 1)
> > > + if (dev->domain[info.asid]->bounce_map &&
> > > + map->start == 0 &&
> > > + map->last == dev->domain[info.asid]->bounce_size - 1)
> > > info.capability |= VDUSE_IOVA_CAP_UMEM;
> > > }
> > > - spin_unlock(&dev->domain->iotlb_lock);
> > > + spin_unlock(&dev->domain[info.asid]->iotlb_lock);
> > > mutex_unlock(&dev->domain_lock);
> > > if (!map)
> > > break;
> > > @@ -1497,8 +1593,13 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > > struct vduse_dev *dev = file->private_data;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - if (dev->domain)
> > > - vduse_dev_dereg_umem(dev, 0, dev->domain->bounce_size);
> > > + for (int i = 0; i < dev->nas; i++) {
> > > + if (dev->domain[i]) {
> > > + vduse_dev_dereg_umem(dev, i, 0,
> > > + dev->domain[i]->bounce_size);
> > > + dev->domain[i] = NULL;
> > > + }
> > > + }
> > > mutex_unlock(&dev->domain_lock);
> > > spin_lock(&dev->msg_lock);
> > > /* Make sure the inflight messages can processed after reconncection */
> > > @@ -1768,9 +1869,12 @@ static int vduse_destroy_dev(char *name)
> > > idr_remove(&vduse_idr, dev->minor);
> > > kvfree(dev->config);
> > > vduse_dev_deinit_vqs(dev);
> > > - if (dev->domain)
> > > - vduse_domain_destroy(dev->domain);
> > > + for (int i = 0; i < dev->nas; i++) {
> > > + if (dev->domain[i])
> > > + vduse_domain_destroy(dev->domain[i]);
> > > + }
> > > kfree(dev->name);
> > > + kfree(dev->umem);
> > > vduse_dev_destroy(dev);
> > > module_put(THIS_MODULE);
> > >
> > > @@ -1877,7 +1981,8 @@ static ssize_t bounce_size_store(struct device *device,
> > >
> > > ret = -EPERM;
> > > mutex_lock(&dev->domain_lock);
> > > - if (dev->domain)
> > > + /* Assuming that if the first domain is allocated, all are allocated */
> > > + if (dev->domain[0])
> > > goto unlock;
> > >
> > > ret = kstrtouint(buf, 10, &bounce_size);
> > > @@ -1933,11 +2038,20 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > > if (config->ngroups > VDUSE_MAX_VQ_GROUPS) {
> > > pr_err("Not creating a VDUSE device with %u vq groups. Max: %u",
> > > config->ngroups, VDUSE_MAX_VQ_GROUPS);
> > > + /* TODO: Need to destroy dev here too! */
> > > goto err;
> > > }
> > > + if (config->nas > VDUSE_MAX_ASID) {
> > > + pr_err("Not creating a VDUSE device with %u asid. Max: %u",
> > > + config->nas, VDUSE_MAX_ASID);
> > > + goto err_nas;
> > > + }
> > > +
> > > dev->ngroups = config->ngroups ?: 1;
> > > + dev->nas = config->nas ?: 1;
> > > } else {
> > > dev->ngroups = 1;
> > > + dev->nas = 1;
> > > }
> > > for (u32 i = 0; i < dev->ngroups; ++i)
> > > dev->groups[i].dev = dev;
> > > @@ -1977,6 +2091,7 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> > > err_idr:
> > > kfree(dev->name);
> > > err_str:
> > > +err_nas:
> > > vduse_dev_destroy(dev);
> > > err:
> > > return ret;
> > > @@ -2069,7 +2184,6 @@ static int vduse_open(struct inode *inode, struct file *file)
> > > if (!control)
> > > return -ENOMEM;
> > >
> > > - control->api_version = VDUSE_API_VERSION;
> > > file->private_data = control;
> > >
> > > return 0;
> > > @@ -2101,7 +2215,7 @@ static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > >
> > > vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > > &vduse_vdpa_config_ops, &vduse_map_ops,
> > > - dev->ngroups, 1, name, true);
> > > + dev->ngroups, dev->nas, name, true);
> > > if (IS_ERR(vdev))
> > > return PTR_ERR(vdev);
> > >
> > > @@ -2137,11 +2251,23 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > > return ret;
> > >
> > > mutex_lock(&dev->domain_lock);
> > > - if (!dev->domain)
> > > - dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > - dev->bounce_size);
> > > + ret = 0;
> > > +
> > > + /* TODO we could delay the creation of the domain */
> >
> > I don't get what this means?
> >
>
> If the device support all 64 asid but it never uses them we're
> consuming the memory for all the struct vduse_iova_domain, struct
> vduse_umem, etc. We could delay its creating until we start using it,
> but it complicates the code.
>
> > > + for (int i = 0; i < dev->nas; ++i) {
> > > + if (!dev->domain[i])
> > > + dev->domain[i] = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > > + dev->bounce_size);
> > > + if (!dev->domain[i]) {
> > > + ret = -ENOMEM;
> > > + for (int j = 0; j < i; ++j)
> > > + vduse_domain_destroy(dev->domain[j]);
> >
> > Why not move the above error handling to the err label?
> >
>
> That implies I need to signal the index of successfully created
> domains. We only need to handle that in this part of the code. To move
> it is not impossible of course but it complicates the code IMO. I'd be
> happy to move it if requested though.
>
> > > + goto err_domain;
> > > + }
> > > + }
> > > +
> > > mutex_unlock(&dev->domain_lock);
> > > - if (!dev->domain) {
> > > + if (ret == -ENOMEM) {
> > > put_device(&dev->vdev->vdpa.dev);
> > > return -ENOMEM;
> > > }
> > > @@ -2150,13 +2276,22 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
> > > if (ret) {
> > > put_device(&dev->vdev->vdpa.dev);
> > > mutex_lock(&dev->domain_lock);
> > > - vduse_domain_destroy(dev->domain);
> > > - dev->domain = NULL;
> > > + for (int i = 0; i < dev->nas; i++) {
> > > + if (dev->domain[i]) {
> > > + vduse_domain_destroy(dev->domain[i]);
> > > + dev->domain[i] = NULL;
> > > + }
> > > + }
> > > mutex_unlock(&dev->domain_lock);
> > > return ret;
> > > }
> > >
> > > return 0;
> > > +
> > > +err_domain:
> > > + /* TODO do I need to call put_device? */
> >
> > If it comes after the success of vduse_dev_init_vdpa().
> >
>
> Ouch, that was a TODO for me but I forgot to remove it, sorry :).
>
> > > + mutex_unlock(&dev->domain_lock);
> > > + return ret;
> > > }
> > >
> > > static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > index b4b139dc76bb..d300fd5f867f 100644
> > > --- a/include/uapi/linux/vduse.h
> > > +++ b/include/uapi/linux/vduse.h
> > > @@ -47,7 +47,8 @@ struct vduse_dev_config {
> > > __u32 vq_num;
> > > __u32 vq_align;
> > > __u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
> > > - __u32 reserved[12];
> > > + __u32 nas; /* if VDUSE_API_VERSION >= 1 */
> > > + __u32 reserved[11];
> > > __u32 config_size;
> > > __u8 config[];
> > > };
> > > @@ -82,6 +83,17 @@ struct vduse_iotlb_entry {
> > > __u8 perm;
> > > };
> > >
> > > +struct vduse_iotlb_entry_v2 {
> > > + __u64 offset;
> > > + __u64 start;
> > > + __u64 last;
> > > +#define VDUSE_ACCESS_RO 0x1
> > > +#define VDUSE_ACCESS_WO 0x2
> > > +#define VDUSE_ACCESS_RW 0x3
> > > + __u8 perm;
> >
> > We can embed vduse_iotlb_entry here. This will ease the userspace development.
> >
>
> I agree, I'll do it that way for the next version.
>
> > > + __u32 asid;
> > > +};
> > > +
> > > /*
> > > * Find the first IOVA region that overlaps with the range [start, last]
> > > * and return the corresponding file descriptor. Return -EINVAL means the
> > > @@ -172,6 +184,16 @@ struct vduse_vq_group {
> > > __u32 num;
> > > };
> > >
> > > +/**
> > > + * struct vduse_vq_group - virtqueue group
> > > + @ @group: Index of the virtqueue group
> > > + * @asid: Address space ID of the group
> > > + */
> > > +struct vduse_vq_group_asid {
> > > + __u32 group;
> > > + __u32 asid;
> > > +};
> > > +
> > > /**
> > > * struct vduse_vq_info - information of a virtqueue
> > > * @index: virtqueue index
> > > @@ -232,6 +254,7 @@ struct vduse_vq_eventfd {
> > > * @uaddr: start address of userspace memory, it must be aligned to page size
> > > * @iova: start of the IOVA region
> > > * @size: size of the IOVA region
> > > + * @asid: Address space ID of the IOVA region
> > > * @reserved: for future use, needs to be initialized to zero
> > > *
> > > * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> > > @@ -241,7 +264,8 @@ struct vduse_iova_umem {
> > > __u64 uaddr;
> > > __u64 iova;
> > > __u64 size;
> > > - __u64 reserved[3];
> > > + __u32 asid;
> > > + __u32 reserved[5];
> > > };
> > >
> > > /* Register userspace memory for IOVA regions */
> > > @@ -265,7 +289,8 @@ struct vduse_iova_info {
> > > __u64 last;
> > > #define VDUSE_IOVA_CAP_UMEM (1 << 0)
> > > __u64 capability;
> > > - __u64 reserved[3];
> > > + __u32 asid; /* Only if device API version >= 1 */
> > > + __u32 reserved[5];
> > > };
> > >
> > > /*
> > > @@ -289,6 +314,7 @@ enum vduse_req_type {
> > > VDUSE_UPDATE_IOTLB,
> > > VDUSE_GET_VQ_GROUP,
> > > VDUSE_GET_VRING_DESC_GROUP,
> > > + VDUSE_SET_VQ_GROUP_ASID,
> > > };
> > >
> > > /**
> > > @@ -344,6 +370,8 @@ struct vduse_dev_request {
> > > struct vduse_dev_status s;
> > > struct vduse_iova_range iova;
> > > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > + /* Only if vduse api version >= 1 */
> >
> > We can use:
> >
> > /* Only if vduse api version >= 1 */
> > struct vduse_vq_group vq_group;
> > struct vduse_vq_group_asid vq_group_asid;
> >
>
> I'm happy to do it that way, but it is not clear to me that both
> members are only if version >=1. Maybe "The following members are
> valid only if..."?
>
> > > + struct vduse_vq_group_asid vq_group_asid;
> > > __u32 padding[32];
> > > };
> > > };
> > > @@ -367,6 +395,8 @@ struct vduse_dev_response {
> > > union {
> > > struct vduse_vq_state vq_state;
> > > struct vduse_vq_group vq_group; /* Only if vduse api version >= 1 */
> > > + /* Only if vduse api version >= 1 */
> > > + struct vduse_vq_group_asid vq_group_asid;
> > > __u32 padding[32];
> > > };
> > > };
> > > --
> > > 2.50.1
> > >
> >
> > Thanks
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v2 2/7] vduse: add vq group support
2025-08-13 10:11 ` Eugenio Perez Martin
@ 2025-08-14 3:42 ` Jason Wang
0 siblings, 0 replies; 24+ messages in thread
From: Jason Wang @ 2025-08-14 3:42 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Michael S . Tsirkin, Cindy Lu, Yongji Xie, Stefano Garzarella,
virtualization, Laurent Vivier, linux-kernel, Xuan Zhuo,
Maxime Coquelin
On Wed, Aug 13, 2025 at 6:12 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Aug 12, 2025 at 5:01 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 5:52 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Aug 11, 2025 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, Aug 7, 2025 at 7:58 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > This allows sepparate the different virtqueues in groups that shares the
> > > > > same address space. Asking the VDUSE device for the groups of the vq at
> > > > > the beginning as they're needed for the DMA API.
> > > > >
> > > > > Allocating 3 vq groups as net is the device that need the most groups:
> > > > > * Dataplane (guest passthrough)
> > > > > * CVQ
> > > > > * Shadowed vrings.
> > > > >
> > > > > Future versions of the series can include dynamic allocation of the
> > > > > groups array so VDUSE can declare more groups.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > ---
> > > > > v2:
> > > > > * Cache group information in kernel, as we need to provide the vq map
> > > > > tokens properly.
> > > > > * Add descs vq group to optimize SVQ forwarding and support indirect
> > > > > descriptors out of the box.
> > > > > ---
> > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 71 +++++++++++++++++++++++++++++-
> > > > > include/uapi/linux/vduse.h | 19 +++++++-
> > > > > 2 files changed, 88 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > index d858c4389cc1..d1f6d00a9c71 100644
> > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > @@ -46,6 +46,11 @@
> > > > > #define VDUSE_IOVA_SIZE (VDUSE_MAX_BOUNCE_SIZE + 128 * 1024 * 1024)
> > > > > #define VDUSE_MSG_DEFAULT_TIMEOUT 30
> > > > >
> > > > > +/*
> > > > > + * Let's make it 3 for simplicity.
> > > > > + */
> > > > > +#define VDUSE_MAX_VQ_GROUPS 3
> > > >
> > > > I think we can release this to something like 64. Otherwise we might
> > > > bump the version again just to increase the limitation? Or having a
> > > > sysfs entry like bounce_size?
> > > >
> > >
> > > I think it should not be linked to the version, but it is true there
> > > is no way for VDUSE devices to check the maximum VQ groups / ASID that
> > > the kernel supports.
> > >
> > > To handle as bounce_size seems the best option, good point. I'll send
> > > a new version with that!
> > >
> > > > > +
> > > > > #define IRQ_UNBOUND -1
> > > > >
> > > > > struct vduse_virtqueue {
> > > > > @@ -58,6 +63,8 @@ struct vduse_virtqueue {
> > > > > struct vdpa_vq_state state;
> > > > > bool ready;
> > > > > bool kicked;
> > > > > + u32 vq_group;
> > > > > + u32 vq_desc_group;
> > > > > spinlock_t kick_lock;
> > > > > spinlock_t irq_lock;
> > > > > struct eventfd_ctx *kickfd;
> > > > > @@ -114,6 +121,7 @@ struct vduse_dev {
> > > > > u8 status;
> > > > > u32 vq_num;
> > > > > u32 vq_align;
> > > > > + u32 ngroups;
> > > > > struct vduse_umem *umem;
> > > > > struct mutex mem_lock;
> > > > > unsigned int bounce_size;
> > > > > @@ -592,6 +600,20 @@ static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +static u32 vduse_get_vq_group(struct vdpa_device *vdpa, u16 idx)
> > > > > +{
> > > > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > +
> > > > > + return dev->vqs[idx]->vq_group;
> > > > > +}
> > > > > +
> > > > > +static u32 vduse_get_vq_desc_group(struct vdpa_device *vdpa, u16 idx)
> > > > > +{
> > > > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > +
> > > > > + return dev->vqs[idx]->vq_desc_group;
> > > > > +}
> > > > > +
> > > > > static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > > struct vdpa_vq_state *state)
> > > > > {
> > > > > @@ -678,13 +700,48 @@ static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > > > > return dev->status;
> > > > > }
> > > > >
> > > > > +static int vduse_fill_vq_groups(struct vduse_dev *dev)
> > > > > +{
> > > > > + if (dev->api_version < VDUSE_API_VERSION_1)
> > > > > + return 0;
> > > > > +
> > > > > + for (int i = 0; i < dev->vdev->vdpa.nvqs; ++i) {
> > > > > + struct vduse_dev_msg msg = { 0 };
> > > > > + int ret;
> > > > > +
> > > > > + msg.req.type = VDUSE_GET_VQ_GROUP;
> > > > > + msg.req.vq_group.index = i;
> > > > > + ret = vduse_dev_msg_sync(dev, &msg);
> > > >
> > > > I fail to understand why the default group mapping is not done during
> > > > device creation.
> > > >
> > >
> > > Because it changes depending on the features.
> > >
> > > If a new device has 5 virtqueues and the device wants to isolate the
> > > CVQ, the CVQ position depends on the features that the guest's acks:
> > > * If MQ is acked the isolated vq is #5
> > > * If MQ is not acked the isolated vq is #3.
> >
> > I see we are still damaged by the design of the cvq index. But this is
> > just a static branch not a dynamic one. If we can find ways to make it
> > static it would be better.
> >
> > >
> > > > > + if (ret)
> > > > > + return ret;
> > > > > +
> > > > > + dev->vqs[i]->vq_group = msg.resp.vq_group.num;
> > > > > +
> > > > > + msg.req.type = VDUSE_GET_VRING_DESC_GROUP;
> > > > > + ret = vduse_dev_msg_sync(dev, &msg);
> > > > > + if (ret)
> > > > > + return ret;
> > > > > +
> > > > > + dev->vqs[i]->vq_desc_group = msg.resp.vq_group.num;
> > > > > + }
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > > > > {
> > > > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > + u8 previous_status = dev->status;
> > > > >
> > > > > if (vduse_dev_set_status(dev, status))
> > > > > return;
> > > > >
> > > > > + if ((dev->status ^ previous_status) &
> > > > > + BIT_ULL(VIRTIO_CONFIG_S_FEATURES_OK) &&
> > > > > + status & (1ULL << VIRTIO_CONFIG_S_FEATURES_OK))
> > > > > + if (vduse_fill_vq_groups(dev))
> > > >
> > > > Can we merge the two messages into a single one? Or can we use a
> > > > shared memory for storing such mapping?
> > > >
> > > > For example, if we have 256 queues it would be very slow.
> > > >
> > >
> > > To merge it in the same message is good to me, sure.
> >
> > We can start from this if we can't find a way to provision vq to group
> > mapping during device creation.
> >
>
> Note that I don't think it is worth implementing these in this series,
> but to add them on top in another one. Because I don't think we will
> find devices with a lot of virtqueues for now. But here are some ideas
> to mitigate the startup time cost:
>
> 1) Support more than one virtqueue in the same vduse request / reply
> Something like:
> vduse_dev_response{
> u32 req_id;
> u32 result;
> union {
> ...
> struct vduse_vq_group {
> u32 num_vqs_requested;
> struct {
> u32 vq_idx;
> u32 vq_group;
> } vqs[15];
> }
> u32 padding[32];
> }
> }
>
> Choosing 15 to fill the current size of vduse_dev_response struct.
>
> 2) Pointer chasing in the struct written
>
> Same as previous, but the vqs struct is actually a pointer in
> userspace. This way it can be arbitrarily big.
>
> vduse_dev_response{
> u32 req_id;
> u32 result;
> union {
> ...
> struct vduse_vq_group {
> u32 num_vqs_requested;
> struct {
> u32 vq_idx;
> u32 vq_group;
> } *vqs;
> }
> u32 padding[32];
> }
> }
>
> I cannot locate any use of this in write() data, but it is more or
> less common in ioctl.
>
> 3) Allow VQ_GROUP_BATCH_BEGIN and _END, similar to how memory map
> works in vhost_vdpa. As many vq_group response as needed in between.
>
> +) Assume that any vq not mentioned in the reply is vq group 0.
Or make VDUSE io_uring compatible. (Anyhow it looks like another topic).
I'm fine if Michael and others are fine with starting with merging the
message. It's worth trying to reduce the userspace/kernel interaction
as much as possible.
Thanks
>
>
> > > To make it a
> > > table in shm seems more complicated, unless we accept a void * in the
> > > reply and VDUSE uses copy_from_user. If that is ok here, then sure.
> > >
> >
> > This looks tricky indeed.
> >
> > Thanks
> >
>
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2025-08-14 3:42 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07 11:57 [RFC v2 0/7] Add multiple address spaces support to VDUSE Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 1/7] vduse: add v1 API definition Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 2/7] vduse: add vq group support Eugenio Pérez
2025-08-08 7:46 ` Maxime Coquelin
2025-08-11 3:09 ` Jason Wang
2025-08-11 9:51 ` Eugenio Perez Martin
2025-08-12 3:01 ` Jason Wang
2025-08-13 10:11 ` Eugenio Perez Martin
2025-08-14 3:42 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 3/7] vdpa: change get_vq_map_token type to void *(*cb) Eugenio Pérez
2025-08-07 11:57 ` [RFC v2 4/7] vduse: return internal vq group struct as map token Eugenio Pérez
2025-08-11 3:11 ` Jason Wang
2025-08-11 11:03 ` Eugenio Perez Martin
2025-08-12 3:02 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 5/7] vduse: add vq group asid support Eugenio Pérez
2025-08-08 7:45 ` Maxime Coquelin
2025-08-10 9:45 ` Eugenio Perez Martin
2025-08-11 3:21 ` Jason Wang
2025-08-11 11:19 ` Eugenio Perez Martin
2025-08-12 3:05 ` Jason Wang
2025-08-13 11:02 ` Eugenio Perez Martin
2025-08-07 11:57 ` [RFC v2 6/7] vduse: send update_iotlb_v2 message Eugenio Pérez
2025-08-11 3:24 ` Jason Wang
2025-08-07 11:57 ` [RFC v2 7/7] vduse: bump version number Eugenio Pérez
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).