* [PATCH V2 0/7] vdpa live update
@ 2024-07-12 13:18 Steve Sistare
2024-07-12 13:18 ` [PATCH V2 1/7] vhost-vdpa: count pinned memory Steve Sistare
` (8 more replies)
0 siblings, 9 replies; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state. Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.
Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.
The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.
The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.
The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.
The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required. Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.
Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available. The vdpa device
descriptor, fd, remains open across the exec.
ioctl(fd, VHOST_VDPA_SUSPEND)
ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
exec
ioctl(fd, VHOST_NEW_OWNER)
issue ioctls to re-create vrings
if VHOST_BACKEND_F_IOTLB_REMAP
foreach dma mapping
write(fd, {VHOST_IOTLB_REMAP, new_addr})
ioctl(fd, VHOST_VDPA_SET_STATUS,
ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
as that would would unpin and repin physical pages, which would cost multiple
seconds for large memories.
This is implemented in QEMU by the patch series "Live update: vdpa"
https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly)
The QEMU implementation leverages the live migration code path, but after
CPR exec's new QEMU:
- vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
- vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
VHOST_IOTLB_UPDATE
Changes in V2:
- clean up handling of set_map vs dma_map vs platform iommu in remap
- augment and clarify commit messages and comments
Steve Sistare (7):
vhost-vdpa: count pinned memory
vhost-vdpa: pass mm to bind
vhost-vdpa: VHOST_NEW_OWNER
vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
vhost-vdpa: VHOST_IOTLB_REMAP
vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
vdpa/mlx5: new owner capability
drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +-
drivers/vhost/vdpa.c | 125 ++++++++++++++++++++++++++++--
drivers/vhost/vhost.c | 15 ++++
drivers/vhost/vhost.h | 1 +
include/uapi/linux/vhost.h | 10 +++
include/uapi/linux/vhost_types.h | 15 +++-
6 files changed, 161 insertions(+), 8 deletions(-)
--
2.39.3
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH V2 1/7] vhost-vdpa: count pinned memory
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-12 13:18 ` [PATCH V2 2/7] vhost-vdpa: pass mm to bind Steve Sistare
` (7 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
Remember the count of pinned memory for the device.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vhost/vdpa.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 63a53680a85c..963f3704bc39 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -60,6 +60,7 @@ struct vhost_vdpa {
struct vdpa_iova_range range;
u32 batch_asid;
bool suspended;
+ long pinned_vm;
};
static DEFINE_IDA(vhost_vdpa_ida);
@@ -926,6 +927,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
unpin_user_page(page);
}
atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
+ v->pinned_vm -= PFN_DOWN(map->size);
vhost_vdpa_general_unmap(v, map, asid);
vhost_iotlb_map_free(iotlb, map);
}
@@ -1009,9 +1011,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
return r;
}
- if (!vdpa->use_va)
+ if (!vdpa->use_va) {
atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
-
+ v->pinned_vm += PFN_DOWN(size);
+ }
return 0;
}
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH V2 2/7] vhost-vdpa: pass mm to bind
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
2024-07-12 13:18 ` [PATCH V2 1/7] vhost-vdpa: count pinned memory Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-12 13:18 ` [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
` (6 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
Pass the target mm to vhost_vdpa_bind_mm. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vhost/vdpa.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 963f3704bc39..b49e5831b3f0 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -251,7 +251,7 @@ static int vhost_vdpa_reset(struct vhost_vdpa *v)
return _compat_vdpa_reset(v);
}
-static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
+static long vhost_vdpa_bind_mm(struct vhost_vdpa *v, struct mm_struct *mm)
{
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
@@ -259,7 +259,7 @@ static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
if (!vdpa->use_va || !ops->bind_mm)
return 0;
- return ops->bind_mm(vdpa, v->vdev.mm);
+ return ops->bind_mm(vdpa, mm);
}
static void vhost_vdpa_unbind_mm(struct vhost_vdpa *v)
@@ -888,7 +888,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
switch (cmd) {
case VHOST_SET_OWNER:
- r = vhost_vdpa_bind_mm(v);
+ r = vhost_vdpa_bind_mm(v, v->vdev.mm);
if (r)
vhost_dev_reset_owner(d, NULL);
break;
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
2024-07-12 13:18 ` [PATCH V2 1/7] vhost-vdpa: count pinned memory Steve Sistare
2024-07-12 13:18 ` [PATCH V2 2/7] vhost-vdpa: pass mm to bind Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-15 2:26 ` Jason Wang
2024-07-15 9:07 ` Michael S. Tsirkin
2024-07-12 13:18 ` [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
` (5 subsequent siblings)
8 siblings, 2 replies; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
Add an ioctl to transfer file descriptor ownership and pinned memory
accounting from one process to another.
This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
as that would unpin all physical pages, requiring them to be repinned in
the new process. That would cost multiple seconds for large memories, and
be incurred during a virtual machine's pause time during live update.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
drivers/vhost/vhost.c | 15 ++++++++++++++
drivers/vhost/vhost.h | 1 +
include/uapi/linux/vhost.h | 10 ++++++++++
4 files changed, 67 insertions(+)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b49e5831b3f0..5cf55ca4ec02 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
return ret;
}
+static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
+{
+ int r;
+ struct vhost_dev *vdev = &v->vdev;
+ struct mm_struct *mm_old = vdev->mm;
+ struct mm_struct *mm_new = current->mm;
+ long pinned_vm = v->pinned_vm;
+ unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
+
+ if (!mm_old)
+ return -EINVAL;
+ mmgrab(mm_old);
+
+ if (!v->vdpa->use_va &&
+ pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
+ r = -ENOMEM;
+ goto out;
+ }
+ r = vhost_vdpa_bind_mm(v, mm_new);
+ if (r)
+ goto out;
+
+ r = vhost_dev_new_owner(vdev);
+ if (r) {
+ vhost_vdpa_bind_mm(v, mm_old);
+ goto out;
+ }
+
+ if (!v->vdpa->use_va) {
+ atomic64_sub(pinned_vm, &mm_old->pinned_vm);
+ atomic64_add(pinned_vm, &mm_new->pinned_vm);
+ }
+
+out:
+ mmdrop(mm_old);
+ return r;
+}
+
static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
void __user *argp)
{
@@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
case VHOST_VDPA_RESUME:
r = vhost_vdpa_resume(v);
break;
+ case VHOST_NEW_OWNER:
+ r = vhost_vdpa_new_owner(v);
+ break;
default:
r = vhost_dev_ioctl(&v->vdev, cmd, argp);
if (r == -ENOIOCTLCMD)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b60955682474..ab40ae50552f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
+/* Caller should have device mutex */
+long vhost_dev_new_owner(struct vhost_dev *dev)
+{
+ if (dev->mm == current->mm)
+ return -EBUSY;
+
+ if (!vhost_dev_has_owner(dev))
+ return -EINVAL;
+
+ vhost_detach_mm(dev);
+ vhost_attach_mm(dev);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_dev_new_owner);
+
static struct vhost_iotlb *iotlb_alloc(void)
{
return vhost_iotlb_alloc(max_iotlb_entries,
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index bb75a292d50c..8b2018bb02b1 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -187,6 +187,7 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
int (*msg_handler)(struct vhost_dev *dev, u32 asid,
struct vhost_iotlb_msg *msg));
long vhost_dev_set_owner(struct vhost_dev *dev);
+long vhost_dev_new_owner(struct vhost_dev *dev);
bool vhost_dev_has_owner(struct vhost_dev *dev);
long vhost_dev_check_owner(struct vhost_dev *);
struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index b95dd84eef2d..543d0e3434c3 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -123,6 +123,16 @@
#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
#define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64)
+/* Set current process as the new owner of this file descriptor. The fd must
+ * already be owned, via a prior call to VHOST_SET_OWNER. The pinned memory
+ * count is transferred from the previous to the new owner.
+ * Errors:
+ * EINVAL: not owned
+ * EBUSY: caller is already the owner
+ * ENOMEM: RLIMIT_MEMLOCK exceeded
+ */
+#define VHOST_NEW_OWNER _IO(VHOST_VIRTIO, 0x27)
+
/* VHOST_NET specific defines */
/* Attach virtio net ring to a raw socket, or tap device.
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
` (2 preceding siblings ...)
2024-07-12 13:18 ` [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-15 2:31 ` Jason Wang
2024-07-12 13:18 ` [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
` (4 subsequent siblings)
8 siblings, 1 reply; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
Add the VHOST_BACKEND_F_NEW_OWNER backend capability, which indicates that
VHOST_NEW_OWNER is supported.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vhost/vdpa.c | 7 ++++++-
include/uapi/linux/vhost_types.h | 2 ++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 5cf55ca4ec02..4396fe1a90c4 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -640,6 +640,10 @@ static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
struct mm_struct *mm_new = current->mm;
long pinned_vm = v->pinned_vm;
unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
+ u64 features = vhost_vdpa_get_backend_features(v);
+
+ if (!(features & BIT_ULL(VHOST_BACKEND_F_NEW_OWNER)))
+ return -EOPNOTSUPP;
if (!mm_old)
return -EINVAL;
@@ -821,7 +825,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
BIT_ULL(VHOST_BACKEND_F_RESUME) |
- BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
+ BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
+ BIT_ULL(VHOST_BACKEND_F_NEW_OWNER)))
return -EOPNOTSUPP;
if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
!vhost_vdpa_can_suspend(v))
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index d7656908f730..9177843951e9 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -192,5 +192,7 @@ struct vhost_vdpa_iova_range {
#define VHOST_BACKEND_F_DESC_ASID 0x7
/* IOTLB don't flush memory mapping across device reset */
#define VHOST_BACKEND_F_IOTLB_PERSIST 0x8
+/* Supports VHOST_NEW_OWNER */
+#define VHOST_BACKEND_F_NEW_OWNER 0x9
#endif
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
` (3 preceding siblings ...)
2024-07-12 13:18 ` [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-15 2:34 ` Jason Wang
2024-07-12 13:18 ` [PATCH V2 6/7] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP Steve Sistare
` (3 subsequent siblings)
8 siblings, 1 reply; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
When device ownership is passed to a new process via VHOST_NEW_OWNER,
some devices need to know the new userland addresses of the dma mappings.
Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
of a mapping. The new uaddr must address the same memory object as
originally mapped.
The user must suspend the device before the old address is invalidated,
and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
requirement is not enforced by the API.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vhost/vdpa.c | 58 ++++++++++++++++++++++++++++++++
include/uapi/linux/vhost_types.h | 11 +++++-
2 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 4396fe1a90c4..51f71c45c4a9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1257,6 +1257,61 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
}
+static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
+ struct vhost_iotlb *iotlb,
+ struct vhost_iotlb_msg *msg)
+{
+ struct vdpa_device *vdpa = v->vdpa;
+ const struct vdpa_config_ops *ops = vdpa->config;
+ u32 asid = iotlb_to_asid(iotlb);
+ u64 start = msg->iova;
+ u64 last = start + msg->size - 1;
+ struct vhost_iotlb_map *map;
+ int r = 0;
+
+ if (msg->perm || !msg->size)
+ return -EINVAL;
+
+ map = vhost_iotlb_itree_first(iotlb, start, last);
+ if (!map)
+ return -ENOENT;
+
+ if (map->start != start || map->last != last)
+ return -EINVAL;
+
+ /*
+ * The current implementation does not support the platform iommu
+ * with use_va. And if !use_va, remap is not necessary.
+ */
+ if (!ops->set_map && !ops->dma_map)
+ return -EINVAL;
+
+ /*
+ * The current implementation supports set_map but not dma_map.
+ */
+ if (!ops->set_map)
+ return -EINVAL;
+
+ /*
+ * Do not verify that the new size@uaddr points to the same physical
+ * pages as the old size@uaddr, because that would take time O(npages),
+ * and would increase guest down time during live update. If the app
+ * is buggy and they differ, then the app may corrupt its own memory,
+ * but no one else's.
+ */
+
+ /*
+ * Batch will finish later in BATCH_END by calling set_map for the new
+ * addresses collected here. Non-batch must do it now.
+ */
+ if (!v->in_batch)
+ r = ops->set_map(vdpa, asid, iotlb);
+ if (!r)
+ map->addr = msg->uaddr;
+
+ return r;
+}
+
static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
struct vhost_iotlb *iotlb,
struct vhost_iotlb_msg *msg)
@@ -1336,6 +1391,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
ops->set_map(vdpa, asid, iotlb);
v->in_batch = false;
break;
+ case VHOST_IOTLB_REMAP:
+ r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
+ break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 9177843951e9..35908315ff55 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
/*
* VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
* multiple mappings in one go: beginning with
- * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
+ * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
* VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
* When one of these two values is used as the message type, the rest
* of the fields in the message are ignored. There's no guarantee that
@@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
*/
#define VHOST_IOTLB_BATCH_BEGIN 5
#define VHOST_IOTLB_BATCH_END 6
+
+/*
+ * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
+ * The new uaddr must address the same memory object as originally mapped.
+ * Failure to do so will result in user memory corruption and/or device
+ * misbehavior. iova and size must match the arguments used to create the
+ * an existing mapping. Protection is not changed, and perm must be 0.
+ */
+#define VHOST_IOTLB_REMAP 7
__u8 type;
};
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH V2 6/7] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
` (4 preceding siblings ...)
2024-07-12 13:18 ` [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-12 13:18 ` [PATCH V2 7/7] vdpa/mlx5: new owner capability Steve Sistare
` (2 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
Add the VHOST_BACKEND_F_IOTLB_REMAP backend capability, which indicates
that VHOST_IOTLB_REMAP is supported.
If VHOST_BACKEND_F_IOTLB_REMAP is advertised, then the user must call
VHOST_IOTLB_REMAP after ownership of a device is transferred to a new
process via VHOST_NEW_OWNER. Disabling the feature during negotiation
does not negate this requirement.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vhost/vdpa.c | 8 +++++++-
include/uapi/linux/vhost_types.h | 2 ++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 51f71c45c4a9..1ec1f1382fd7 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -826,7 +826,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
BIT_ULL(VHOST_BACKEND_F_RESUME) |
BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
- BIT_ULL(VHOST_BACKEND_F_NEW_OWNER)))
+ BIT_ULL(VHOST_BACKEND_F_NEW_OWNER) |
+ BIT_ULL(VHOST_BACKEND_F_IOTLB_REMAP)))
return -EOPNOTSUPP;
if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
!vhost_vdpa_can_suspend(v))
@@ -1267,11 +1268,16 @@ static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
u64 start = msg->iova;
u64 last = start + msg->size - 1;
struct vhost_iotlb_map *map;
+ u64 features;
int r = 0;
if (msg->perm || !msg->size)
return -EINVAL;
+ features = ops->get_backend_features(vdpa);
+ if (!(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_REMAP)))
+ return -EOPNOTSUPP;
+
map = vhost_iotlb_itree_first(iotlb, start, last);
if (!map)
return -ENOENT;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 35908315ff55..7e79e9bd0f7b 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -203,5 +203,7 @@ struct vhost_vdpa_iova_range {
#define VHOST_BACKEND_F_IOTLB_PERSIST 0x8
/* Supports VHOST_NEW_OWNER */
#define VHOST_BACKEND_F_NEW_OWNER 0x9
+/* Supports VHOST_IOTLB_REMAP */
+#define VHOST_BACKEND_F_IOTLB_REMAP 0xa
#endif
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH V2 7/7] vdpa/mlx5: new owner capability
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
` (5 preceding siblings ...)
2024-07-12 13:18 ` [PATCH V2 6/7] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP Steve Sistare
@ 2024-07-12 13:18 ` Steve Sistare
2024-07-12 14:06 ` [PATCH V2 0/7] vdpa live update Steven Sistare
2024-07-15 2:14 ` Jason Wang
8 siblings, 0 replies; 35+ messages in thread
From: Steve Sistare @ 2024-07-12 13:18 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Steve Sistare
The mlx5 vdpa device supports ownership transfer to a new process, so
advertise VHOST_BACKEND_F_NEW_OWNER. User virtual addresses are not
used after they are initially translated to physical, so VHOST_IOTLB_REMAP
is not required, hence VHOST_BACKEND_F_IOTLB_REMAP is not advertised.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index ecfc16151d61..b8b16dc499da 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2698,7 +2698,8 @@ static void unregister_link_notifier(struct mlx5_vdpa_net *ndev)
static u64 mlx5_vdpa_get_backend_features(const struct vdpa_device *vdpa)
{
- return BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK);
+ return BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
+ BIT_ULL(VHOST_BACKEND_F_NEW_OWNER);
}
static int mlx5_vdpa_set_driver_features(struct vdpa_device *vdev, u64 features)
--
2.39.3
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
` (6 preceding siblings ...)
2024-07-12 13:18 ` [PATCH V2 7/7] vdpa/mlx5: new owner capability Steve Sistare
@ 2024-07-12 14:06 ` Steven Sistare
2024-07-15 2:14 ` Jason Wang
8 siblings, 0 replies; 35+ messages in thread
From: Steven Sistare @ 2024-07-12 14:06 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea
On 7/12/2024 9:18 AM, Steve Sistare wrote:
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state. Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as QEMU.
> The device must be suspended during the update, but its DMA mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required. Some devices do not
> require it, because the userland address of each DMA mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet widely available. The vdpa device
> descriptor, fd, remains open across the exec.
>
> ioctl(fd, VHOST_VDPA_SUSPEND)
> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> exec
>
> ioctl(fd, VHOST_NEW_OWNER)
>
> issue ioctls to re-create vrings
>
> if VHOST_BACKEND_F_IOTLB_REMAP
> foreach dma mapping
> write(fd, {VHOST_IOTLB_REMAP, new_addr})
>
> ioctl(fd, VHOST_VDPA_SET_STATUS,
> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>
> This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
> as that would would unpin and repin physical pages, which would cost multiple
> seconds for large memories.
>
> This is implemented in QEMU by the patch series "Live update: vdpa"
> https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly)
https://lore.kernel.org/qemu-devel/1720792931-456433-3-git-send-email-steven.sistare@oracle.com
> The QEMU implementation leverages the live migration code path, but after
> CPR exec's new QEMU:
> - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
> - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
> VHOST_IOTLB_UPDATE
>
> Changes in V2:
> - clean up handling of set_map vs dma_map vs platform iommu in remap
> - augment and clarify commit messages and comments
>
> Steve Sistare (7):
> vhost-vdpa: count pinned memory
> vhost-vdpa: pass mm to bind
> vhost-vdpa: VHOST_NEW_OWNER
> vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
> vhost-vdpa: VHOST_IOTLB_REMAP
> vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
> vdpa/mlx5: new owner capability
>
> drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +-
> drivers/vhost/vdpa.c | 125 ++++++++++++++++++++++++++++--
> drivers/vhost/vhost.c | 15 ++++
> drivers/vhost/vhost.h | 1 +
> include/uapi/linux/vhost.h | 10 +++
> include/uapi/linux/vhost_types.h | 15 +++-
> 6 files changed, 161 insertions(+), 8 deletions(-)
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
` (7 preceding siblings ...)
2024-07-12 14:06 ` [PATCH V2 0/7] vdpa live update Steven Sistare
@ 2024-07-15 2:14 ` Jason Wang
2024-07-15 14:28 ` Steven Sistare
8 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-15 2:14 UTC (permalink / raw)
To: Steve Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state. Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as QEMU.
> The device must be suspended during the update, but its DMA mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required. Some devices do not
> require it, because the userland address of each DMA mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet widely available. The vdpa device
> descriptor, fd, remains open across the exec.
>
> ioctl(fd, VHOST_VDPA_SUSPEND)
> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
I don't understand why we need a reset after suspend, it looks to me
the previous suspend became meaningless.
> exec
>
> ioctl(fd, VHOST_NEW_OWNER)
>
> issue ioctls to re-create vrings
>
> if VHOST_BACKEND_F_IOTLB_REMAP
So the idea is for a device that is using a virtual address, it
doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
> foreach dma mapping
> write(fd, {VHOST_IOTLB_REMAP, new_addr})
>
> ioctl(fd, VHOST_VDPA_SET_STATUS,
> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
From API level, this seems to be asymmetric as we have suspending but
not resuming?
>
> This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
> as that would would unpin and repin physical pages, which would cost multiple
> seconds for large memories.
>
> This is implemented in QEMU by the patch series "Live update: vdpa"
> https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly)
>
> The QEMU implementation leverages the live migration code path, but after
> CPR exec's new QEMU:
> - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
> - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
> VHOST_IOTLB_UPDATE
>
> Changes in V2:
> - clean up handling of set_map vs dma_map vs platform iommu in remap
> - augment and clarify commit messages and comments
>
> Steve Sistare (7):
> vhost-vdpa: count pinned memory
> vhost-vdpa: pass mm to bind
> vhost-vdpa: VHOST_NEW_OWNER
> vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
> vhost-vdpa: VHOST_IOTLB_REMAP
> vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
> vdpa/mlx5: new owner capability
>
> drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +-
> drivers/vhost/vdpa.c | 125 ++++++++++++++++++++++++++++--
> drivers/vhost/vhost.c | 15 ++++
> drivers/vhost/vhost.h | 1 +
> include/uapi/linux/vhost.h | 10 +++
> include/uapi/linux/vhost_types.h | 15 +++-
> 6 files changed, 161 insertions(+), 8 deletions(-)
>
> --
> 2.39.3
>
Thanks
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-12 13:18 ` [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
@ 2024-07-15 2:26 ` Jason Wang
2024-07-15 9:06 ` Michael S. Tsirkin
2024-07-15 14:27 ` Steven Sistare
2024-07-15 9:07 ` Michael S. Tsirkin
1 sibling, 2 replies; 35+ messages in thread
From: Jason Wang @ 2024-07-15 2:26 UTC (permalink / raw)
To: Steve Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Add an ioctl to transfer file descriptor ownership and pinned memory
> accounting from one process to another.
>
> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
> as that would unpin all physical pages, requiring them to be repinned in
> the new process. That would cost multiple seconds for large memories, and
> be incurred during a virtual machine's pause time during live update.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
> drivers/vhost/vhost.c | 15 ++++++++++++++
> drivers/vhost/vhost.h | 1 +
> include/uapi/linux/vhost.h | 10 ++++++++++
> 4 files changed, 67 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index b49e5831b3f0..5cf55ca4ec02 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> return ret;
> }
>
> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
> +{
> + int r;
> + struct vhost_dev *vdev = &v->vdev;
> + struct mm_struct *mm_old = vdev->mm;
> + struct mm_struct *mm_new = current->mm;
> + long pinned_vm = v->pinned_vm;
> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> +
> + if (!mm_old)
> + return -EINVAL;
> + mmgrab(mm_old);
> +
> + if (!v->vdpa->use_va &&
> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
> + r = -ENOMEM;
> + goto out;
> + }
So this seems to allow an arbitrary process to execute this. Seems to be unsafe.
I wonder if we need to add some checks here, maybe PID or other stuff
to only allow the owner process to do this.
> + r = vhost_vdpa_bind_mm(v, mm_new);
> + if (r)
> + goto out;
> +
> + r = vhost_dev_new_owner(vdev);
> + if (r) {
> + vhost_vdpa_bind_mm(v, mm_old);
> + goto out;
> + }
> +
> + if (!v->vdpa->use_va) {
> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
> + }
> +
> +out:
> + mmdrop(mm_old);
> + return r;
> +}
> +
> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
> void __user *argp)
> {
> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> case VHOST_VDPA_RESUME:
> r = vhost_vdpa_resume(v);
> break;
> + case VHOST_NEW_OWNER:
> + r = vhost_vdpa_new_owner(v);
> + break;
> default:
> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
> if (r == -ENOIOCTLCMD)
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b60955682474..ab40ae50552f 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>
> +/* Caller should have device mutex */
> +long vhost_dev_new_owner(struct vhost_dev *dev)
> +{
> + if (dev->mm == current->mm)
> + return -EBUSY;
> +
> + if (!vhost_dev_has_owner(dev))
> + return -EINVAL;
> +
> + vhost_detach_mm(dev);
> + vhost_attach_mm(dev);
This seems to do nothing unless I miss something.
Thanks
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
2024-07-12 13:18 ` [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
@ 2024-07-15 2:31 ` Jason Wang
2024-07-15 14:27 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-15 2:31 UTC (permalink / raw)
To: Steve Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Add the VHOST_BACKEND_F_NEW_OWNER backend capability, which indicates that
> VHOST_NEW_OWNER is supported.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Doesn't harm but should this be part of the previous patch?
Thanks
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-12 13:18 ` [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
@ 2024-07-15 2:34 ` Jason Wang
2024-07-15 14:28 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-15 2:34 UTC (permalink / raw)
To: Steve Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> some devices need to know the new userland addresses of the dma mappings.
> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> of a mapping. The new uaddr must address the same memory object as
> originally mapped.
>
> The user must suspend the device before the old address is invalidated,
> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> requirement is not enforced by the API.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/vhost/vdpa.c | 58 ++++++++++++++++++++++++++++++++
> include/uapi/linux/vhost_types.h | 11 +++++-
> 2 files changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 4396fe1a90c4..51f71c45c4a9 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1257,6 +1257,61 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>
> }
>
> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> + struct vhost_iotlb *iotlb,
> + struct vhost_iotlb_msg *msg)
> +{
> + struct vdpa_device *vdpa = v->vdpa;
> + const struct vdpa_config_ops *ops = vdpa->config;
> + u32 asid = iotlb_to_asid(iotlb);
> + u64 start = msg->iova;
> + u64 last = start + msg->size - 1;
> + struct vhost_iotlb_map *map;
> + int r = 0;
> +
> + if (msg->perm || !msg->size)
> + return -EINVAL;
> +
> + map = vhost_iotlb_itree_first(iotlb, start, last);
> + if (!map)
> + return -ENOENT;
> +
> + if (map->start != start || map->last != last)
> + return -EINVAL;
I had a question here, if a buggy user space that:
1) forget to update some of the mappings
2) address is wrong
3) other cases.
Does this mean the device can DMA to the previous owner? If yes, does
this have security implications?
> +
> + /*
> + * The current implementation does not support the platform iommu
> + * with use_va. And if !use_va, remap is not necessary.
> + */
> + if (!ops->set_map && !ops->dma_map)
> + return -EINVAL;
> +
> + /*
> + * The current implementation supports set_map but not dma_map.
> + */
> + if (!ops->set_map)
> + return -EINVAL;
> +
> + /*
> + * Do not verify that the new size@uaddr points to the same physical
> + * pages as the old size@uaddr, because that would take time O(npages),
> + * and would increase guest down time during live update. If the app
> + * is buggy and they differ, then the app may corrupt its own memory,
> + * but no one else's.
> + */
> +
> + /*
> + * Batch will finish later in BATCH_END by calling set_map for the new
> + * addresses collected here. Non-batch must do it now.
> + */
> + if (!v->in_batch)
> + r = ops->set_map(vdpa, asid, iotlb);
> + if (!r)
> + map->addr = msg->uaddr;
> +
> + return r;
> +}
> +
> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> struct vhost_iotlb *iotlb,
> struct vhost_iotlb_msg *msg)
> @@ -1336,6 +1391,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> ops->set_map(vdpa, asid, iotlb);
> v->in_batch = false;
> break;
> + case VHOST_IOTLB_REMAP:
> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> + break;
> default:
> r = -EINVAL;
> break;
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index 9177843951e9..35908315ff55 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> /*
> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> * multiple mappings in one go: beginning with
> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> * When one of these two values is used as the message type, the rest
> * of the fields in the message are ignored. There's no guarantee that
> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> */
> #define VHOST_IOTLB_BATCH_BEGIN 5
> #define VHOST_IOTLB_BATCH_END 6
> +
> +/*
> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> + * The new uaddr must address the same memory object as originally mapped.
> + * Failure to do so will result in user memory corruption and/or device
> + * misbehavior. iova and size must match the arguments used to create the
> + * an existing mapping. Protection is not changed, and perm must be 0.
> + */
> +#define VHOST_IOTLB_REMAP 7
> __u8 type;
> };
Thanks
>
> --
> 2.39.3
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-15 2:26 ` Jason Wang
@ 2024-07-15 9:06 ` Michael S. Tsirkin
2024-07-15 14:27 ` Steven Sistare
1 sibling, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 9:06 UTC (permalink / raw)
To: Jason Wang
Cc: Steve Sistare, virtualization, linux-kernel, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Mon, Jul 15, 2024 at 10:26:13AM +0800, Jason Wang wrote:
> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >
> > Add an ioctl to transfer file descriptor ownership and pinned memory
> > accounting from one process to another.
> >
> > This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
> > as that would unpin all physical pages, requiring them to be repinned in
> > the new process. That would cost multiple seconds for large memories, and
> > be incurred during a virtual machine's pause time during live update.
> >
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> > drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
> > drivers/vhost/vhost.c | 15 ++++++++++++++
> > drivers/vhost/vhost.h | 1 +
> > include/uapi/linux/vhost.h | 10 ++++++++++
> > 4 files changed, 67 insertions(+)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index b49e5831b3f0..5cf55ca4ec02 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> > return ret;
> > }
> >
> > +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
> > +{
> > + int r;
> > + struct vhost_dev *vdev = &v->vdev;
> > + struct mm_struct *mm_old = vdev->mm;
> > + struct mm_struct *mm_new = current->mm;
> > + long pinned_vm = v->pinned_vm;
> > + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > +
> > + if (!mm_old)
> > + return -EINVAL;
> > + mmgrab(mm_old);
> > +
> > + if (!v->vdpa->use_va &&
> > + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
> > + r = -ENOMEM;
> > + goto out;
> > + }
>
> So this seems to allow an arbitrary process to execute this. Seems to be unsafe.
>
> I wonder if we need to add some checks here, maybe PID or other stuff
> to only allow the owner process to do this.
Not pid pls.
> > + r = vhost_vdpa_bind_mm(v, mm_new);
> > + if (r)
> > + goto out;
> > +
> > + r = vhost_dev_new_owner(vdev);
> > + if (r) {
> > + vhost_vdpa_bind_mm(v, mm_old);
> > + goto out;
> > + }
> > +
> > + if (!v->vdpa->use_va) {
> > + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
> > + atomic64_add(pinned_vm, &mm_new->pinned_vm);
> > + }
> > +
> > +out:
> > + mmdrop(mm_old);
> > + return r;
> > +}
> > +
> > static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
> > void __user *argp)
> > {
> > @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> > case VHOST_VDPA_RESUME:
> > r = vhost_vdpa_resume(v);
> > break;
> > + case VHOST_NEW_OWNER:
> > + r = vhost_vdpa_new_owner(v);
> > + break;
> > default:
> > r = vhost_dev_ioctl(&v->vdev, cmd, argp);
> > if (r == -ENOIOCTLCMD)
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index b60955682474..ab40ae50552f 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > }
> > EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
> >
> > +/* Caller should have device mutex */
> > +long vhost_dev_new_owner(struct vhost_dev *dev)
> > +{
> > + if (dev->mm == current->mm)
> > + return -EBUSY;
> > +
> > + if (!vhost_dev_has_owner(dev))
> > + return -EINVAL;
> > +
> > + vhost_detach_mm(dev);
> > + vhost_attach_mm(dev);
>
> This seems to do nothing unless I miss something.
>
> Thanks
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-12 13:18 ` [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
2024-07-15 2:26 ` Jason Wang
@ 2024-07-15 9:07 ` Michael S. Tsirkin
2024-07-15 14:29 ` Steven Sistare
1 sibling, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 9:07 UTC (permalink / raw)
To: Steve Sistare
Cc: virtualization, linux-kernel, Jason Wang, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Fri, Jul 12, 2024 at 06:18:49AM -0700, Steve Sistare wrote:
> Add an ioctl to transfer file descriptor ownership and pinned memory
> accounting from one process to another.
>
> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
> as that would unpin all physical pages, requiring them to be repinned in
> the new process. That would cost multiple seconds for large memories, and
> be incurred during a virtual machine's pause time during live update.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Please, we just need to switch to use iommufd for pinning.
Piling up all these hacks gets us nowhere.
> ---
> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
> drivers/vhost/vhost.c | 15 ++++++++++++++
> drivers/vhost/vhost.h | 1 +
> include/uapi/linux/vhost.h | 10 ++++++++++
> 4 files changed, 67 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index b49e5831b3f0..5cf55ca4ec02 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> return ret;
> }
>
> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
> +{
> + int r;
> + struct vhost_dev *vdev = &v->vdev;
> + struct mm_struct *mm_old = vdev->mm;
> + struct mm_struct *mm_new = current->mm;
> + long pinned_vm = v->pinned_vm;
> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> +
> + if (!mm_old)
> + return -EINVAL;
> + mmgrab(mm_old);
> +
> + if (!v->vdpa->use_va &&
> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
> + r = -ENOMEM;
> + goto out;
> + }
> + r = vhost_vdpa_bind_mm(v, mm_new);
> + if (r)
> + goto out;
> +
> + r = vhost_dev_new_owner(vdev);
> + if (r) {
> + vhost_vdpa_bind_mm(v, mm_old);
> + goto out;
> + }
> +
> + if (!v->vdpa->use_va) {
> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
> + }
> +
> +out:
> + mmdrop(mm_old);
> + return r;
> +}
> +
> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
> void __user *argp)
> {
> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> case VHOST_VDPA_RESUME:
> r = vhost_vdpa_resume(v);
> break;
> + case VHOST_NEW_OWNER:
> + r = vhost_vdpa_new_owner(v);
> + break;
> default:
> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
> if (r == -ENOIOCTLCMD)
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b60955682474..ab40ae50552f 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>
> +/* Caller should have device mutex */
> +long vhost_dev_new_owner(struct vhost_dev *dev)
> +{
> + if (dev->mm == current->mm)
> + return -EBUSY;
> +
> + if (!vhost_dev_has_owner(dev))
> + return -EINVAL;
> +
> + vhost_detach_mm(dev);
> + vhost_attach_mm(dev);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vhost_dev_new_owner);
> +
> static struct vhost_iotlb *iotlb_alloc(void)
> {
> return vhost_iotlb_alloc(max_iotlb_entries,
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index bb75a292d50c..8b2018bb02b1 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -187,6 +187,7 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
> int (*msg_handler)(struct vhost_dev *dev, u32 asid,
> struct vhost_iotlb_msg *msg));
> long vhost_dev_set_owner(struct vhost_dev *dev);
> +long vhost_dev_new_owner(struct vhost_dev *dev);
> bool vhost_dev_has_owner(struct vhost_dev *dev);
> long vhost_dev_check_owner(struct vhost_dev *);
> struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index b95dd84eef2d..543d0e3434c3 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -123,6 +123,16 @@
> #define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> #define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64)
>
> +/* Set current process as the new owner of this file descriptor. The fd must
> + * already be owned, via a prior call to VHOST_SET_OWNER. The pinned memory
> + * count is transferred from the previous to the new owner.
> + * Errors:
> + * EINVAL: not owned
> + * EBUSY: caller is already the owner
> + * ENOMEM: RLIMIT_MEMLOCK exceeded
> + */
> +#define VHOST_NEW_OWNER _IO(VHOST_VIRTIO, 0x27)
> +
> /* VHOST_NET specific defines */
>
> /* Attach virtio net ring to a raw socket, or tap device.
> --
> 2.39.3
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-15 2:26 ` Jason Wang
2024-07-15 9:06 ` Michael S. Tsirkin
@ 2024-07-15 14:27 ` Steven Sistare
2024-07-16 5:16 ` Jason Wang
1 sibling, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-15 14:27 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/14/2024 10:26 PM, Jason Wang wrote:
> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> Add an ioctl to transfer file descriptor ownership and pinned memory
>> accounting from one process to another.
>>
>> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
>> as that would unpin all physical pages, requiring them to be repinned in
>> the new process. That would cost multiple seconds for large memories, and
>> be incurred during a virtual machine's pause time during live update.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
>> drivers/vhost/vhost.c | 15 ++++++++++++++
>> drivers/vhost/vhost.h | 1 +
>> include/uapi/linux/vhost.h | 10 ++++++++++
>> 4 files changed, 67 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index b49e5831b3f0..5cf55ca4ec02 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
>> return ret;
>> }
>>
>> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
>> +{
>> + int r;
>> + struct vhost_dev *vdev = &v->vdev;
>> + struct mm_struct *mm_old = vdev->mm;
>> + struct mm_struct *mm_new = current->mm;
>> + long pinned_vm = v->pinned_vm;
>> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
>> +
>> + if (!mm_old)
>> + return -EINVAL;
>> + mmgrab(mm_old);
>> +
>> + if (!v->vdpa->use_va &&
>> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
>> + r = -ENOMEM;
>> + goto out;
>> + }
>
> So this seems to allow an arbitrary process to execute this. Seems to be unsafe.
>
> I wonder if we need to add some checks here, maybe PID or other stuff
> to only allow the owner process to do this.
The original owner must send the file descriptor to the new owner.
That constitutes permission to take ownership.
>> + r = vhost_vdpa_bind_mm(v, mm_new);
>> + if (r)
>> + goto out;
>> +
>> + r = vhost_dev_new_owner(vdev);
>> + if (r) {
>> + vhost_vdpa_bind_mm(v, mm_old);
>> + goto out;
>> + }
>> +
>> + if (!v->vdpa->use_va) {
>> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
>> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
>> + }
>> +
>> +out:
>> + mmdrop(mm_old);
>> + return r;
>> +}
>> +
>> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
>> void __user *argp)
>> {
>> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>> case VHOST_VDPA_RESUME:
>> r = vhost_vdpa_resume(v);
>> break;
>> + case VHOST_NEW_OWNER:
>> + r = vhost_vdpa_new_owner(v);
>> + break;
>> default:
>> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
>> if (r == -ENOIOCTLCMD)
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index b60955682474..ab40ae50552f 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>> }
>> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>>
>> +/* Caller should have device mutex */
>> +long vhost_dev_new_owner(struct vhost_dev *dev)
>> +{
>> + if (dev->mm == current->mm)
>> + return -EBUSY;
>> +
>> + if (!vhost_dev_has_owner(dev))
>> + return -EINVAL;
>> +
>> + vhost_detach_mm(dev);
>> + vhost_attach_mm(dev);
>
> This seems to do nothing unless I miss something.
vhost_detach mm drops dev->mm.
vhost_attach_mm grabs current->mm.
- Steve
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
2024-07-15 2:31 ` Jason Wang
@ 2024-07-15 14:27 ` Steven Sistare
0 siblings, 0 replies; 35+ messages in thread
From: Steven Sistare @ 2024-07-15 14:27 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/14/2024 10:31 PM, Jason Wang wrote:
> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> Add the VHOST_BACKEND_F_NEW_OWNER backend capability, which indicates that
>> VHOST_NEW_OWNER is supported.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>
> Doesn't harm but should this be part of the previous patch?
One developer's minimal logical change is another developer's gratuitous patch:)
IMO separating this one makes the VHOST_NEW_OWNER patch slightly easier to grok.
- Steve
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-15 2:34 ` Jason Wang
@ 2024-07-15 14:28 ` Steven Sistare
2024-07-16 5:28 ` Jason Wang
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-15 14:28 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/14/2024 10:34 PM, Jason Wang wrote:
> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>> some devices need to know the new userland addresses of the dma mappings.
>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>> of a mapping. The new uaddr must address the same memory object as
>> originally mapped.
>>
>> The user must suspend the device before the old address is invalidated,
>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>> requirement is not enforced by the API.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> drivers/vhost/vdpa.c | 58 ++++++++++++++++++++++++++++++++
>> include/uapi/linux/vhost_types.h | 11 +++++-
>> 2 files changed, 68 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 4396fe1a90c4..51f71c45c4a9 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -1257,6 +1257,61 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>
>> }
>>
>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>> + struct vhost_iotlb *iotlb,
>> + struct vhost_iotlb_msg *msg)
>> +{
>> + struct vdpa_device *vdpa = v->vdpa;
>> + const struct vdpa_config_ops *ops = vdpa->config;
>> + u32 asid = iotlb_to_asid(iotlb);
>> + u64 start = msg->iova;
>> + u64 last = start + msg->size - 1;
>> + struct vhost_iotlb_map *map;
>> + int r = 0;
>> +
>> + if (msg->perm || !msg->size)
>> + return -EINVAL;
>> +
>> + map = vhost_iotlb_itree_first(iotlb, start, last);
>> + if (!map)
>> + return -ENOENT;
>> +
>> + if (map->start != start || map->last != last)
>> + return -EINVAL;
>
> I had a question here, if a buggy user space that:
>
> 1) forget to update some of the mappings
> 2) address is wrong
> 3) other cases.
>
> Does this mean the device can DMA to the previous owner?
Yes, but only to the mappings which were already pinned for DMA for this
device, and the old owner is giving the new owner permission to DMA to that
memory even without bugs.
> If yes, does
> this have security implications?
No. The previous owner has given the new owner permission to take over. They
trust each other completely. In the live update case, they are the same; the new
owner is the old owner reincarnated.
- Steve
>> +
>> + /*
>> + * The current implementation does not support the platform iommu
>> + * with use_va. And if !use_va, remap is not necessary.
>> + */
>> + if (!ops->set_map && !ops->dma_map)
>> + return -EINVAL;
>> +
>> + /*
>> + * The current implementation supports set_map but not dma_map.
>> + */
>> + if (!ops->set_map)
>> + return -EINVAL;
>> +
>> + /*
>> + * Do not verify that the new size@uaddr points to the same physical
>> + * pages as the old size@uaddr, because that would take time O(npages),
>> + * and would increase guest down time during live update. If the app
>> + * is buggy and they differ, then the app may corrupt its own memory,
>> + * but no one else's.
>> + */
>> +
>> + /*
>> + * Batch will finish later in BATCH_END by calling set_map for the new
>> + * addresses collected here. Non-batch must do it now.
>> + */
>> + if (!v->in_batch)
>> + r = ops->set_map(vdpa, asid, iotlb);
>> + if (!r)
>> + map->addr = msg->uaddr;
>> +
>> + return r;
>> +}
>> +
>> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>> struct vhost_iotlb *iotlb,
>> struct vhost_iotlb_msg *msg)
>> @@ -1336,6 +1391,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>> ops->set_map(vdpa, asid, iotlb);
>> v->in_batch = false;
>> break;
>> + case VHOST_IOTLB_REMAP:
>> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>> + break;
>> default:
>> r = -EINVAL;
>> break;
>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index 9177843951e9..35908315ff55 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>> /*
>> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>> * multiple mappings in one go: beginning with
>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>> * When one of these two values is used as the message type, the rest
>> * of the fields in the message are ignored. There's no guarantee that
>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>> */
>> #define VHOST_IOTLB_BATCH_BEGIN 5
>> #define VHOST_IOTLB_BATCH_END 6
>> +
>> +/*
>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
>> + * The new uaddr must address the same memory object as originally mapped.
>> + * Failure to do so will result in user memory corruption and/or device
>> + * misbehavior. iova and size must match the arguments used to create the
>> + * an existing mapping. Protection is not changed, and perm must be 0.
>> + */
>> +#define VHOST_IOTLB_REMAP 7
>> __u8 type;
>> };
>
> Thanks
>
>>
>> --
>> 2.39.3
>>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-15 2:14 ` Jason Wang
@ 2024-07-15 14:28 ` Steven Sistare
2024-07-16 5:30 ` Jason Wang
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-15 14:28 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/14/2024 10:14 PM, Jason Wang wrote:
> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> Live update is a technique wherein an application saves its state, exec's
>> to an updated version of itself, and restores its state. Clients of the
>> application experience a brief suspension of service, on the order of
>> 100's of milliseconds, but are otherwise unaffected.
>>
>> Define and implement interfaces that allow vdpa devices to be preserved
>> across fork or exec, to support live update for applications such as QEMU.
>> The device must be suspended during the update, but its DMA mappings are
>> preserved, so the suspension is brief.
>>
>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
>> accounting from one process to another.
>>
>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
>> VHOST_NEW_OWNER is supported.
>>
>> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
>> address in the new process.
>>
>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
>> VHOST_IOTLB_REMAP is supported and required. Some devices do not
>> require it, because the userland address of each DMA mapping is discarded
>> after being translated to a physical address.
>>
>> Here is a pseudo-code sequence for performing live update, based on
>> suspend + reset because resume is not yet widely available. The vdpa device
>> descriptor, fd, remains open across the exec.
>>
>> ioctl(fd, VHOST_VDPA_SUSPEND)
>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>
> I don't understand why we need a reset after suspend, it looks to me
> the previous suspend became meaningless.
The suspend guarantees completion of in-progress DMA. At least, that is
my interpretation of why that is done for live migration in QEMU, which
also does suspend + reset + re-create. I am following the live migration
model.
>> exec
>>
>> ioctl(fd, VHOST_NEW_OWNER)
>>
>> issue ioctls to re-create vrings
>>
>> if VHOST_BACKEND_F_IOTLB_REMAP
>
> So the idea is for a device that is using a virtual address, it
> doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
Actually the reverse: if the device translates virtual to physical when
the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
is not needed.
>> foreach dma mapping
>> write(fd, {VHOST_IOTLB_REMAP, new_addr})
>>
>> ioctl(fd, VHOST_VDPA_SET_STATUS,
>> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>
> From API level, this seems to be asymmetric as we have suspending but
> not resuming?
Again, I am just following the path taken by live migration.
I will be happy to use resume when the devices and QEMU support it.
The decision to use reset vs resume should not affect the definition
and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
- Steve
>> This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
>> as that would would unpin and repin physical pages, which would cost multiple
>> seconds for large memories.
>>
>> This is implemented in QEMU by the patch series "Live update: vdpa"
>> https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly)
>>
>> The QEMU implementation leverages the live migration code path, but after
>> CPR exec's new QEMU:
>> - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
>> - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
>> VHOST_IOTLB_UPDATE
>>
>> Changes in V2:
>> - clean up handling of set_map vs dma_map vs platform iommu in remap
>> - augment and clarify commit messages and comments
>>
>> Steve Sistare (7):
>> vhost-vdpa: count pinned memory
>> vhost-vdpa: pass mm to bind
>> vhost-vdpa: VHOST_NEW_OWNER
>> vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
>> vhost-vdpa: VHOST_IOTLB_REMAP
>> vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
>> vdpa/mlx5: new owner capability
>>
>> drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +-
>> drivers/vhost/vdpa.c | 125 ++++++++++++++++++++++++++++--
>> drivers/vhost/vhost.c | 15 ++++
>> drivers/vhost/vhost.h | 1 +
>> include/uapi/linux/vhost.h | 10 +++
>> include/uapi/linux/vhost_types.h | 15 +++-
>> 6 files changed, 161 insertions(+), 8 deletions(-)
>>
>> --
>> 2.39.3
>>
>
> Thanks
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-15 9:07 ` Michael S. Tsirkin
@ 2024-07-15 14:29 ` Steven Sistare
2024-07-15 14:38 ` Michael S. Tsirkin
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-15 14:29 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: virtualization, linux-kernel, Jason Wang, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/15/2024 5:07 AM, Michael S. Tsirkin wrote:
> On Fri, Jul 12, 2024 at 06:18:49AM -0700, Steve Sistare wrote:
>> Add an ioctl to transfer file descriptor ownership and pinned memory
>> accounting from one process to another.
>>
>> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
>> as that would unpin all physical pages, requiring them to be repinned in
>> the new process. That would cost multiple seconds for large memories, and
>> be incurred during a virtual machine's pause time during live update.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>
> Please, we just need to switch to use iommufd for pinning.
> Piling up all these hacks gets us nowhere.
I am working on iommufd kernel interfaces and QEMU changes. But who is working
on iommufd support for vdpa? If no one, or not for years, then adding these
small interfaces to vdpa plugs a signficant gap in live update coverage.
FWIW, the iommufd interfaces for live update will look much the same: change owner
and pinned memory accounting, and update virtual addresses. So adding that to vdpa
will not make it look like an odd duck.
- Steve
>> ---
>> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
>> drivers/vhost/vhost.c | 15 ++++++++++++++
>> drivers/vhost/vhost.h | 1 +
>> include/uapi/linux/vhost.h | 10 ++++++++++
>> 4 files changed, 67 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index b49e5831b3f0..5cf55ca4ec02 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
>> return ret;
>> }
>>
>> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
>> +{
>> + int r;
>> + struct vhost_dev *vdev = &v->vdev;
>> + struct mm_struct *mm_old = vdev->mm;
>> + struct mm_struct *mm_new = current->mm;
>> + long pinned_vm = v->pinned_vm;
>> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
>> +
>> + if (!mm_old)
>> + return -EINVAL;
>> + mmgrab(mm_old);
>> +
>> + if (!v->vdpa->use_va &&
>> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
>> + r = -ENOMEM;
>> + goto out;
>> + }
>> + r = vhost_vdpa_bind_mm(v, mm_new);
>> + if (r)
>> + goto out;
>> +
>> + r = vhost_dev_new_owner(vdev);
>> + if (r) {
>> + vhost_vdpa_bind_mm(v, mm_old);
>> + goto out;
>> + }
>> +
>> + if (!v->vdpa->use_va) {
>> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
>> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
>> + }
>> +
>> +out:
>> + mmdrop(mm_old);
>> + return r;
>> +}
>> +
>> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
>> void __user *argp)
>> {
>> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>> case VHOST_VDPA_RESUME:
>> r = vhost_vdpa_resume(v);
>> break;
>> + case VHOST_NEW_OWNER:
>> + r = vhost_vdpa_new_owner(v);
>> + break;
>> default:
>> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
>> if (r == -ENOIOCTLCMD)
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index b60955682474..ab40ae50552f 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>> }
>> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>>
>> +/* Caller should have device mutex */
>> +long vhost_dev_new_owner(struct vhost_dev *dev)
>> +{
>> + if (dev->mm == current->mm)
>> + return -EBUSY;
>> +
>> + if (!vhost_dev_has_owner(dev))
>> + return -EINVAL;
>> +
>> + vhost_detach_mm(dev);
>> + vhost_attach_mm(dev);
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(vhost_dev_new_owner);
>> +
>> static struct vhost_iotlb *iotlb_alloc(void)
>> {
>> return vhost_iotlb_alloc(max_iotlb_entries,
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index bb75a292d50c..8b2018bb02b1 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -187,6 +187,7 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
>> int (*msg_handler)(struct vhost_dev *dev, u32 asid,
>> struct vhost_iotlb_msg *msg));
>> long vhost_dev_set_owner(struct vhost_dev *dev);
>> +long vhost_dev_new_owner(struct vhost_dev *dev);
>> bool vhost_dev_has_owner(struct vhost_dev *dev);
>> long vhost_dev_check_owner(struct vhost_dev *);
>> struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>> index b95dd84eef2d..543d0e3434c3 100644
>> --- a/include/uapi/linux/vhost.h
>> +++ b/include/uapi/linux/vhost.h
>> @@ -123,6 +123,16 @@
>> #define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
>> #define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64)
>>
>> +/* Set current process as the new owner of this file descriptor. The fd must
>> + * already be owned, via a prior call to VHOST_SET_OWNER. The pinned memory
>> + * count is transferred from the previous to the new owner.
>> + * Errors:
>> + * EINVAL: not owned
>> + * EBUSY: caller is already the owner
>> + * ENOMEM: RLIMIT_MEMLOCK exceeded
>> + */
>> +#define VHOST_NEW_OWNER _IO(VHOST_VIRTIO, 0x27)
>> +
>> /* VHOST_NET specific defines */
>>
>> /* Attach virtio net ring to a raw socket, or tap device.
>> --
>> 2.39.3
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-15 14:29 ` Steven Sistare
@ 2024-07-15 14:38 ` Michael S. Tsirkin
2024-07-15 15:38 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 14:38 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Jason Wang, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Mon, Jul 15, 2024 at 10:29:26AM -0400, Steven Sistare wrote:
> On 7/15/2024 5:07 AM, Michael S. Tsirkin wrote:
> > On Fri, Jul 12, 2024 at 06:18:49AM -0700, Steve Sistare wrote:
> > > Add an ioctl to transfer file descriptor ownership and pinned memory
> > > accounting from one process to another.
> > >
> > > This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
> > > as that would unpin all physical pages, requiring them to be repinned in
> > > the new process. That would cost multiple seconds for large memories, and
> > > be incurred during a virtual machine's pause time during live update.
> > >
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >
> > Please, we just need to switch to use iommufd for pinning.
> > Piling up all these hacks gets us nowhere.
>
> I am working on iommufd kernel interfaces and QEMU changes. But who is working
> on iommufd support for vdpa? If no one, or not for years, then adding these
> small interfaces to vdpa plugs a signficant gap in live update coverage.
>
> FWIW, the iommufd interfaces for live update will look much the same: change owner
> and pinned memory accounting, and update virtual addresses. So adding that to vdpa
> will not make it look like an odd duck.
>
> - Steve
I think that no one is working on it - Cindy posted some rfcs in January
("vhost-vdpa: add support for iommufd"). Feel free to pick that up.
What you described is just more of a reason not to duplicate this code.
And it's always the same: a small extension here, a small extension there.
If you can make do with existing kernel interfaces, fine,
one can argue that userspace code is useful to support existing kernels.
--
MST
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-15 14:38 ` Michael S. Tsirkin
@ 2024-07-15 15:38 ` Steven Sistare
0 siblings, 0 replies; 35+ messages in thread
From: Steven Sistare @ 2024-07-15 15:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: virtualization, linux-kernel, Jason Wang, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/15/2024 10:38 AM, Michael S. Tsirkin wrote:
> On Mon, Jul 15, 2024 at 10:29:26AM -0400, Steven Sistare wrote:
>> On 7/15/2024 5:07 AM, Michael S. Tsirkin wrote:
>>> On Fri, Jul 12, 2024 at 06:18:49AM -0700, Steve Sistare wrote:
>>>> Add an ioctl to transfer file descriptor ownership and pinned memory
>>>> accounting from one process to another.
>>>>
>>>> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
>>>> as that would unpin all physical pages, requiring them to be repinned in
>>>> the new process. That would cost multiple seconds for large memories, and
>>>> be incurred during a virtual machine's pause time during live update.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> Please, we just need to switch to use iommufd for pinning.
>>> Piling up all these hacks gets us nowhere.
>>
>> I am working on iommufd kernel interfaces and QEMU changes. But who is working
>> on iommufd support for vdpa? If no one, or not for years, then adding these
>> small interfaces to vdpa plugs a signficant gap in live update coverage.
>>
>> FWIW, the iommufd interfaces for live update will look much the same: change owner
>> and pinned memory accounting, and update virtual addresses. So adding that to vdpa
>> will not make it look like an odd duck.
>>
>> - Steve
>
> I think that no one is working on it - Cindy posted some rfcs in January
> ("vhost-vdpa: add support for iommufd"). Feel free to pick that up.
> What you described is just more of a reason not to duplicate this code.
> And it's always the same: a small extension here, a small extension there.
> If you can make do with existing kernel interfaces, fine,
> one can argue that userspace code is useful to support existing kernels.
Respectfully, I will not be volunteering to take that on. My work funnel to
deliver live update is already too wide. I hope that folks on this distribution
find the functionality interesting and useful enough to continue to review this
current proposal.
- Steve
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-15 14:27 ` Steven Sistare
@ 2024-07-16 5:16 ` Jason Wang
2024-07-17 18:28 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-16 5:16 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Mon, Jul 15, 2024 at 10:27 PM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/14/2024 10:26 PM, Jason Wang wrote:
> > On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> Add an ioctl to transfer file descriptor ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
> >> as that would unpin all physical pages, requiring them to be repinned in
> >> the new process. That would cost multiple seconds for large memories, and
> >> be incurred during a virtual machine's pause time during live update.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
> >> drivers/vhost/vhost.c | 15 ++++++++++++++
> >> drivers/vhost/vhost.h | 1 +
> >> include/uapi/linux/vhost.h | 10 ++++++++++
> >> 4 files changed, 67 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index b49e5831b3f0..5cf55ca4ec02 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> >> return ret;
> >> }
> >>
> >> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
> >> +{
> >> + int r;
> >> + struct vhost_dev *vdev = &v->vdev;
> >> + struct mm_struct *mm_old = vdev->mm;
> >> + struct mm_struct *mm_new = current->mm;
> >> + long pinned_vm = v->pinned_vm;
> >> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> >> +
> >> + if (!mm_old)
> >> + return -EINVAL;
> >> + mmgrab(mm_old);
> >> +
> >> + if (!v->vdpa->use_va &&
> >> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
> >> + r = -ENOMEM;
> >> + goto out;
> >> + }
> >
> > So this seems to allow an arbitrary process to execute this. Seems to be unsafe.
> >
> > I wonder if we need to add some checks here, maybe PID or other stuff
> > to only allow the owner process to do this.
>
> The original owner must send the file descriptor to the new owner.
This seems not to be in the steps you put in the cover letter.
> That constitutes permission to take ownership.
This seems like a relaxed version of the reset_owner:
Currently, reset_owner have the following check:
/* Caller should have device mutex */
long vhost_dev_check_owner(struct vhost_dev *dev)
{
/* Are you the owner? If not, I don't think you mean to do that */
return dev->mm == current->mm ? 0 : -EPERM;
}
EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
It means even if the fd is passed to some other process, the reset
owner won't work there.
Thanks
>
> >> + r = vhost_vdpa_bind_mm(v, mm_new);
> >> + if (r)
> >> + goto out;
> >> +
> >> + r = vhost_dev_new_owner(vdev);
> >> + if (r) {
> >> + vhost_vdpa_bind_mm(v, mm_old);
> >> + goto out;
> >> + }
> >> +
> >> + if (!v->vdpa->use_va) {
> >> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
> >> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
> >> + }
> >> +
> >> +out:
> >> + mmdrop(mm_old);
> >> + return r;
> >> +}
> >> +
> >> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
> >> void __user *argp)
> >> {
> >> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >> case VHOST_VDPA_RESUME:
> >> r = vhost_vdpa_resume(v);
> >> break;
> >> + case VHOST_NEW_OWNER:
> >> + r = vhost_vdpa_new_owner(v);
> >> + break;
> >> default:
> >> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
> >> if (r == -ENOIOCTLCMD)
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index b60955682474..ab40ae50552f 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> >> }
> >> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
> >>
> >> +/* Caller should have device mutex */
> >> +long vhost_dev_new_owner(struct vhost_dev *dev)
> >> +{
> >> + if (dev->mm == current->mm)
> >> + return -EBUSY;
> >> +
> >> + if (!vhost_dev_has_owner(dev))
> >> + return -EINVAL;
> >> +
> >> + vhost_detach_mm(dev);
> >> + vhost_attach_mm(dev);
> >
> > This seems to do nothing unless I miss something.
>
> vhost_detach mm drops dev->mm.
> vhost_attach_mm grabs current->mm.
>
> - Steve
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-15 14:28 ` Steven Sistare
@ 2024-07-16 5:28 ` Jason Wang
2024-07-17 18:29 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-16 5:28 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Mon, Jul 15, 2024 at 10:28 PM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/14/2024 10:34 PM, Jason Wang wrote:
> > On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> >> some devices need to know the new userland addresses of the dma mappings.
> >> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> >> of a mapping. The new uaddr must address the same memory object as
> >> originally mapped.
> >>
> >> The user must suspend the device before the old address is invalidated,
> >> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> >> requirement is not enforced by the API.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >> drivers/vhost/vdpa.c | 58 ++++++++++++++++++++++++++++++++
> >> include/uapi/linux/vhost_types.h | 11 +++++-
> >> 2 files changed, 68 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 4396fe1a90c4..51f71c45c4a9 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -1257,6 +1257,61 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >>
> >> }
> >>
> >> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> >> + struct vhost_iotlb *iotlb,
> >> + struct vhost_iotlb_msg *msg)
> >> +{
> >> + struct vdpa_device *vdpa = v->vdpa;
> >> + const struct vdpa_config_ops *ops = vdpa->config;
> >> + u32 asid = iotlb_to_asid(iotlb);
> >> + u64 start = msg->iova;
> >> + u64 last = start + msg->size - 1;
> >> + struct vhost_iotlb_map *map;
> >> + int r = 0;
> >> +
> >> + if (msg->perm || !msg->size)
> >> + return -EINVAL;
> >> +
> >> + map = vhost_iotlb_itree_first(iotlb, start, last);
> >> + if (!map)
> >> + return -ENOENT;
> >> +
> >> + if (map->start != start || map->last != last)
> >> + return -EINVAL;
> >
> > I had a question here, if a buggy user space that:
> >
> > 1) forget to update some of the mappings
> > 2) address is wrong
> > 3) other cases.
> >
> > Does this mean the device can DMA to the previous owner?
>
> Yes, but only to the mappings which were already pinned for DMA for this
> device, and the old owner is giving the new owner permission to DMA to that
> memory even without bugs.
>
> > If yes, does
> > this have security implications?
>
> No. The previous owner has given the new owner permission to take over. They
> trust each other completely. In the live update case, they are the same; the new
> owner is the old owner reincarnated.
I understand the processes may trust each other but I meant the kernel
may not trust those processes.
For example:
1) old owner pass fd to new owner which is another process
2) the new owner do VHOST_NEW_OWNER
3) new owner doesn't do remap correctly
There's no way for the old owner to remove/unpin the mappings as we
have the owner check in IOTLB_UPDATE. Looks like a potential way for
DOS.
Thanks
>
> - Steve
>
> >> +
> >> + /*
> >> + * The current implementation does not support the platform iommu
> >> + * with use_va. And if !use_va, remap is not necessary.
> >> + */
> >> + if (!ops->set_map && !ops->dma_map)
> >> + return -EINVAL;
> >> +
> >> + /*
> >> + * The current implementation supports set_map but not dma_map.
> >> + */
> >> + if (!ops->set_map)
> >> + return -EINVAL;
> >> +
> >> + /*
> >> + * Do not verify that the new size@uaddr points to the same physical
> >> + * pages as the old size@uaddr, because that would take time O(npages),
> >> + * and would increase guest down time during live update. If the app
> >> + * is buggy and they differ, then the app may corrupt its own memory,
> >> + * but no one else's.
> >> + */
> >> +
> >> + /*
> >> + * Batch will finish later in BATCH_END by calling set_map for the new
> >> + * addresses collected here. Non-batch must do it now.
> >> + */
> >> + if (!v->in_batch)
> >> + r = ops->set_map(vdpa, asid, iotlb);
> >> + if (!r)
> >> + map->addr = msg->uaddr;
> >> +
> >> + return r;
> >> +}
> >> +
> >> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >> struct vhost_iotlb *iotlb,
> >> struct vhost_iotlb_msg *msg)
> >> @@ -1336,6 +1391,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> >> ops->set_map(vdpa, asid, iotlb);
> >> v->in_batch = false;
> >> break;
> >> + case VHOST_IOTLB_REMAP:
> >> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> >> + break;
> >> default:
> >> r = -EINVAL;
> >> break;
> >> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> >> index 9177843951e9..35908315ff55 100644
> >> --- a/include/uapi/linux/vhost_types.h
> >> +++ b/include/uapi/linux/vhost_types.h
> >> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> >> /*
> >> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> >> * multiple mappings in one go: beginning with
> >> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> >> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> >> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> >> * When one of these two values is used as the message type, the rest
> >> * of the fields in the message are ignored. There's no guarantee that
> >> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> >> */
> >> #define VHOST_IOTLB_BATCH_BEGIN 5
> >> #define VHOST_IOTLB_BATCH_END 6
> >> +
> >> +/*
> >> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> >> + * The new uaddr must address the same memory object as originally mapped.
> >> + * Failure to do so will result in user memory corruption and/or device
> >> + * misbehavior. iova and size must match the arguments used to create the
> >> + * an existing mapping. Protection is not changed, and perm must be 0.
> >> + */
> >> +#define VHOST_IOTLB_REMAP 7
> >> __u8 type;
> >> };
> >
> > Thanks
> >
> >>
> >> --
> >> 2.39.3
> >>
> >
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-15 14:28 ` Steven Sistare
@ 2024-07-16 5:30 ` Jason Wang
2024-07-17 18:29 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-16 5:30 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/14/2024 10:14 PM, Jason Wang wrote:
> > On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> Live update is a technique wherein an application saves its state, exec's
> >> to an updated version of itself, and restores its state. Clients of the
> >> application experience a brief suspension of service, on the order of
> >> 100's of milliseconds, but are otherwise unaffected.
> >>
> >> Define and implement interfaces that allow vdpa devices to be preserved
> >> across fork or exec, to support live update for applications such as QEMU.
> >> The device must be suspended during the update, but its DMA mappings are
> >> preserved, so the suspension is brief.
> >>
> >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >> VHOST_NEW_OWNER is supported.
> >>
> >> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
> >> address in the new process.
> >>
> >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >> VHOST_IOTLB_REMAP is supported and required. Some devices do not
> >> require it, because the userland address of each DMA mapping is discarded
> >> after being translated to a physical address.
> >>
> >> Here is a pseudo-code sequence for performing live update, based on
> >> suspend + reset because resume is not yet widely available. The vdpa device
> >> descriptor, fd, remains open across the exec.
> >>
> >> ioctl(fd, VHOST_VDPA_SUSPEND)
> >> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >
> > I don't understand why we need a reset after suspend, it looks to me
> > the previous suspend became meaningless.
>
> The suspend guarantees completion of in-progress DMA. At least, that is
> my interpretation of why that is done for live migration in QEMU, which
> also does suspend + reset + re-create. I am following the live migration
> model.
Yes, but any reason we need a reset after the suspension?
>
> >> exec
> >>
> >> ioctl(fd, VHOST_NEW_OWNER)
> >>
> >> issue ioctls to re-create vrings
> >>
> >> if VHOST_BACKEND_F_IOTLB_REMAP
> >
> > So the idea is for a device that is using a virtual address, it
> > doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
>
> Actually the reverse: if the device translates virtual to physical when
> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
> is not needed.
Ok.
>
> >> foreach dma mapping
> >> write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >>
> >> ioctl(fd, VHOST_VDPA_SET_STATUS,
> >> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
> >
> > From API level, this seems to be asymmetric as we have suspending but
> > not resuming?
>
> Again, I am just following the path taken by live migration.
> I will be happy to use resume when the devices and QEMU support it.
> The decision to use reset vs resume should not affect the definition
> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
>
> - Steve
Thanks
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-16 5:16 ` Jason Wang
@ 2024-07-17 18:28 ` Steven Sistare
2024-07-22 7:26 ` Jason Wang
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-17 18:28 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/16/2024 1:16 AM, Jason Wang wrote:
> On Mon, Jul 15, 2024 at 10:27 PM Steven Sistare
> <steven.sistare@oracle.com> wrote:
>>
>> On 7/14/2024 10:26 PM, Jason Wang wrote:
>>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>
>>>> Add an ioctl to transfer file descriptor ownership and pinned memory
>>>> accounting from one process to another.
>>>>
>>>> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
>>>> as that would unpin all physical pages, requiring them to be repinned in
>>>> the new process. That would cost multiple seconds for large memories, and
>>>> be incurred during a virtual machine's pause time during live update.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
>>>> drivers/vhost/vhost.c | 15 ++++++++++++++
>>>> drivers/vhost/vhost.h | 1 +
>>>> include/uapi/linux/vhost.h | 10 ++++++++++
>>>> 4 files changed, 67 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>> index b49e5831b3f0..5cf55ca4ec02 100644
>>>> --- a/drivers/vhost/vdpa.c
>>>> +++ b/drivers/vhost/vdpa.c
>>>> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
>>>> return ret;
>>>> }
>>>>
>>>> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
>>>> +{
>>>> + int r;
>>>> + struct vhost_dev *vdev = &v->vdev;
>>>> + struct mm_struct *mm_old = vdev->mm;
>>>> + struct mm_struct *mm_new = current->mm;
>>>> + long pinned_vm = v->pinned_vm;
>>>> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
>>>> +
>>>> + if (!mm_old)
>>>> + return -EINVAL;
>>>> + mmgrab(mm_old);
>>>> +
>>>> + if (!v->vdpa->use_va &&
>>>> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
>>>> + r = -ENOMEM;
>>>> + goto out;
>>>> + }
>>>
>>> So this seems to allow an arbitrary process to execute this. Seems to be unsafe.
>>>
>>> I wonder if we need to add some checks here, maybe PID or other stuff
>>> to only allow the owner process to do this.
>>
>> The original owner must send the file descriptor to the new owner.
>
> This seems not to be in the steps you put in the cover letter.
It's there:
"The vdpa device descriptor, fd, remains open across the exec."
But, I can say more about how fd visibility constitutes permission to changer
owner in this commit message.
>> That constitutes permission to take ownership.
>
> This seems like a relaxed version of the reset_owner:
>
> Currently, reset_owner have the following check:
Not relaxed, just different. A process cannot do anything with fd if it
is not the owner, *except* for becoming the new owner. Holding the fd is
like holding a key.
- Steve
> /* Caller should have device mutex */
> long vhost_dev_check_owner(struct vhost_dev *dev)
> {
> /* Are you the owner? If not, I don't think you mean to do that */
> return dev->mm == current->mm ? 0 : -EPERM;
> }
> EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
>
> It means even if the fd is passed to some other process, the reset
> owner won't work there.
>
> Thanks
>
>>
>>>> + r = vhost_vdpa_bind_mm(v, mm_new);
>>>> + if (r)
>>>> + goto out;
>>>> +
>>>> + r = vhost_dev_new_owner(vdev);
>>>> + if (r) {
>>>> + vhost_vdpa_bind_mm(v, mm_old);
>>>> + goto out;
>>>> + }
>>>> +
>>>> + if (!v->vdpa->use_va) {
>>>> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
>>>> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
>>>> + }
>>>> +
>>>> +out:
>>>> + mmdrop(mm_old);
>>>> + return r;
>>>> +}
>>>> +
>>>> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
>>>> void __user *argp)
>>>> {
>>>> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>>>> case VHOST_VDPA_RESUME:
>>>> r = vhost_vdpa_resume(v);
>>>> break;
>>>> + case VHOST_NEW_OWNER:
>>>> + r = vhost_vdpa_new_owner(v);
>>>> + break;
>>>> default:
>>>> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
>>>> if (r == -ENOIOCTLCMD)
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index b60955682474..ab40ae50552f 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>>>> }
>>>> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>>>>
>>>> +/* Caller should have device mutex */
>>>> +long vhost_dev_new_owner(struct vhost_dev *dev)
>>>> +{
>>>> + if (dev->mm == current->mm)
>>>> + return -EBUSY;
>>>> +
>>>> + if (!vhost_dev_has_owner(dev))
>>>> + return -EINVAL;
>>>> +
>>>> + vhost_detach_mm(dev);
>>>> + vhost_attach_mm(dev);
>>>
>>> This seems to do nothing unless I miss something.
>>
>> vhost_detach mm drops dev->mm.
>> vhost_attach_mm grabs current->mm.
>>
>> - Steve
>>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-16 5:28 ` Jason Wang
@ 2024-07-17 18:29 ` Steven Sistare
2024-07-18 0:45 ` Jason Wang
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-17 18:29 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Alex Williamson
On 7/16/2024 1:28 AM, Jason Wang wrote:
> On Mon, Jul 15, 2024 at 10:28 PM Steven Sistare
> <steven.sistare@oracle.com> wrote:
>>
>> On 7/14/2024 10:34 PM, Jason Wang wrote:
>>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>
>>>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>>>> some devices need to know the new userland addresses of the dma mappings.
>>>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>>>> of a mapping. The new uaddr must address the same memory object as
>>>> originally mapped.
>>>>
>>>> The user must suspend the device before the old address is invalidated,
>>>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>>>> requirement is not enforced by the API.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> drivers/vhost/vdpa.c | 58 ++++++++++++++++++++++++++++++++
>>>> include/uapi/linux/vhost_types.h | 11 +++++-
>>>> 2 files changed, 68 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>> index 4396fe1a90c4..51f71c45c4a9 100644
>>>> --- a/drivers/vhost/vdpa.c
>>>> +++ b/drivers/vhost/vdpa.c
>>>> @@ -1257,6 +1257,61 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>>>
>>>> }
>>>>
>>>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>>>> + struct vhost_iotlb *iotlb,
>>>> + struct vhost_iotlb_msg *msg)
>>>> +{
>>>> + struct vdpa_device *vdpa = v->vdpa;
>>>> + const struct vdpa_config_ops *ops = vdpa->config;
>>>> + u32 asid = iotlb_to_asid(iotlb);
>>>> + u64 start = msg->iova;
>>>> + u64 last = start + msg->size - 1;
>>>> + struct vhost_iotlb_map *map;
>>>> + int r = 0;
>>>> +
>>>> + if (msg->perm || !msg->size)
>>>> + return -EINVAL;
>>>> +
>>>> + map = vhost_iotlb_itree_first(iotlb, start, last);
>>>> + if (!map)
>>>> + return -ENOENT;
>>>> +
>>>> + if (map->start != start || map->last != last)
>>>> + return -EINVAL;
>>>
>>> I had a question here, if a buggy user space that:
>>>
>>> 1) forget to update some of the mappings
>>> 2) address is wrong
>>> 3) other cases.
>>>
>>> Does this mean the device can DMA to the previous owner?
>>
>> Yes, but only to the mappings which were already pinned for DMA for this
>> device, and the old owner is giving the new owner permission to DMA to that
>> memory even without bugs.
>>
>>> If yes, does
>>> this have security implications?
>>
>> No. The previous owner has given the new owner permission to take over. They
>> trust each other completely. In the live update case, they are the same; the new
>> owner is the old owner reincarnated.
>
> I understand the processes may trust each other but I meant the kernel
> may not trust those processes.
If a process holds the key (the fd) then the kernel can trust that is has
permission to use it. Keys are only passed around voluntarily, unless there
is a bug.
> For example:
>
> 1) old owner pass fd to new owner which is another process
> 2) the new owner do VHOST_NEW_OWNER
> 3) new owner doesn't do remap correctly
>
> There's no way for the old owner to remove/unpin the mappings as we
> have the owner check in IOTLB_UPDATE. Looks like a potential way for
> DOS.
This is a bug in the second cooperating process, not a DOS. The application
must fix it. Sometimes you cannot recover from an application bug at run time.
BTW, at one time vfio enforced the concept of an owner, but Alex deleted it.
It adds no value, because possession of the fd is the key.
ffed0518d871 ("vfio: remove useless judgement")
- Steve
>>>> +
>>>> + /*
>>>> + * The current implementation does not support the platform iommu
>>>> + * with use_va. And if !use_va, remap is not necessary.
>>>> + */
>>>> + if (!ops->set_map && !ops->dma_map)
>>>> + return -EINVAL;
>>>> +
>>>> + /*
>>>> + * The current implementation supports set_map but not dma_map.
>>>> + */
>>>> + if (!ops->set_map)
>>>> + return -EINVAL;
>>>> +
>>>> + /*
>>>> + * Do not verify that the new size@uaddr points to the same physical
>>>> + * pages as the old size@uaddr, because that would take time O(npages),
>>>> + * and would increase guest down time during live update. If the app
>>>> + * is buggy and they differ, then the app may corrupt its own memory,
>>>> + * but no one else's.
>>>> + */
>>>> +
>>>> + /*
>>>> + * Batch will finish later in BATCH_END by calling set_map for the new
>>>> + * addresses collected here. Non-batch must do it now.
>>>> + */
>>>> + if (!v->in_batch)
>>>> + r = ops->set_map(vdpa, asid, iotlb);
>>>> + if (!r)
>>>> + map->addr = msg->uaddr;
>>>> +
>>>> + return r;
>>>> +}
>>>> +
>>>> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>> struct vhost_iotlb *iotlb,
>>>> struct vhost_iotlb_msg *msg)
>>>> @@ -1336,6 +1391,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>>>> ops->set_map(vdpa, asid, iotlb);
>>>> v->in_batch = false;
>>>> break;
>>>> + case VHOST_IOTLB_REMAP:
>>>> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>>>> + break;
>>>> default:
>>>> r = -EINVAL;
>>>> break;
>>>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>>>> index 9177843951e9..35908315ff55 100644
>>>> --- a/include/uapi/linux/vhost_types.h
>>>> +++ b/include/uapi/linux/vhost_types.h
>>>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>>>> /*
>>>> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>>>> * multiple mappings in one go: beginning with
>>>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
>>>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>>>> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>>>> * When one of these two values is used as the message type, the rest
>>>> * of the fields in the message are ignored. There's no guarantee that
>>>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>>>> */
>>>> #define VHOST_IOTLB_BATCH_BEGIN 5
>>>> #define VHOST_IOTLB_BATCH_END 6
>>>> +
>>>> +/*
>>>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
>>>> + * The new uaddr must address the same memory object as originally mapped.
>>>> + * Failure to do so will result in user memory corruption and/or device
>>>> + * misbehavior. iova and size must match the arguments used to create the
>>>> + * an existing mapping. Protection is not changed, and perm must be 0.
>>>> + */
>>>> +#define VHOST_IOTLB_REMAP 7
>>>> __u8 type;
>>>> };
>>>
>>> Thanks
>>>
>>>>
>>>> --
>>>> 2.39.3
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-16 5:30 ` Jason Wang
@ 2024-07-17 18:29 ` Steven Sistare
2024-07-18 0:33 ` Jason Wang
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-17 18:29 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/16/2024 1:30 AM, Jason Wang wrote:
> On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
> <steven.sistare@oracle.com> wrote:
>>
>> On 7/14/2024 10:14 PM, Jason Wang wrote:
>>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>
>>>> Live update is a technique wherein an application saves its state, exec's
>>>> to an updated version of itself, and restores its state. Clients of the
>>>> application experience a brief suspension of service, on the order of
>>>> 100's of milliseconds, but are otherwise unaffected.
>>>>
>>>> Define and implement interfaces that allow vdpa devices to be preserved
>>>> across fork or exec, to support live update for applications such as QEMU.
>>>> The device must be suspended during the update, but its DMA mappings are
>>>> preserved, so the suspension is brief.
>>>>
>>>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
>>>> accounting from one process to another.
>>>>
>>>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
>>>> VHOST_NEW_OWNER is supported.
>>>>
>>>> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
>>>> address in the new process.
>>>>
>>>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
>>>> VHOST_IOTLB_REMAP is supported and required. Some devices do not
>>>> require it, because the userland address of each DMA mapping is discarded
>>>> after being translated to a physical address.
>>>>
>>>> Here is a pseudo-code sequence for performing live update, based on
>>>> suspend + reset because resume is not yet widely available. The vdpa device
>>>> descriptor, fd, remains open across the exec.
>>>>
>>>> ioctl(fd, VHOST_VDPA_SUSPEND)
>>>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>>>
>>> I don't understand why we need a reset after suspend, it looks to me
>>> the previous suspend became meaningless.
>>
>> The suspend guarantees completion of in-progress DMA. At least, that is
>> my interpretation of why that is done for live migration in QEMU, which
>> also does suspend + reset + re-create. I am following the live migration
>> model.
>
> Yes, but any reason we need a reset after the suspension?
Probably not. I found it cleanest to call reset and let new qemu configure the
device as it always does during startup, rather than altering those code paths
to skip the kernel calls. So, consider this to be just one of several possible
userland algorithms.
- Steve
>>>> exec
>>>>
>>>> ioctl(fd, VHOST_NEW_OWNER)
>>>>
>>>> issue ioctls to re-create vrings
>>>>
>>>> if VHOST_BACKEND_F_IOTLB_REMAP
>>>
>>> So the idea is for a device that is using a virtual address, it
>>> doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
>>
>> Actually the reverse: if the device translates virtual to physical when
>> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
>> is not needed.
>
> Ok.
>
>>
>>>> foreach dma mapping
>>>> write(fd, {VHOST_IOTLB_REMAP, new_addr})
>>>>
>>>> ioctl(fd, VHOST_VDPA_SET_STATUS,
>>>> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>>>
>>> From API level, this seems to be asymmetric as we have suspending but
>>> not resuming?
>>
>> Again, I am just following the path taken by live migration.
>> I will be happy to use resume when the devices and QEMU support it.
>> The decision to use reset vs resume should not affect the definition
>> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
>>
>> - Steve
>
> Thanks
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-17 18:29 ` Steven Sistare
@ 2024-07-18 0:33 ` Jason Wang
2024-07-20 21:34 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-18 0:33 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/16/2024 1:30 AM, Jason Wang wrote:
> > On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
> > <steven.sistare@oracle.com> wrote:
> >>
> >> On 7/14/2024 10:14 PM, Jason Wang wrote:
> >>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>>>
> >>>> Live update is a technique wherein an application saves its state, exec's
> >>>> to an updated version of itself, and restores its state. Clients of the
> >>>> application experience a brief suspension of service, on the order of
> >>>> 100's of milliseconds, but are otherwise unaffected.
> >>>>
> >>>> Define and implement interfaces that allow vdpa devices to be preserved
> >>>> across fork or exec, to support live update for applications such as QEMU.
> >>>> The device must be suspended during the update, but its DMA mappings are
> >>>> preserved, so the suspension is brief.
> >>>>
> >>>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >>>> accounting from one process to another.
> >>>>
> >>>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >>>> VHOST_NEW_OWNER is supported.
> >>>>
> >>>> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
> >>>> address in the new process.
> >>>>
> >>>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >>>> VHOST_IOTLB_REMAP is supported and required. Some devices do not
> >>>> require it, because the userland address of each DMA mapping is discarded
> >>>> after being translated to a physical address.
> >>>>
> >>>> Here is a pseudo-code sequence for performing live update, based on
> >>>> suspend + reset because resume is not yet widely available. The vdpa device
> >>>> descriptor, fd, remains open across the exec.
> >>>>
> >>>> ioctl(fd, VHOST_VDPA_SUSPEND)
> >>>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >>>
> >>> I don't understand why we need a reset after suspend, it looks to me
> >>> the previous suspend became meaningless.
> >>
> >> The suspend guarantees completion of in-progress DMA. At least, that is
> >> my interpretation of why that is done for live migration in QEMU, which
> >> also does suspend + reset + re-create. I am following the live migration
> >> model.
> >
> > Yes, but any reason we need a reset after the suspension?
>
> Probably not. I found it cleanest to call reset and let new qemu configure the
> device as it always does during startup, rather than altering those code paths
> to skip the kernel calls.
If we care about the downtime, I think avoiding a reset should be faster.
> So, consider this to be just one of several possible
> userland algorithms.
>
> - Steve
Thanks
>
> >>>> exec
> >>>>
> >>>> ioctl(fd, VHOST_NEW_OWNER)
> >>>>
> >>>> issue ioctls to re-create vrings
> >>>>
> >>>> if VHOST_BACKEND_F_IOTLB_REMAP
> >>>
> >>> So the idea is for a device that is using a virtual address, it
> >>> doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
> >>
> >> Actually the reverse: if the device translates virtual to physical when
> >> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
> >> is not needed.
> >
> > Ok.
> >
> >>
> >>>> foreach dma mapping
> >>>> write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >>>>
> >>>> ioctl(fd, VHOST_VDPA_SET_STATUS,
> >>>> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
> >>>
> >>> From API level, this seems to be asymmetric as we have suspending but
> >>> not resuming?
> >>
> >> Again, I am just following the path taken by live migration.
> >> I will be happy to use resume when the devices and QEMU support it.
> >> The decision to use reset vs resume should not affect the definition
> >> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
> >>
> >> - Steve
> >
> > Thanks
> >
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-17 18:29 ` Steven Sistare
@ 2024-07-18 0:45 ` Jason Wang
2024-07-18 19:39 ` Michael S. Tsirkin
0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2024-07-18 0:45 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Alex Williamson
On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/16/2024 1:28 AM, Jason Wang wrote:
> > On Mon, Jul 15, 2024 at 10:28 PM Steven Sistare
> > <steven.sistare@oracle.com> wrote:
> >>
> >> On 7/14/2024 10:34 PM, Jason Wang wrote:
> >>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>>>
> >>>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> >>>> some devices need to know the new userland addresses of the dma mappings.
> >>>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> >>>> of a mapping. The new uaddr must address the same memory object as
> >>>> originally mapped.
> >>>>
> >>>> The user must suspend the device before the old address is invalidated,
> >>>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> >>>> requirement is not enforced by the API.
> >>>>
> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>> ---
> >>>> drivers/vhost/vdpa.c | 58 ++++++++++++++++++++++++++++++++
> >>>> include/uapi/linux/vhost_types.h | 11 +++++-
> >>>> 2 files changed, 68 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>> index 4396fe1a90c4..51f71c45c4a9 100644
> >>>> --- a/drivers/vhost/vdpa.c
> >>>> +++ b/drivers/vhost/vdpa.c
> >>>> @@ -1257,6 +1257,61 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >>>>
> >>>> }
> >>>>
> >>>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> >>>> + struct vhost_iotlb *iotlb,
> >>>> + struct vhost_iotlb_msg *msg)
> >>>> +{
> >>>> + struct vdpa_device *vdpa = v->vdpa;
> >>>> + const struct vdpa_config_ops *ops = vdpa->config;
> >>>> + u32 asid = iotlb_to_asid(iotlb);
> >>>> + u64 start = msg->iova;
> >>>> + u64 last = start + msg->size - 1;
> >>>> + struct vhost_iotlb_map *map;
> >>>> + int r = 0;
> >>>> +
> >>>> + if (msg->perm || !msg->size)
> >>>> + return -EINVAL;
> >>>> +
> >>>> + map = vhost_iotlb_itree_first(iotlb, start, last);
> >>>> + if (!map)
> >>>> + return -ENOENT;
> >>>> +
> >>>> + if (map->start != start || map->last != last)
> >>>> + return -EINVAL;
> >>>
> >>> I had a question here, if a buggy user space that:
> >>>
> >>> 1) forget to update some of the mappings
> >>> 2) address is wrong
> >>> 3) other cases.
> >>>
> >>> Does this mean the device can DMA to the previous owner?
> >>
> >> Yes, but only to the mappings which were already pinned for DMA for this
> >> device, and the old owner is giving the new owner permission to DMA to that
> >> memory even without bugs.
> >>
> >>> If yes, does
> >>> this have security implications?
> >>
> >> No. The previous owner has given the new owner permission to take over. They
> >> trust each other completely. In the live update case, they are the same; the new
> >> owner is the old owner reincarnated.
> >
> > I understand the processes may trust each other but I meant the kernel
> > may not trust those processes.
>
> If a process holds the key (the fd) then the kernel can trust that is has
> permission to use it. Keys are only passed around voluntarily, unless there
> is a bug.
Looks not, for example, kernel can choose to limit various operations
on a fd, even if the process holds the key
1) privileged process do setup on fd, passing that fd to unprivileged fd
2) unprivileged process can only use a subset of the functions of a fd
In the case of Qemu, it prevents the kernel from the case where for
example malicious guests can escape to Qemu.
In the case of vhost-net, the privilege is the owner. For example, the
following seems to be valid in the case of vhost-net:
1) Two processes (A and B) share a part of the memory
2) A is the owner of the vhost-net who is in charge of building memory
mappings via IOTLB
3) A passess vhost-net fd to process B
>
> > For example:
> >
> > 1) old owner pass fd to new owner which is another process
> > 2) the new owner do VHOST_NEW_OWNER
> > 3) new owner doesn't do remap correctly
> >
> > There's no way for the old owner to remove/unpin the mappings as we
> > have the owner check in IOTLB_UPDATE. Looks like a potential way for
> > DOS.
>
> This is a bug in the second cooperating process, not a DOS. The application
> must fix it. Sometimes you cannot recover from an application bug at run time.
>
> BTW, at one time vfio enforced the concept of an owner, but Alex deleted it.
> It adds no value, because possession of the fd is the key.
> ffed0518d871 ("vfio: remove useless judgement")
This seems to be a great relaxation of the ownership check. I would
like to hear from Michael first.
Thanks
>
> - Steve
>
> >>>> +
> >>>> + /*
> >>>> + * The current implementation does not support the platform iommu
> >>>> + * with use_va. And if !use_va, remap is not necessary.
> >>>> + */
> >>>> + if (!ops->set_map && !ops->dma_map)
> >>>> + return -EINVAL;
> >>>> +
> >>>> + /*
> >>>> + * The current implementation supports set_map but not dma_map.
> >>>> + */
> >>>> + if (!ops->set_map)
> >>>> + return -EINVAL;
> >>>> +
> >>>> + /*
> >>>> + * Do not verify that the new size@uaddr points to the same physical
> >>>> + * pages as the old size@uaddr, because that would take time O(npages),
> >>>> + * and would increase guest down time during live update. If the app
> >>>> + * is buggy and they differ, then the app may corrupt its own memory,
> >>>> + * but no one else's.
> >>>> + */
> >>>> +
> >>>> + /*
> >>>> + * Batch will finish later in BATCH_END by calling set_map for the new
> >>>> + * addresses collected here. Non-batch must do it now.
> >>>> + */
> >>>> + if (!v->in_batch)
> >>>> + r = ops->set_map(vdpa, asid, iotlb);
> >>>> + if (!r)
> >>>> + map->addr = msg->uaddr;
> >>>> +
> >>>> + return r;
> >>>> +}
> >>>> +
> >>>> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >>>> struct vhost_iotlb *iotlb,
> >>>> struct vhost_iotlb_msg *msg)
> >>>> @@ -1336,6 +1391,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> >>>> ops->set_map(vdpa, asid, iotlb);
> >>>> v->in_batch = false;
> >>>> break;
> >>>> + case VHOST_IOTLB_REMAP:
> >>>> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> >>>> + break;
> >>>> default:
> >>>> r = -EINVAL;
> >>>> break;
> >>>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> >>>> index 9177843951e9..35908315ff55 100644
> >>>> --- a/include/uapi/linux/vhost_types.h
> >>>> +++ b/include/uapi/linux/vhost_types.h
> >>>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> >>>> /*
> >>>> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> >>>> * multiple mappings in one go: beginning with
> >>>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> >>>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> >>>> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> >>>> * When one of these two values is used as the message type, the rest
> >>>> * of the fields in the message are ignored. There's no guarantee that
> >>>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> >>>> */
> >>>> #define VHOST_IOTLB_BATCH_BEGIN 5
> >>>> #define VHOST_IOTLB_BATCH_END 6
> >>>> +
> >>>> +/*
> >>>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> >>>> + * The new uaddr must address the same memory object as originally mapped.
> >>>> + * Failure to do so will result in user memory corruption and/or device
> >>>> + * misbehavior. iova and size must match the arguments used to create the
> >>>> + * an existing mapping. Protection is not changed, and perm must be 0.
> >>>> + */
> >>>> +#define VHOST_IOTLB_REMAP 7
> >>>> __u8 type;
> >>>> };
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>> --
> >>>> 2.39.3
> >>>>
> >>>
> >>
> >
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-18 0:45 ` Jason Wang
@ 2024-07-18 19:39 ` Michael S. Tsirkin
2024-07-18 20:19 ` Steven Sistare
0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2024-07-18 19:39 UTC (permalink / raw)
To: Jason Wang
Cc: Steven Sistare, virtualization, linux-kernel, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Alex Williamson
On Thu, Jul 18, 2024 at 08:45:31AM +0800, Jason Wang wrote:
> > > For example:
> > >
> > > 1) old owner pass fd to new owner which is another process
> > > 2) the new owner do VHOST_NEW_OWNER
> > > 3) new owner doesn't do remap correctly
> > >
> > > There's no way for the old owner to remove/unpin the mappings as we
> > > have the owner check in IOTLB_UPDATE. Looks like a potential way for
> > > DOS.
> >
> > This is a bug in the second cooperating process, not a DOS. The application
> > must fix it. Sometimes you cannot recover from an application bug at run time.
> >
> > BTW, at one time vfio enforced the concept of an owner, but Alex deleted it.
> > It adds no value, because possession of the fd is the key.
> > ffed0518d871 ("vfio: remove useless judgement")
>
> This seems to be a great relaxation of the ownership check. I would
> like to hear from Michael first.
>
> Thanks
It could be that the ownership model is too restrictive.
But again, this is changing a security assumption.
Looks like yes another reason to tie this to the switch to iommufd.
--
MST
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-18 19:39 ` Michael S. Tsirkin
@ 2024-07-18 20:19 ` Steven Sistare
2024-07-19 1:01 ` Jason Wang
0 siblings, 1 reply; 35+ messages in thread
From: Steven Sistare @ 2024-07-18 20:19 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang
Cc: virtualization, linux-kernel, Si-Wei Liu, Eugenio Perez Martin,
Xuan Zhuo, Dragos Tatulea, Alex Williamson
On 7/18/2024 3:39 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 18, 2024 at 08:45:31AM +0800, Jason Wang wrote:
>>>> For example:
>>>>
>>>> 1) old owner pass fd to new owner which is another process
>>>> 2) the new owner do VHOST_NEW_OWNER
>>>> 3) new owner doesn't do remap correctly
>>>>
>>>> There's no way for the old owner to remove/unpin the mappings as we
>>>> have the owner check in IOTLB_UPDATE. Looks like a potential way for
>>>> DOS.
>>>
>>> This is a bug in the second cooperating process, not a DOS. The application
>>> must fix it. Sometimes you cannot recover from an application bug at run time.
>>>
>>> BTW, at one time vfio enforced the concept of an owner, but Alex deleted it.
>>> It adds no value, because possession of the fd is the key.
>>> ffed0518d871 ("vfio: remove useless judgement")
>>
>> This seems to be a great relaxation of the ownership check. I would
>> like to hear from Michael first.
>>
>> Thanks
>
> It could be that the ownership model is too restrictive.
> But again, this is changing a security assumption.
> Looks like yes another reason to tie this to the switch to iommufd.
iommufd, like vfio, does not impose an ownership requirement. If vdpa has a
stricter requirement, such as allowing the vhost-net sharing that Jason
described, then we need to surface that now, and extend it to allow change
of ownership for live update.
Is the vhost-net scenario currently used, or aspirational?
Copying from Jason's email:
1) Two processes (A and B) share a part of the memory
2) A is the owner of the vhost-net who is in charge of building memory
mappings via IOTLB
3) A passes vhost-net fd to process B
- Steve
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP
2024-07-18 20:19 ` Steven Sistare
@ 2024-07-19 1:01 ` Jason Wang
0 siblings, 0 replies; 35+ messages in thread
From: Jason Wang @ 2024-07-19 1:01 UTC (permalink / raw)
To: Steven Sistare
Cc: Michael S. Tsirkin, virtualization, linux-kernel, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Alex Williamson
On Fri, Jul 19, 2024 at 4:19 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/18/2024 3:39 PM, Michael S. Tsirkin wrote:
> > On Thu, Jul 18, 2024 at 08:45:31AM +0800, Jason Wang wrote:
> >>>> For example:
> >>>>
> >>>> 1) old owner pass fd to new owner which is another process
> >>>> 2) the new owner do VHOST_NEW_OWNER
> >>>> 3) new owner doesn't do remap correctly
> >>>>
> >>>> There's no way for the old owner to remove/unpin the mappings as we
> >>>> have the owner check in IOTLB_UPDATE. Looks like a potential way for
> >>>> DOS.
> >>>
> >>> This is a bug in the second cooperating process, not a DOS. The application
> >>> must fix it. Sometimes you cannot recover from an application bug at run time.
> >>>
> >>> BTW, at one time vfio enforced the concept of an owner, but Alex deleted it.
> >>> It adds no value, because possession of the fd is the key.
> >>> ffed0518d871 ("vfio: remove useless judgement")
> >>
> >> This seems to be a great relaxation of the ownership check. I would
> >> like to hear from Michael first.
> >>
> >> Thanks
> >
> > It could be that the ownership model is too restrictive.
> > But again, this is changing a security assumption.
> > Looks like yes another reason to tie this to the switch to iommufd.
>
> iommufd, like vfio, does not impose an ownership requirement. If vdpa has a
> stricter requirement, such as allowing the vhost-net sharing that Jason
> described, then we need to surface that now, and extend it to allow change
> of ownership for live update.
>
> Is the vhost-net scenario currently used, or aspirational?
This question is very hard. But for my understanding, what Michael meant is:
1) the current proposal changes the security assumption
2) iommufd require uAPI changes which may imply security assumption
changes or other user space noticeable behaviour changes
My understanding is:
If we want to go without iommufd, we may need to find a way to stick
to the assumption (not sure how hard it is). If we want to go with
iommufd, we probably won't worry about that, as you've pointed out,
there's no ownership requirement.
> Copying from Jason's email:
> 1) Two processes (A and B) share a part of the memory
> 2) A is the owner of the vhost-net who is in charge of building memory
> mappings via IOTLB
> 3) A passes vhost-net fd to process B
>
> - Steve
>
Thanks
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 0/7] vdpa live update
2024-07-18 0:33 ` Jason Wang
@ 2024-07-20 21:34 ` Steven Sistare
0 siblings, 0 replies; 35+ messages in thread
From: Steven Sistare @ 2024-07-20 21:34 UTC (permalink / raw)
To: Jason Wang
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On 7/17/2024 8:33 PM, Jason Wang wrote:
> On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare
> <steven.sistare@oracle.com> wrote:
>>
>> On 7/16/2024 1:30 AM, Jason Wang wrote:
>>> On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
>>> <steven.sistare@oracle.com> wrote:
>>>>
>>>> On 7/14/2024 10:14 PM, Jason Wang wrote:
>>>>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>
>>>>>> Live update is a technique wherein an application saves its state, exec's
>>>>>> to an updated version of itself, and restores its state. Clients of the
>>>>>> application experience a brief suspension of service, on the order of
>>>>>> 100's of milliseconds, but are otherwise unaffected.
>>>>>>
>>>>>> Define and implement interfaces that allow vdpa devices to be preserved
>>>>>> across fork or exec, to support live update for applications such as QEMU.
>>>>>> The device must be suspended during the update, but its DMA mappings are
>>>>>> preserved, so the suspension is brief.
>>>>>>
>>>>>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
>>>>>> accounting from one process to another.
>>>>>>
>>>>>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
>>>>>> VHOST_NEW_OWNER is supported.
>>>>>>
>>>>>> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
>>>>>> address in the new process.
>>>>>>
>>>>>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
>>>>>> VHOST_IOTLB_REMAP is supported and required. Some devices do not
>>>>>> require it, because the userland address of each DMA mapping is discarded
>>>>>> after being translated to a physical address.
>>>>>>
>>>>>> Here is a pseudo-code sequence for performing live update, based on
>>>>>> suspend + reset because resume is not yet widely available. The vdpa device
>>>>>> descriptor, fd, remains open across the exec.
>>>>>>
>>>>>> ioctl(fd, VHOST_VDPA_SUSPEND)
>>>>>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>>>>>
>>>>> I don't understand why we need a reset after suspend, it looks to me
>>>>> the previous suspend became meaningless.
>>>>
>>>> The suspend guarantees completion of in-progress DMA. At least, that is
>>>> my interpretation of why that is done for live migration in QEMU, which
>>>> also does suspend + reset + re-create. I am following the live migration
>>>> model.
>>>
>>> Yes, but any reason we need a reset after the suspension?
>>
>> Probably not. I found it cleanest to call reset and let new qemu configure the
>> device as it always does during startup, rather than altering those code paths
>> to skip the kernel calls.
>
> If we care about the downtime, I think avoiding a reset should be faster.
Agreed. I want to try it some time, but I think what I have now using reset is
decent enough for version 1.
- Steve
>> So, consider this to be just one of several possible
>> userland algorithms.
>>
>> - Steve
>
> Thanks
>
>>
>>>>>> exec
>>>>>>
>>>>>> ioctl(fd, VHOST_NEW_OWNER)
>>>>>>
>>>>>> issue ioctls to re-create vrings
>>>>>>
>>>>>> if VHOST_BACKEND_F_IOTLB_REMAP
>>>>>
>>>>> So the idea is for a device that is using a virtual address, it
>>>>> doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
>>>>
>>>> Actually the reverse: if the device translates virtual to physical when
>>>> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
>>>> is not needed.
>>>
>>> Ok.
>>>
>>>>
>>>>>> foreach dma mapping
>>>>>> write(fd, {VHOST_IOTLB_REMAP, new_addr})
>>>>>>
>>>>>> ioctl(fd, VHOST_VDPA_SET_STATUS,
>>>>>> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>>>>>
>>>>> From API level, this seems to be asymmetric as we have suspending but
>>>>> not resuming?
>>>>
>>>> Again, I am just following the path taken by live migration.
>>>> I will be happy to use resume when the devices and QEMU support it.
>>>> The decision to use reset vs resume should not affect the definition
>>>> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
>>>>
>>>> - Steve
>>>
>>> Thanks
>>>
>>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER
2024-07-17 18:28 ` Steven Sistare
@ 2024-07-22 7:26 ` Jason Wang
0 siblings, 0 replies; 35+ messages in thread
From: Jason Wang @ 2024-07-22 7:26 UTC (permalink / raw)
To: Steven Sistare
Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea
On Thu, Jul 18, 2024 at 2:28 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 7/16/2024 1:16 AM, Jason Wang wrote:
> > On Mon, Jul 15, 2024 at 10:27 PM Steven Sistare
> > <steven.sistare@oracle.com> wrote:
> >>
> >> On 7/14/2024 10:26 PM, Jason Wang wrote:
> >>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>>>
> >>>> Add an ioctl to transfer file descriptor ownership and pinned memory
> >>>> accounting from one process to another.
> >>>>
> >>>> This is more efficient than VHOST_RESET_OWNER followed by VHOST_SET_OWNER,
> >>>> as that would unpin all physical pages, requiring them to be repinned in
> >>>> the new process. That would cost multiple seconds for large memories, and
> >>>> be incurred during a virtual machine's pause time during live update.
> >>>>
> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>> ---
> >>>> drivers/vhost/vdpa.c | 41 ++++++++++++++++++++++++++++++++++++++
> >>>> drivers/vhost/vhost.c | 15 ++++++++++++++
> >>>> drivers/vhost/vhost.h | 1 +
> >>>> include/uapi/linux/vhost.h | 10 ++++++++++
> >>>> 4 files changed, 67 insertions(+)
> >>>>
> >>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>> index b49e5831b3f0..5cf55ca4ec02 100644
> >>>> --- a/drivers/vhost/vdpa.c
> >>>> +++ b/drivers/vhost/vdpa.c
> >>>> @@ -632,6 +632,44 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
> >>>> return ret;
> >>>> }
> >>>>
> >>>> +static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
> >>>> +{
> >>>> + int r;
> >>>> + struct vhost_dev *vdev = &v->vdev;
> >>>> + struct mm_struct *mm_old = vdev->mm;
> >>>> + struct mm_struct *mm_new = current->mm;
> >>>> + long pinned_vm = v->pinned_vm;
> >>>> + unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> >>>> +
> >>>> + if (!mm_old)
> >>>> + return -EINVAL;
> >>>> + mmgrab(mm_old);
> >>>> +
> >>>> + if (!v->vdpa->use_va &&
> >>>> + pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit) {
> >>>> + r = -ENOMEM;
> >>>> + goto out;
> >>>> + }
> >>>
> >>> So this seems to allow an arbitrary process to execute this. Seems to be unsafe.
> >>>
> >>> I wonder if we need to add some checks here, maybe PID or other stuff
> >>> to only allow the owner process to do this.
> >>
> >> The original owner must send the file descriptor to the new owner.
> >
> > This seems not to be in the steps you put in the cover letter.
>
> It's there:
> "The vdpa device descriptor, fd, remains open across the exec."
>
> But, I can say more about how fd visibility constitutes permission to changer
> owner in this commit message.
Yes, that would be great.
>
> >> That constitutes permission to take ownership.
> >
> > This seems like a relaxed version of the reset_owner:
> >
> > Currently, reset_owner have the following check:
>
> Not relaxed, just different. A process cannot do anything with fd if it
> is not the owner, *except* for becoming the new owner. Holding the fd is
> like holding a key.
I meant it kind of "defeats" the effort of VHOST_NET_RESET_OWNER ...
Thanks
>
> - Steve
>
> > /* Caller should have device mutex */
> > long vhost_dev_check_owner(struct vhost_dev *dev)
> > {
> > /* Are you the owner? If not, I don't think you mean to do that */
> > return dev->mm == current->mm ? 0 : -EPERM;
> > }
> > EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
> >
> > It means even if the fd is passed to some other process, the reset
> > owner won't work there.
> >
> > Thanks
> >
> >>
> >>>> + r = vhost_vdpa_bind_mm(v, mm_new);
> >>>> + if (r)
> >>>> + goto out;
> >>>> +
> >>>> + r = vhost_dev_new_owner(vdev);
> >>>> + if (r) {
> >>>> + vhost_vdpa_bind_mm(v, mm_old);
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + if (!v->vdpa->use_va) {
> >>>> + atomic64_sub(pinned_vm, &mm_old->pinned_vm);
> >>>> + atomic64_add(pinned_vm, &mm_new->pinned_vm);
> >>>> + }
> >>>> +
> >>>> +out:
> >>>> + mmdrop(mm_old);
> >>>> + return r;
> >>>> +}
> >>>> +
> >>>> static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
> >>>> void __user *argp)
> >>>> {
> >>>> @@ -876,6 +914,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >>>> case VHOST_VDPA_RESUME:
> >>>> r = vhost_vdpa_resume(v);
> >>>> break;
> >>>> + case VHOST_NEW_OWNER:
> >>>> + r = vhost_vdpa_new_owner(v);
> >>>> + break;
> >>>> default:
> >>>> r = vhost_dev_ioctl(&v->vdev, cmd, argp);
> >>>> if (r == -ENOIOCTLCMD)
> >>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >>>> index b60955682474..ab40ae50552f 100644
> >>>> --- a/drivers/vhost/vhost.c
> >>>> +++ b/drivers/vhost/vhost.c
> >>>> @@ -963,6 +963,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> >>>> }
> >>>> EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
> >>>>
> >>>> +/* Caller should have device mutex */
> >>>> +long vhost_dev_new_owner(struct vhost_dev *dev)
> >>>> +{
> >>>> + if (dev->mm == current->mm)
> >>>> + return -EBUSY;
> >>>> +
> >>>> + if (!vhost_dev_has_owner(dev))
> >>>> + return -EINVAL;
> >>>> +
> >>>> + vhost_detach_mm(dev);
> >>>> + vhost_attach_mm(dev);
> >>>
> >>> This seems to do nothing unless I miss something.
> >>
> >> vhost_detach mm drops dev->mm.
> >> vhost_attach_mm grabs current->mm.
> >>
> >> - Steve
> >>
> >
>
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2024-07-22 7:26 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-12 13:18 [PATCH V2 0/7] vdpa live update Steve Sistare
2024-07-12 13:18 ` [PATCH V2 1/7] vhost-vdpa: count pinned memory Steve Sistare
2024-07-12 13:18 ` [PATCH V2 2/7] vhost-vdpa: pass mm to bind Steve Sistare
2024-07-12 13:18 ` [PATCH V2 3/7] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
2024-07-15 2:26 ` Jason Wang
2024-07-15 9:06 ` Michael S. Tsirkin
2024-07-15 14:27 ` Steven Sistare
2024-07-16 5:16 ` Jason Wang
2024-07-17 18:28 ` Steven Sistare
2024-07-22 7:26 ` Jason Wang
2024-07-15 9:07 ` Michael S. Tsirkin
2024-07-15 14:29 ` Steven Sistare
2024-07-15 14:38 ` Michael S. Tsirkin
2024-07-15 15:38 ` Steven Sistare
2024-07-12 13:18 ` [PATCH V2 4/7] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
2024-07-15 2:31 ` Jason Wang
2024-07-15 14:27 ` Steven Sistare
2024-07-12 13:18 ` [PATCH V2 5/7] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
2024-07-15 2:34 ` Jason Wang
2024-07-15 14:28 ` Steven Sistare
2024-07-16 5:28 ` Jason Wang
2024-07-17 18:29 ` Steven Sistare
2024-07-18 0:45 ` Jason Wang
2024-07-18 19:39 ` Michael S. Tsirkin
2024-07-18 20:19 ` Steven Sistare
2024-07-19 1:01 ` Jason Wang
2024-07-12 13:18 ` [PATCH V2 6/7] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP Steve Sistare
2024-07-12 13:18 ` [PATCH V2 7/7] vdpa/mlx5: new owner capability Steve Sistare
2024-07-12 14:06 ` [PATCH V2 0/7] vdpa live update Steven Sistare
2024-07-15 2:14 ` Jason Wang
2024-07-15 14:28 ` Steven Sistare
2024-07-16 5:30 ` Jason Wang
2024-07-17 18:29 ` Steven Sistare
2024-07-18 0:33 ` Jason Wang
2024-07-20 21:34 ` Steven Sistare
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).