virtualization.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
* [RFC V1 00/13] vdpa live update
@ 2024-01-10 20:40 Steve Sistare
  2024-01-10 20:40 ` [RFC V1 01/13] vhost-vdpa: count pinned memory Steve Sistare
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of 
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as qemu.
The device must be suspended during the update, but its dma mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each dma mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet available.  The vdpa device
descriptor, fd, remains open across the exec.

  ioctl(fd, VHOST_VDPA_SUSPEND)
  ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
  exec 

  ioctl(fd, VHOST_NEW_OWNER)

  issue ioctls to re-create vrings

  if VHOST_BACKEND_F_IOTLB_REMAP
      foreach dma mapping
          write(fd, {VHOST_IOTLB_REMAP, new_addr})

  ioctl(fd, VHOST_VDPA_SET_STATUS,
            ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)


Steve Sistare (13):
  vhost-vdpa: count pinned memory
  vhost-vdpa: pass mm to bind
  vhost-vdpa: VHOST_NEW_OWNER
  vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
  vhost-vdpa: VHOST_IOTLB_REMAP
  vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
  vhost-vdpa: flush workers on suspend
  vduse: flush workers on suspend
  vdpa_sim: reset must not run
  vdpa_sim: flush workers on suspend
  vdpa/mlx5: new owner capability
  vdpa_sim: new owner capability
  vduse: new owner capability

 drivers/vdpa/mlx5/net/mlx5_vnet.c  |   3 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c   |  24 ++++++-
 drivers/vdpa/vdpa_user/vduse_dev.c |  32 +++++++++
 drivers/vhost/vdpa.c               | 101 +++++++++++++++++++++++++++--
 drivers/vhost/vhost.c              |  15 +++++
 drivers/vhost/vhost.h              |   1 +
 include/uapi/linux/vhost.h         |  10 +++
 include/uapi/linux/vhost_types.h   |  15 ++++-
 8 files changed, 191 insertions(+), 10 deletions(-)

-- 
2.39.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC V1 01/13] vhost-vdpa: count pinned memory
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 22:24   ` Michael S. Tsirkin
  2024-01-10 20:40 ` [RFC V1 02/13] vhost-vdpa: pass mm to bind Steve Sistare
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

Remember the count of pinned memory for the device.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..10fb95bcca1a 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -59,6 +59,7 @@ struct vhost_vdpa {
 	int in_batch;
 	struct vdpa_iova_range range;
 	u32 batch_asid;
+	long pinned_vm;
 };
 
 static DEFINE_IDA(vhost_vdpa_ida);
@@ -893,6 +894,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 			unpin_user_page(page);
 		}
 		atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
+		v->pinned_vm -= PFN_DOWN(map->size);
 		vhost_vdpa_general_unmap(v, map, asid);
 		vhost_iotlb_map_free(iotlb, map);
 	}
@@ -975,9 +977,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 		return r;
 	}
 
-	if (!vdpa->use_va)
+	if (!vdpa->use_va) {
 		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
-
+		v->pinned_vm += PFN_DOWN(size);
+	}
 	return 0;
 }
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 02/13] vhost-vdpa: pass mm to bind
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
  2024-01-10 20:40 ` [RFC V1 01/13] vhost-vdpa: count pinned memory Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 20:40 ` [RFC V1 03/13] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

Pass the target mm to vhost_vdpa_bind_mm.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 10fb95bcca1a..2269988d6d33 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -248,7 +248,7 @@ static int vhost_vdpa_reset(struct vhost_vdpa *v)
 	return _compat_vdpa_reset(v);
 }
 
-static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
+static long vhost_vdpa_bind_mm(struct vhost_vdpa *v, struct mm_struct *mm)
 {
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
@@ -256,7 +256,7 @@ static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
 	if (!vdpa->use_va || !ops->bind_mm)
 		return 0;
 
-	return ops->bind_mm(vdpa, v->vdev.mm);
+	return ops->bind_mm(vdpa, mm);
 }
 
 static void vhost_vdpa_unbind_mm(struct vhost_vdpa *v)
@@ -855,7 +855,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 
 	switch (cmd) {
 	case VHOST_SET_OWNER:
-		r = vhost_vdpa_bind_mm(v);
+		r = vhost_vdpa_bind_mm(v, v->vdev.mm);
 		if (r)
 			vhost_dev_reset_owner(d, NULL);
 		break;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 03/13] vhost-vdpa: VHOST_NEW_OWNER
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
  2024-01-10 20:40 ` [RFC V1 01/13] vhost-vdpa: count pinned memory Steve Sistare
  2024-01-10 20:40 ` [RFC V1 02/13] vhost-vdpa: pass mm to bind Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 20:40 ` [RFC V1 04/13] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

Add an ioctl to transfer file descriptor ownership and pinned memory
accounting from one process to another.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c       | 37 +++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.c      | 15 +++++++++++++++
 drivers/vhost/vhost.h      |  1 +
 include/uapi/linux/vhost.h | 10 ++++++++++
 4 files changed, 63 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 2269988d6d33..eb3a95e703b0 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -613,6 +613,40 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
 	return ops->resume(vdpa);
 }
 
+static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
+{
+	int r;
+	struct vhost_dev *vdev = &v->vdev;
+	struct mm_struct *mm_old = vdev->mm;
+	struct mm_struct *mm_new = current->mm;
+	long pinned_vm = v->pinned_vm;
+	unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
+
+	if (!mm_old)
+		return -EINVAL;
+
+	if (!v->vdpa->use_va &&
+	    pinned_vm + atomic64_read(&mm_new->pinned_vm) > lock_limit)
+		return -ENOMEM;
+
+	r = vhost_vdpa_bind_mm(v, mm_new);
+	if (r)
+		return r;
+
+	r = vhost_dev_new_owner(vdev);
+	if (r) {
+		vhost_vdpa_bind_mm(v, mm_old);
+		return r;
+	}
+
+	if (!v->vdpa->use_va) {
+		atomic64_sub(pinned_vm, &mm_old->pinned_vm);
+		atomic64_add(pinned_vm, &mm_new->pinned_vm);
+	}
+
+	return r;
+}
+
 static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
 				   void __user *argp)
 {
@@ -843,6 +877,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 	case VHOST_VDPA_RESUME:
 		r = vhost_vdpa_resume(v);
 		break;
+	case VHOST_NEW_OWNER:
+		r = vhost_vdpa_new_owner(v);
+		break;
 	default:
 		r = vhost_dev_ioctl(&v->vdev, cmd, argp);
 		if (r == -ENOIOCTLCMD)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..0ce7ee9834f4 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -907,6 +907,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
+/* Caller should have device mutex */
+long vhost_dev_new_owner(struct vhost_dev *dev)
+{
+	if (dev->mm == current->mm)
+		return -EBUSY;
+
+	if (!vhost_dev_has_owner(dev))
+		return -EINVAL;
+
+	vhost_detach_mm(dev);
+	vhost_attach_mm(dev);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_dev_new_owner);
+
 static struct vhost_iotlb *iotlb_alloc(void)
 {
 	return vhost_iotlb_alloc(max_iotlb_entries,
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index f60d5f7bef94..cd0dab21d99e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -185,6 +185,7 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
 		    int (*msg_handler)(struct vhost_dev *dev, u32 asid,
 				       struct vhost_iotlb_msg *msg));
 long vhost_dev_set_owner(struct vhost_dev *dev);
+long vhost_dev_new_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
 struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 649560c685f1..5e3cdce4c0cf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -123,6 +123,16 @@
 #define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
 #define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64)
 
+/* Set current process as the new owner of this file descriptor.  The fd must
+ * already be owned, via a prior call to VHOST_SET_OWNER.  The pinned memory
+ * count is transferred from the previous to the new owner.
+ * Errors:
+ *   EINVAL: not owned
+ *   EBUSY:  caller is already the owner
+ *   ENOMEM: RLIMIT_MEMLOCK exceeded
+ */
+#define VHOST_NEW_OWNER _IO(VHOST_VIRTIO, 0x27)
+
 /* VHOST_NET specific defines */
 
 /* Attach virtio net ring to a raw socket, or tap device.
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 04/13] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (2 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 03/13] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 20:40 ` [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

Add the VHOST_BACKEND_F_NEW_OWNER backend capability, which indicates that
VHOST_NEW_OWNER is supported.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c             | 7 ++++++-
 include/uapi/linux/vhost_types.h | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index eb3a95e703b0..faed6471934a 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -621,6 +621,10 @@ static long vhost_vdpa_new_owner(struct vhost_vdpa *v)
 	struct mm_struct *mm_new = current->mm;
 	long pinned_vm = v->pinned_vm;
 	unsigned long lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
+	u64 features = vhost_vdpa_get_backend_features(v);
+
+	if (!(features & BIT_ULL(VHOST_BACKEND_F_NEW_OWNER)))
+		return -EOPNOTSUPP;
 
 	if (!mm_old)
 		return -EINVAL;
@@ -784,7 +788,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 				 BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
 				 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 				 BIT_ULL(VHOST_BACKEND_F_RESUME) |
-				 BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
+				 BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
+				 BIT_ULL(VHOST_BACKEND_F_NEW_OWNER)))
 			return -EOPNOTSUPP;
 		if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
 		     !vhost_vdpa_can_suspend(v))
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index d7656908f730..9177843951e9 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -192,5 +192,7 @@ struct vhost_vdpa_iova_range {
 #define VHOST_BACKEND_F_DESC_ASID    0x7
 /* IOTLB don't flush memory mapping across device reset */
 #define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
+/* Supports VHOST_NEW_OWNER */
+#define VHOST_BACKEND_F_NEW_OWNER  0x9
 
 #endif
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (3 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 04/13] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-11  3:08   ` Jason Wang
  2024-01-16 18:14   ` Eugenio Perez Martin
  2024-01-10 20:40 ` [RFC V1 06/13] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP Steve Sistare
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

When device ownership is passed to a new process via VHOST_NEW_OWNER,
some devices need to know the new userland addresses of the dma mappings.
Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
of a mapping.  The new uaddr must address the same memory object as
originally mapped.

The user must suspend the device before the old address is invalidated,
and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
requirement is not enforced by the API.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c             | 34 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vhost_types.h | 11 ++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index faed6471934a..ec5ca20bd47d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 
 }
 
+static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
+					  struct vhost_iotlb *iotlb,
+					  struct vhost_iotlb_msg *msg)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+	u32 asid = iotlb_to_asid(iotlb);
+	u64 start = msg->iova;
+	u64 last = start + msg->size - 1;
+	struct vhost_iotlb_map *map;
+	int r = 0;
+
+	if (msg->perm || !msg->size)
+		return -EINVAL;
+
+	map = vhost_iotlb_itree_first(iotlb, start, last);
+	if (!map)
+		return -ENOENT;
+
+	if (map->start != start || map->last != last)
+		return -EINVAL;
+
+	/* batch will finish with remap.  non-batch must do it now. */
+	if (!v->in_batch)
+		r = ops->set_map(vdpa, asid, iotlb);
+	if (!r)
+		map->addr = msg->uaddr;
+
+	return r;
+}
+
 static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb *iotlb,
 					   struct vhost_iotlb_msg *msg)
@@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
 			ops->set_map(vdpa, asid, iotlb);
 		v->in_batch = false;
 		break;
+	case VHOST_IOTLB_REMAP:
+		r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 9177843951e9..35908315ff55 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
 /*
  * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
  * multiple mappings in one go: beginning with
- * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
+ * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
  * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
  * When one of these two values is used as the message type, the rest
  * of the fields in the message are ignored. There's no guarantee that
@@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
  */
 #define VHOST_IOTLB_BATCH_BEGIN    5
 #define VHOST_IOTLB_BATCH_END      6
+
+/*
+ * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
+ * The new uaddr must address the same memory object as originally mapped.
+ * Failure to do so will result in user memory corruption and/or device
+ * misbehavior.  iova and size must match the arguments used to create the
+ * an existing mapping.  Protection is not changed, and perm must be 0.
+ */
+#define VHOST_IOTLB_REMAP          7
 	__u8 type;
 };
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 06/13] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (4 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 20:40 ` [RFC V1 07/13] vhost-vdpa: flush workers on suspend Steve Sistare
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

Add the VHOST_BACKEND_F_IOTLB_REMAP backend capability, which indicates
that VHOST_IOTLB_REMAP is supported.

If VHOST_BACKEND_F_IOTLB_REMAP is advertised, then the user must call
VHOST_IOTLB_REMAP after ownership of a device is transferred to a new
process via VHOST_NEW_OWNER.  Disabling the feature during negotiation
does not negate this requirement.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c             | 8 +++++++-
 include/uapi/linux/vhost_types.h | 2 ++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ec5ca20bd47d..8fe1562d24af 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -789,7 +789,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 				 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 				 BIT_ULL(VHOST_BACKEND_F_RESUME) |
 				 BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
-				 BIT_ULL(VHOST_BACKEND_F_NEW_OWNER)))
+				 BIT_ULL(VHOST_BACKEND_F_NEW_OWNER) |
+				 BIT_ULL(VHOST_BACKEND_F_IOTLB_REMAP)))
 			return -EOPNOTSUPP;
 		if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
 		     !vhost_vdpa_can_suspend(v))
@@ -1229,11 +1230,16 @@ static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
 	u64 start = msg->iova;
 	u64 last = start + msg->size - 1;
 	struct vhost_iotlb_map *map;
+	u64 features;
 	int r = 0;
 
 	if (msg->perm || !msg->size)
 		return -EINVAL;
 
+	features = ops->get_backend_features(vdpa);
+	if (!(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_REMAP)))
+		return -EOPNOTSUPP;
+
 	map = vhost_iotlb_itree_first(iotlb, start, last);
 	if (!map)
 		return -ENOENT;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 35908315ff55..7e79e9bd0f7b 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -203,5 +203,7 @@ struct vhost_vdpa_iova_range {
 #define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
 /* Supports VHOST_NEW_OWNER */
 #define VHOST_BACKEND_F_NEW_OWNER  0x9
+/* Supports VHOST_IOTLB_REMAP */
+#define VHOST_BACKEND_F_IOTLB_REMAP  0xa
 
 #endif
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 07/13] vhost-vdpa: flush workers on suspend
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (5 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 06/13] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-11  3:09   ` Jason Wang
  2024-01-10 20:40 ` [RFC V1 08/13] vduse: " Steve Sistare
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

To pass ownership of a live vdpa device to a new process, the user
suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
mm.  Flush workers in suspend to guarantee that no worker sees the new
mm and old VA in between.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vhost/vdpa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 8fe1562d24af..9673e8e20d11 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
 {
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
+	struct vhost_dev *vdev = &v->vdev;
 
 	if (!ops->suspend)
 		return -EOPNOTSUPP;
 
+	if (vdev->use_worker)
+		vhost_dev_flush(vdev);
+
 	return ops->suspend(vdpa);
 }
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 08/13] vduse: flush workers on suspend
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (6 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 07/13] vhost-vdpa: flush workers on suspend Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-11  3:09   ` Jason Wang
  2024-01-10 20:40 ` [RFC V1 09/13] vdpa_sim: reset must not run Steve Sistare
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

To pass ownership of a live vdpa device to a new process, the user
suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
mm.  Flush workers in suspend to guarantee that no worker sees the new
mm and old VA in between.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0ddd4b8abecb..6b25457a037d 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -472,6 +472,18 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 	up_write(&dev->rwsem);
 }
 
+static void vduse_flush_work(struct vduse_dev *dev)
+{
+	flush_work(&dev->inject);
+
+	for (int i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = dev->vqs[i];
+
+		flush_work(&vq->inject);
+		flush_work(&vq->kick);
+	}
+}
+
 static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
 				u64 desc_area, u64 driver_area,
 				u64 device_area)
@@ -713,6 +725,17 @@ static int vduse_vdpa_reset(struct vdpa_device *vdpa)
 	return ret;
 }
 
+static int vduse_vdpa_suspend(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	down_write(&dev->rwsem);
+	vduse_flush_work(dev);
+	up_write(&dev->rwsem);
+
+	return 0;
+}
+
 static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
 {
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -794,6 +817,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.set_vq_affinity	= vduse_vdpa_set_vq_affinity,
 	.get_vq_affinity	= vduse_vdpa_get_vq_affinity,
 	.reset			= vduse_vdpa_reset,
+	.suspend		= vduse_vdpa_suspend,
 	.set_map		= vduse_vdpa_set_map,
 	.free			= vduse_vdpa_free,
 };
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 09/13] vdpa_sim: reset must not run
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (7 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 08/13] vduse: " Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-16 18:33   ` Eugenio Perez Martin
  2024-01-10 20:40 ` [RFC V1 10/13] vdpa_sim: flush workers on suspend Steve Sistare
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

vdpasim_do_reset sets running to true, which is wrong, as it allows
vdpasim_kick_vq to post work requests before the device has been
configured.  To fix, do not set running until VIRTIO_CONFIG_S_FEATURES_OK
is set.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index be2925d0d283..6304cb0b4770 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -160,7 +160,7 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags)
 		}
 	}
 
-	vdpasim->running = true;
+	vdpasim->running = false;
 	spin_unlock(&vdpasim->iommu_lock);
 
 	vdpasim->features = 0;
@@ -483,6 +483,7 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
 
 	mutex_lock(&vdpasim->mutex);
 	vdpasim->status = status;
+	vdpasim->running = (status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
 	mutex_unlock(&vdpasim->mutex);
 }
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 10/13] vdpa_sim: flush workers on suspend
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (8 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 09/13] vdpa_sim: reset must not run Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-16 18:57   ` Eugenio Perez Martin
  2024-01-10 20:40 ` [RFC V1 11/13] vdpa/mlx5: new owner capability Steve Sistare
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

To pass ownership of a live vdpa device to a new process, the user
suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
mm.  Flush workers in suspend to guarantee that no worker sees the new
mm and old VA in between.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 6304cb0b4770..8734834983cb 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -74,6 +74,17 @@ static void vdpasim_worker_change_mm_sync(struct vdpasim *vdpasim,
 	kthread_flush_work(work);
 }
 
+static void flush_work_fn(struct kthread_work *work) {}
+
+static void vdpasim_flush_work(struct vdpasim *vdpasim)
+{
+	struct kthread_work work;
+
+	kthread_init_work(&work, flush_work_fn);
+	kthread_queue_work(vdpasim->worker, &work);
+	kthread_flush_work(&work);
+}
+
 static struct vdpasim *vdpa_to_sim(struct vdpa_device *vdpa)
 {
 	return container_of(vdpa, struct vdpasim, vdpa);
@@ -512,6 +523,8 @@ static int vdpasim_suspend(struct vdpa_device *vdpa)
 	vdpasim->running = false;
 	mutex_unlock(&vdpasim->mutex);
 
+	vdpasim_flush_work(vdpasim);
+
 	return 0;
 }
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 11/13] vdpa/mlx5: new owner capability
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (9 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 10/13] vdpa_sim: flush workers on suspend Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 20:40 ` [RFC V1 12/13] vdpa_sim: " Steve Sistare
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

The mlx5 vdpa device supports ownership transfer to a new process, so
advertise VHOST_BACKEND_F_NEW_OWNER.  User virtual addresses are not
used after they are initially translated to physical, so VHOST_IOTLB_REMAP
is not required, hence VHOST_BACKEND_F_IOTLB_REMAP is not advertised.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 26ba7da6b410..26f24fb0e160 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2562,7 +2562,8 @@ static void unregister_link_notifier(struct mlx5_vdpa_net *ndev)
 
 static u64 mlx5_vdpa_get_backend_features(const struct vdpa_device *vdpa)
 {
-	return BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK);
+	return BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
+	       BIT_ULL(VHOST_BACKEND_F_NEW_OWNER);
 }
 
 static int mlx5_vdpa_set_driver_features(struct vdpa_device *vdev, u64 features)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 12/13] vdpa_sim: new owner capability
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (10 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 11/13] vdpa/mlx5: new owner capability Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-10 20:40 ` [RFC V1 13/13] vduse: " Steve Sistare
  2024-01-11  2:55 ` [RFC V1 00/13] vdpa live update Jason Wang
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

The vdpa_sim device supports ownership transfer to a new process, so
advertise VHOST_BACKEND_F_NEW_OWNER.  User virtual addresses are used
by the software iommu, so VHOST_IOTLB_REMAP is required after
VHOST_NEW_OWNER, so advertise VHOST_BACKEND_F_IOTLB_REMAP.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 8734834983cb..d037869d8a89 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -430,7 +430,13 @@ static u64 vdpasim_get_device_features(struct vdpa_device *vdpa)
 
 static u64 vdpasim_get_backend_features(const struct vdpa_device *vdpa)
 {
-	return BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK);
+	u64 features = BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) |
+		       BIT_ULL(VHOST_BACKEND_F_NEW_OWNER);
+
+	if (use_va)
+		features += BIT_ULL(VHOST_BACKEND_F_IOTLB_REMAP);
+
+	return features;
 }
 
 static int vdpasim_set_driver_features(struct vdpa_device *vdpa, u64 features)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC V1 13/13] vduse: new owner capability
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (11 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 12/13] vdpa_sim: " Steve Sistare
@ 2024-01-10 20:40 ` Steve Sistare
  2024-01-11  2:55 ` [RFC V1 00/13] vdpa live update Jason Wang
  13 siblings, 0 replies; 33+ messages in thread
From: Steve Sistare @ 2024-01-10 20:40 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Si-Wei Liu, Eugenio Perez Martin,
	Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji, Steve Sistare

The vduse device supports ownership transfer to a new process, so
advertise VHOST_BACKEND_F_NEW_OWNER.  User virtual addresses are used
by the software iommu, so VHOST_IOTLB_REMAP is required after
VHOST_NEW_OWNER, so advertise VHOST_BACKEND_F_IOTLB_REMAP.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 6b25457a037d..67815f6391db 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -25,6 +25,7 @@
 #include <linux/sched/mm.h>
 #include <uapi/linux/vduse.h>
 #include <uapi/linux/vdpa.h>
+#include <uapi/linux/vhost_types.h>
 #include <uapi/linux/virtio_config.h>
 #include <uapi/linux/virtio_ids.h>
 #include <uapi/linux/virtio_blk.h>
@@ -608,6 +609,12 @@ static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
 	return dev->vq_align;
 }
 
+static u64 vduse_vdpa_get_backend_features(const struct vdpa_device *vdpa)
+{
+	return BIT_ULL(VHOST_BACKEND_F_IOTLB_REMAP) |
+	       BIT_ULL(VHOST_BACKEND_F_NEW_OWNER);
+}
+
 static u64 vduse_vdpa_get_device_features(struct vdpa_device *vdpa)
 {
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -801,6 +808,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.set_vq_state		= vduse_vdpa_set_vq_state,
 	.get_vq_state		= vduse_vdpa_get_vq_state,
 	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_backend_features	= vduse_vdpa_get_backend_features,
 	.get_device_features	= vduse_vdpa_get_device_features,
 	.set_driver_features	= vduse_vdpa_set_driver_features,
 	.get_driver_features	= vduse_vdpa_get_driver_features,
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC V1 01/13] vhost-vdpa: count pinned memory
  2024-01-10 20:40 ` [RFC V1 01/13] vhost-vdpa: count pinned memory Steve Sistare
@ 2024-01-10 22:24   ` Michael S. Tsirkin
  2024-01-17 20:34     ` Steven Sistare
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-01-10 22:24 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Jason Wang, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Wed, Jan 10, 2024 at 12:40:03PM -0800, Steve Sistare wrote:
> Remember the count of pinned memory for the device.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Can we have iommufd support in vdpa so we do not keep extending these hacks?


> ---
>  drivers/vhost/vdpa.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index da7ec77cdaff..10fb95bcca1a 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -59,6 +59,7 @@ struct vhost_vdpa {
>  	int in_batch;
>  	struct vdpa_iova_range range;
>  	u32 batch_asid;
> +	long pinned_vm;
>  };
>  
>  static DEFINE_IDA(vhost_vdpa_ida);
> @@ -893,6 +894,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>  			unpin_user_page(page);
>  		}
>  		atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
> +		v->pinned_vm -= PFN_DOWN(map->size);
>  		vhost_vdpa_general_unmap(v, map, asid);
>  		vhost_iotlb_map_free(iotlb, map);
>  	}
> @@ -975,9 +977,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>  		return r;
>  	}
>  
> -	if (!vdpa->use_va)
> +	if (!vdpa->use_va) {
>  		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
> -
> +		v->pinned_vm += PFN_DOWN(size);
> +	}
>  	return 0;
>  }
>  
> -- 
> 2.39.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 00/13] vdpa live update
  2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
                   ` (12 preceding siblings ...)
  2024-01-10 20:40 ` [RFC V1 13/13] vduse: " Steve Sistare
@ 2024-01-11  2:55 ` Jason Wang
  2024-01-17 20:31   ` Steven Sistare
  13 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2024-01-11  2:55 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state.  Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as qemu.
> The device must be suspended during the update, but its dma mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> require it, because the userland address of each dma mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet available.  The vdpa device
> descriptor, fd, remains open across the exec.
>
>   ioctl(fd, VHOST_VDPA_SUSPEND)
>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>   exec

Is there a userspace implementation as a reference?

>
>   ioctl(fd, VHOST_NEW_OWNER)
>
>   issue ioctls to re-create vrings
>
>   if VHOST_BACKEND_F_IOTLB_REMAP
>       foreach dma mapping
>           write(fd, {VHOST_IOTLB_REMAP, new_addr})

I think I need to understand the advantages of this approach. For
example, why it is better than

ioctl(VHOST_RESET_OWNER)
exec

ioctl(VHOST_SET_OWNER)

for each dma mapping
     ioctl(VHOST_IOTLB_UPDATE)

Thanks

>
>   ioctl(fd, VHOST_VDPA_SET_STATUS,
>             ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>
>
> Steve Sistare (13):
>   vhost-vdpa: count pinned memory
>   vhost-vdpa: pass mm to bind
>   vhost-vdpa: VHOST_NEW_OWNER
>   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
>   vhost-vdpa: VHOST_IOTLB_REMAP
>   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
>   vhost-vdpa: flush workers on suspend
>   vduse: flush workers on suspend
>   vdpa_sim: reset must not run
>   vdpa_sim: flush workers on suspend
>   vdpa/mlx5: new owner capability
>   vdpa_sim: new owner capability
>   vduse: new owner capability
>
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  |   3 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c   |  24 ++++++-
>  drivers/vdpa/vdpa_user/vduse_dev.c |  32 +++++++++
>  drivers/vhost/vdpa.c               | 101 +++++++++++++++++++++++++++--
>  drivers/vhost/vhost.c              |  15 +++++
>  drivers/vhost/vhost.h              |   1 +
>  include/uapi/linux/vhost.h         |  10 +++
>  include/uapi/linux/vhost_types.h   |  15 ++++-
>  8 files changed, 191 insertions(+), 10 deletions(-)
>
> --
> 2.39.3
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP
  2024-01-10 20:40 ` [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
@ 2024-01-11  3:08   ` Jason Wang
  2024-01-17 20:31     ` Steven Sistare
  2024-01-16 18:14   ` Eugenio Perez Martin
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Wang @ 2024-01-11  3:08 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> some devices need to know the new userland addresses of the dma mappings.
> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> of a mapping.  The new uaddr must address the same memory object as
> originally mapped.
>
> The user must suspend the device before the old address is invalidated,
> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> requirement is not enforced by the API.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vhost/vdpa.c             | 34 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vhost_types.h | 11 ++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index faed6471934a..ec5ca20bd47d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>
>  }
>
> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> +                                         struct vhost_iotlb *iotlb,
> +                                         struct vhost_iotlb_msg *msg)
> +{
> +       struct vdpa_device *vdpa = v->vdpa;
> +       const struct vdpa_config_ops *ops = vdpa->config;
> +       u32 asid = iotlb_to_asid(iotlb);
> +       u64 start = msg->iova;
> +       u64 last = start + msg->size - 1;
> +       struct vhost_iotlb_map *map;
> +       int r = 0;
> +
> +       if (msg->perm || !msg->size)
> +               return -EINVAL;
> +
> +       map = vhost_iotlb_itree_first(iotlb, start, last);
> +       if (!map)
> +               return -ENOENT;
> +
> +       if (map->start != start || map->last != last)
> +               return -EINVAL;
> +
> +       /* batch will finish with remap.  non-batch must do it now. */
> +       if (!v->in_batch)
> +               r = ops->set_map(vdpa, asid, iotlb);
> +       if (!r)
> +               map->addr = msg->uaddr;

I may miss something, for example for PA mapping,

1) need to convert uaddr into phys addr
2) need to check whether the uaddr is backed by the same page or not?

Thanks

> +
> +       return r;
> +}
> +
>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>                                            struct vhost_iotlb *iotlb,
>                                            struct vhost_iotlb_msg *msg)
> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>                         ops->set_map(vdpa, asid, iotlb);
>                 v->in_batch = false;
>                 break;
> +       case VHOST_IOTLB_REMAP:
> +               r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> +               break;
>         default:
>                 r = -EINVAL;
>                 break;
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index 9177843951e9..35908315ff55 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>  /*
>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>   * multiple mappings in one go: beginning with
> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>   * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>   * When one of these two values is used as the message type, the rest
>   * of the fields in the message are ignored. There's no guarantee that
> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>   */
>  #define VHOST_IOTLB_BATCH_BEGIN    5
>  #define VHOST_IOTLB_BATCH_END      6
> +
> +/*
> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> + * The new uaddr must address the same memory object as originally mapped.
> + * Failure to do so will result in user memory corruption and/or device
> + * misbehavior.  iova and size must match the arguments used to create the
> + * an existing mapping.  Protection is not changed, and perm must be 0.
> + */
> +#define VHOST_IOTLB_REMAP          7
>         __u8 type;
>  };
>
> --
> 2.39.3
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend
  2024-01-10 20:40 ` [RFC V1 07/13] vhost-vdpa: flush workers on suspend Steve Sistare
@ 2024-01-11  3:09   ` Jason Wang
  2024-01-11 16:17     ` Mike Christie
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2024-01-11  3:09 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> To pass ownership of a live vdpa device to a new process, the user
> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> mm.  Flush workers in suspend to guarantee that no worker sees the new
> mm and old VA in between.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vhost/vdpa.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 8fe1562d24af..9673e8e20d11 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>  {
>         struct vdpa_device *vdpa = v->vdpa;
>         const struct vdpa_config_ops *ops = vdpa->config;
> +       struct vhost_dev *vdev = &v->vdev;
>
>         if (!ops->suspend)
>                 return -EOPNOTSUPP;
>
> +       if (vdev->use_worker)
> +               vhost_dev_flush(vdev);

It looks to me like it's better to check use_woker in vhost_dev_flush.

Thanks


> +
>         return ops->suspend(vdpa);
>  }
>
> --
> 2.39.3
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 08/13] vduse: flush workers on suspend
  2024-01-10 20:40 ` [RFC V1 08/13] vduse: " Steve Sistare
@ 2024-01-11  3:09   ` Jason Wang
  2024-01-17 20:31     ` Steven Sistare
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2024-01-11  3:09 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> To pass ownership of a live vdpa device to a new process, the user
> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> mm.  Flush workers in suspend to guarantee that no worker sees the new
> mm and old VA in between.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

It seems we need a better title, probably "suspend support for vduse"?
And it looks better to be an separate patch.

Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend
  2024-01-11  3:09   ` Jason Wang
@ 2024-01-11 16:17     ` Mike Christie
  2024-01-12  2:28       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Christie @ 2024-01-11 16:17 UTC (permalink / raw)
  To: Jason Wang, Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On 1/10/24 9:09 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> To pass ownership of a live vdpa device to a new process, the user
>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>> mm.  Flush workers in suspend to guarantee that no worker sees the new
>> mm and old VA in between.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vhost/vdpa.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 8fe1562d24af..9673e8e20d11 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>>  {
>>         struct vdpa_device *vdpa = v->vdpa;
>>         const struct vdpa_config_ops *ops = vdpa->config;
>> +       struct vhost_dev *vdev = &v->vdev;
>>
>>         if (!ops->suspend)
>>                 return -EOPNOTSUPP;
>>
>> +       if (vdev->use_worker)
>> +               vhost_dev_flush(vdev);
> 
> It looks to me like it's better to check use_woker in vhost_dev_flush.
> 

You can now just call vhost_dev_flush and it will do the right thing.
The xa_for_each loop will only flush workers if they have been setup,
so for vdpa it will not find/flush anything.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend
  2024-01-11 16:17     ` Mike Christie
@ 2024-01-12  2:28       ` Jason Wang
  2024-01-17 20:30         ` Steven Sistare
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2024-01-12  2:28 UTC (permalink / raw)
  To: Mike Christie
  Cc: Steve Sistare, virtualization, linux-kernel, Michael S. Tsirkin,
	Si-Wei Liu, Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea,
	Eli Cohen, Xie Yongji

On Fri, Jan 12, 2024 at 12:18 AM Mike Christie
<michael.christie@oracle.com> wrote:
>
> On 1/10/24 9:09 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> To pass ownership of a live vdpa device to a new process, the user
> >> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> >> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> >> mm.  Flush workers in suspend to guarantee that no worker sees the new
> >> mm and old VA in between.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vhost/vdpa.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 8fe1562d24af..9673e8e20d11 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>  {
> >>         struct vdpa_device *vdpa = v->vdpa;
> >>         const struct vdpa_config_ops *ops = vdpa->config;
> >> +       struct vhost_dev *vdev = &v->vdev;
> >>
> >>         if (!ops->suspend)
> >>                 return -EOPNOTSUPP;
> >>
> >> +       if (vdev->use_worker)
> >> +               vhost_dev_flush(vdev);
> >
> > It looks to me like it's better to check use_woker in vhost_dev_flush.
> >
>
> You can now just call vhost_dev_flush and it will do the right thing.
> The xa_for_each loop will only flush workers if they have been setup,
> so for vdpa it will not find/flush anything.

Right.

Thanks

>
>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP
  2024-01-10 20:40 ` [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
  2024-01-11  3:08   ` Jason Wang
@ 2024-01-16 18:14   ` Eugenio Perez Martin
  2024-02-09 15:49     ` Steven Sistare
  1 sibling, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2024-01-16 18:14 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Si-Wei Liu, Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji

On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> some devices need to know the new userland addresses of the dma mappings.
> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> of a mapping.  The new uaddr must address the same memory object as
> originally mapped.
>

I think this new ioctl is very interesting, and can be used to
optimize some parts of live migration with shadow virtqueue if it
allows to actually replace the uaddr. Would you be interested in that
capability?

> The user must suspend the device before the old address is invalidated,
> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> requirement is not enforced by the API.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vhost/vdpa.c             | 34 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vhost_types.h | 11 ++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index faed6471934a..ec5ca20bd47d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>
>  }
>
> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> +                                         struct vhost_iotlb *iotlb,
> +                                         struct vhost_iotlb_msg *msg)
> +{
> +       struct vdpa_device *vdpa = v->vdpa;
> +       const struct vdpa_config_ops *ops = vdpa->config;
> +       u32 asid = iotlb_to_asid(iotlb);
> +       u64 start = msg->iova;
> +       u64 last = start + msg->size - 1;
> +       struct vhost_iotlb_map *map;
> +       int r = 0;
> +
> +       if (msg->perm || !msg->size)
> +               return -EINVAL;
> +
> +       map = vhost_iotlb_itree_first(iotlb, start, last);
> +       if (!map)
> +               return -ENOENT;
> +
> +       if (map->start != start || map->last != last)
> +               return -EINVAL;
> +
> +       /* batch will finish with remap.  non-batch must do it now. */
> +       if (!v->in_batch)
> +               r = ops->set_map(vdpa, asid, iotlb);

I'm missing how these cases are handled:
* The device does not expose set_map but dma_map / dma_unmap (you can
check this case with the simulator)
* The device uses platform iommu (!set_map && !dma_unmap vdpa_ops).

> +       if (!r)
> +               map->addr = msg->uaddr;
> +
> +       return r;
> +}
> +
>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>                                            struct vhost_iotlb *iotlb,
>                                            struct vhost_iotlb_msg *msg)
> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>                         ops->set_map(vdpa, asid, iotlb);
>                 v->in_batch = false;
>                 break;
> +       case VHOST_IOTLB_REMAP:
> +               r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> +               break;
>         default:
>                 r = -EINVAL;
>                 break;
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index 9177843951e9..35908315ff55 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>  /*
>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>   * multiple mappings in one go: beginning with
> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>   * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>   * When one of these two values is used as the message type, the rest
>   * of the fields in the message are ignored. There's no guarantee that
> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>   */
>  #define VHOST_IOTLB_BATCH_BEGIN    5
>  #define VHOST_IOTLB_BATCH_END      6
> +
> +/*
> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> + * The new uaddr must address the same memory object as originally mapped.
> + * Failure to do so will result in user memory corruption and/or device
> + * misbehavior.  iova and size must match the arguments used to create the
> + * an existing mapping.  Protection is not changed, and perm must be 0.
> + */
> +#define VHOST_IOTLB_REMAP          7
>         __u8 type;
>  };
>
> --
> 2.39.3
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 09/13] vdpa_sim: reset must not run
  2024-01-10 20:40 ` [RFC V1 09/13] vdpa_sim: reset must not run Steve Sistare
@ 2024-01-16 18:33   ` Eugenio Perez Martin
  0 siblings, 0 replies; 33+ messages in thread
From: Eugenio Perez Martin @ 2024-01-16 18:33 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Si-Wei Liu, Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji

On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> vdpasim_do_reset sets running to true, which is wrong, as it allows
> vdpasim_kick_vq to post work requests before the device has been
> configured.  To fix, do not set running until VIRTIO_CONFIG_S_FEATURES_OK
> is set.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Good catch, thanks! Please send this separately so it can be merged
independently.

Also, the Fixes tag is missing, isn't it?

Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op")

Reviewed-by: Eugenio Pérez <eperezma@redhat.com>

> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index be2925d0d283..6304cb0b4770 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -160,7 +160,7 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags)
>                 }
>         }
>
> -       vdpasim->running = true;
> +       vdpasim->running = false;
>         spin_unlock(&vdpasim->iommu_lock);
>
>         vdpasim->features = 0;
> @@ -483,6 +483,7 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
>
>         mutex_lock(&vdpasim->mutex);
>         vdpasim->status = status;
> +       vdpasim->running = (status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
>         mutex_unlock(&vdpasim->mutex);
>  }
>
> --
> 2.39.3
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 10/13] vdpa_sim: flush workers on suspend
  2024-01-10 20:40 ` [RFC V1 10/13] vdpa_sim: flush workers on suspend Steve Sistare
@ 2024-01-16 18:57   ` Eugenio Perez Martin
  2024-01-17 20:31     ` Steven Sistare
  0 siblings, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2024-01-16 18:57 UTC (permalink / raw)
  To: Steve Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Si-Wei Liu, Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji

On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> To pass ownership of a live vdpa device to a new process, the user
> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> mm.  Flush workers in suspend to guarantee that no worker sees the new
> mm and old VA in between.
>

The worker should already be stopped by the end of the suspend ioctl,
so maybe we can consider this a fix?

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 6304cb0b4770..8734834983cb 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -74,6 +74,17 @@ static void vdpasim_worker_change_mm_sync(struct vdpasim *vdpasim,
>         kthread_flush_work(work);
>  }
>
> +static void flush_work_fn(struct kthread_work *work) {}
> +
> +static void vdpasim_flush_work(struct vdpasim *vdpasim)
> +{
> +       struct kthread_work work;
> +
> +       kthread_init_work(&work, flush_work_fn);
> +       kthread_queue_work(vdpasim->worker, &work);
> +       kthread_flush_work(&work);

Wouldn't it be better to cancel the work with kthread_cancel_work_sync here?

> +}
> +
>  static struct vdpasim *vdpa_to_sim(struct vdpa_device *vdpa)
>  {
>         return container_of(vdpa, struct vdpasim, vdpa);
> @@ -512,6 +523,8 @@ static int vdpasim_suspend(struct vdpa_device *vdpa)
>         vdpasim->running = false;
>         mutex_unlock(&vdpasim->mutex);
>
> +       vdpasim_flush_work(vdpasim);
> +
>         return 0;
>  }
>
> --
> 2.39.3
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend
  2024-01-12  2:28       ` Jason Wang
@ 2024-01-17 20:30         ` Steven Sistare
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Sistare @ 2024-01-17 20:30 UTC (permalink / raw)
  To: Jason Wang, Mike Christie
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On 1/11/2024 9:28 PM, Jason Wang wrote:
> On Fri, Jan 12, 2024 at 12:18 AM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> On 1/10/24 9:09 PM, Jason Wang wrote:
>>> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>
>>>> To pass ownership of a live vdpa device to a new process, the user
>>>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>>>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>>>> mm.  Flush workers in suspend to guarantee that no worker sees the new
>>>> mm and old VA in between.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  drivers/vhost/vdpa.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>> index 8fe1562d24af..9673e8e20d11 100644
>>>> --- a/drivers/vhost/vdpa.c
>>>> +++ b/drivers/vhost/vdpa.c
>>>> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>>>>  {
>>>>         struct vdpa_device *vdpa = v->vdpa;
>>>>         const struct vdpa_config_ops *ops = vdpa->config;
>>>> +       struct vhost_dev *vdev = &v->vdev;
>>>>
>>>>         if (!ops->suspend)
>>>>                 return -EOPNOTSUPP;
>>>>
>>>> +       if (vdev->use_worker)
>>>> +               vhost_dev_flush(vdev);
>>>
>>> It looks to me like it's better to check use_woker in vhost_dev_flush.
>>
>> You can now just call vhost_dev_flush and it will do the right thing.
>> The xa_for_each loop will only flush workers if they have been setup,
>> so for vdpa it will not find/flush anything.

Very good, I will drop this patch.

- Steve

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 10/13] vdpa_sim: flush workers on suspend
  2024-01-16 18:57   ` Eugenio Perez Martin
@ 2024-01-17 20:31     ` Steven Sistare
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Sistare @ 2024-01-17 20:31 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Si-Wei Liu, Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji

On 1/16/2024 1:57 PM, Eugenio Perez Martin wrote:
> On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> To pass ownership of a live vdpa device to a new process, the user
>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>> mm.  Flush workers in suspend to guarantee that no worker sees the new
>> mm and old VA in between.
> 
> The worker should already be stopped by the end of the suspend ioctl,
> so maybe we can consider this a fix?

Do you mean: the current behavior is a bug, independently of my new use case,
so I should submit this patch as a separate bug fix?  If yes, then will do.

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>> index 6304cb0b4770..8734834983cb 100644
>> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
>> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>> @@ -74,6 +74,17 @@ static void vdpasim_worker_change_mm_sync(struct vdpasim *vdpasim,
>>         kthread_flush_work(work);
>>  }
>>
>> +static void flush_work_fn(struct kthread_work *work) {}
>> +
>> +static void vdpasim_flush_work(struct vdpasim *vdpasim)
>> +{
>> +       struct kthread_work work;
>> +
>> +       kthread_init_work(&work, flush_work_fn);
>> +       kthread_queue_work(vdpasim->worker, &work);
>> +       kthread_flush_work(&work);
> 
> Wouldn't it be better to cancel the work with kthread_cancel_work_sync here?

I believe that does not guarantee that currently executing work completes:

  static bool __kthread_cancel_work_sync()
    if (worker->current_work != work)
        goto out_fast;

- Steve

>> +}
>> +
>>  static struct vdpasim *vdpa_to_sim(struct vdpa_device *vdpa)
>>  {
>>         return container_of(vdpa, struct vdpasim, vdpa);
>> @@ -512,6 +523,8 @@ static int vdpasim_suspend(struct vdpa_device *vdpa)
>>         vdpasim->running = false;
>>         mutex_unlock(&vdpasim->mutex);
>>
>> +       vdpasim_flush_work(vdpasim);
>> +
>>         return 0;
>>  }
>>
>> --
>> 2.39.3
>>
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 08/13] vduse: flush workers on suspend
  2024-01-11  3:09   ` Jason Wang
@ 2024-01-17 20:31     ` Steven Sistare
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Sistare @ 2024-01-17 20:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On 1/10/2024 10:09 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> To pass ownership of a live vdpa device to a new process, the user
>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>> mm.  Flush workers in suspend to guarantee that no worker sees the new
>> mm and old VA in between.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> It seems we need a better title, probably "suspend support for vduse"?
> And it looks better to be an separate patch.

Will do - steve


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP
  2024-01-11  3:08   ` Jason Wang
@ 2024-01-17 20:31     ` Steven Sistare
  2024-01-22  4:05       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Sistare @ 2024-01-17 20:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On 1/10/2024 10:08 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>> some devices need to know the new userland addresses of the dma mappings.
>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>> of a mapping.  The new uaddr must address the same memory object as
>> originally mapped.
>>
>> The user must suspend the device before the old address is invalidated,
>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>> requirement is not enforced by the API.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vhost/vdpa.c             | 34 ++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vhost_types.h | 11 ++++++++++-
>>  2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index faed6471934a..ec5ca20bd47d 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>
>>  }
>>
>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>> +                                         struct vhost_iotlb *iotlb,
>> +                                         struct vhost_iotlb_msg *msg)
>> +{
>> +       struct vdpa_device *vdpa = v->vdpa;
>> +       const struct vdpa_config_ops *ops = vdpa->config;
>> +       u32 asid = iotlb_to_asid(iotlb);
>> +       u64 start = msg->iova;
>> +       u64 last = start + msg->size - 1;
>> +       struct vhost_iotlb_map *map;
>> +       int r = 0;
>> +
>> +       if (msg->perm || !msg->size)
>> +               return -EINVAL;
>> +
>> +       map = vhost_iotlb_itree_first(iotlb, start, last);
>> +       if (!map)
>> +               return -ENOENT;
>> +
>> +       if (map->start != start || map->last != last)
>> +               return -EINVAL;
>> +
>> +       /* batch will finish with remap.  non-batch must do it now. */
>> +       if (!v->in_batch)
>> +               r = ops->set_map(vdpa, asid, iotlb);
>> +       if (!r)
>> +               map->addr = msg->uaddr;
> 
> I may miss something, for example for PA mapping,
> 
> 1) need to convert uaddr into phys addr
> 2) need to check whether the uaddr is backed by the same page or not?

This code does not verify that the new size@uaddr points to the same physical
pages as the old size@uaddr.  If the app screws up and they differ, then the app
may corrupt its own memory, but no-one else's.

It would be expensive for large memories to verify page by page, O(npages), and such
verification lies on the critical path for virtual machine downtime during live update.
I could compare the properties of the vma(s) for the old size@uaddr vs the vma for the 
new, but that is more complicated and would be a maintenance headache.  When I submitted
such code to Alex W when writing the equivalent patches for vfio, he said don't check,
correctness is the user's responsibility.

- Steve

>> +
>> +       return r;
>> +}
>> +
>>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>                                            struct vhost_iotlb *iotlb,
>>                                            struct vhost_iotlb_msg *msg)
>> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>>                         ops->set_map(vdpa, asid, iotlb);
>>                 v->in_batch = false;
>>                 break;
>> +       case VHOST_IOTLB_REMAP:
>> +               r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>> +               break;
>>         default:
>>                 r = -EINVAL;
>>                 break;
>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index 9177843951e9..35908315ff55 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>>  /*
>>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>>   * multiple mappings in one go: beginning with
>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>>   * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>>   * When one of these two values is used as the message type, the rest
>>   * of the fields in the message are ignored. There's no guarantee that
>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>>   */
>>  #define VHOST_IOTLB_BATCH_BEGIN    5
>>  #define VHOST_IOTLB_BATCH_END      6
>> +
>> +/*
>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
>> + * The new uaddr must address the same memory object as originally mapped.
>> + * Failure to do so will result in user memory corruption and/or device
>> + * misbehavior.  iova and size must match the arguments used to create the
>> + * an existing mapping.  Protection is not changed, and perm must be 0.
>> + */
>> +#define VHOST_IOTLB_REMAP          7
>>         __u8 type;
>>  };
>>
>> --
>> 2.39.3
>>
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 00/13] vdpa live update
  2024-01-11  2:55 ` [RFC V1 00/13] vdpa live update Jason Wang
@ 2024-01-17 20:31   ` Steven Sistare
  2024-01-22  4:12     ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Sistare @ 2024-01-17 20:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On 1/10/2024 9:55 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> Live update is a technique wherein an application saves its state, exec's
>> to an updated version of itself, and restores its state.  Clients of the
>> application experience a brief suspension of service, on the order of
>> 100's of milliseconds, but are otherwise unaffected.
>>
>> Define and implement interfaces that allow vdpa devices to be preserved
>> across fork or exec, to support live update for applications such as qemu.
>> The device must be suspended during the update, but its dma mappings are
>> preserved, so the suspension is brief.
>>
>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
>> accounting from one process to another.
>>
>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
>> VHOST_NEW_OWNER is supported.
>>
>> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
>> address in the new process.
>>
>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
>> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
>> require it, because the userland address of each dma mapping is discarded
>> after being translated to a physical address.
>>
>> Here is a pseudo-code sequence for performing live update, based on
>> suspend + reset because resume is not yet available.  The vdpa device
>> descriptor, fd, remains open across the exec.
>>
>>   ioctl(fd, VHOST_VDPA_SUSPEND)
>>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>>   exec
> 
> Is there a userspace implementation as a reference?

I have working patches for qemu that use these ioctl's, but they depend on other 
qemu cpr patches that are a work in progress, and not posted yet.  I'm working on
that.

>>   ioctl(fd, VHOST_NEW_OWNER)
>>
>>   issue ioctls to re-create vrings
>>
>>   if VHOST_BACKEND_F_IOTLB_REMAP
>>       foreach dma mapping
>>           write(fd, {VHOST_IOTLB_REMAP, new_addr})
> 
> I think I need to understand the advantages of this approach. For
> example, why it is better than
> 
> ioctl(VHOST_RESET_OWNER)
> exec
> 
> ioctl(VHOST_SET_OWNER)
> 
> for each dma mapping
>      ioctl(VHOST_IOTLB_UPDATE)

That is slower.  VHOST_RESET_OWNER unbinds physical pages, and VHOST_IOTLB_UPDATE
rebinds them.  It costs multiple seconds for large memories, and is incurred during the
virtual machine's pause time during live update.  For comparison, the total pause time
for live update with vfio interfaces is ~100 millis.

However, the interaction with userland is so similar that the same code paths can be used.
In my qemu prototype, after cpr exec's new qemu:
  - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
  - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE

- Steve


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 01/13] vhost-vdpa: count pinned memory
  2024-01-10 22:24   ` Michael S. Tsirkin
@ 2024-01-17 20:34     ` Steven Sistare
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Sistare @ 2024-01-17 20:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, Jason Wang, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On 1/10/2024 5:24 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 10, 2024 at 12:40:03PM -0800, Steve Sistare wrote:
>> Remember the count of pinned memory for the device.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Can we have iommufd support in vdpa so we do not keep extending these hacks?

I assume this is rhetorical and not aimed specifically at me, but live update
interfaces for iommufd are on my todo list.

- Steve

>> ---
>>  drivers/vhost/vdpa.c | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index da7ec77cdaff..10fb95bcca1a 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -59,6 +59,7 @@ struct vhost_vdpa {
>>  	int in_batch;
>>  	struct vdpa_iova_range range;
>>  	u32 batch_asid;
>> +	long pinned_vm;
>>  };
>>  
>>  static DEFINE_IDA(vhost_vdpa_ida);
>> @@ -893,6 +894,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>>  			unpin_user_page(page);
>>  		}
>>  		atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
>> +		v->pinned_vm -= PFN_DOWN(map->size);
>>  		vhost_vdpa_general_unmap(v, map, asid);
>>  		vhost_iotlb_map_free(iotlb, map);
>>  	}
>> @@ -975,9 +977,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>>  		return r;
>>  	}
>>  
>> -	if (!vdpa->use_va)
>> +	if (!vdpa->use_va) {
>>  		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
>> -
>> +		v->pinned_vm += PFN_DOWN(size);
>> +	}
>>  	return 0;
>>  }
>>  
>> -- 
>> 2.39.3
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP
  2024-01-17 20:31     ` Steven Sistare
@ 2024-01-22  4:05       ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2024-01-22  4:05 UTC (permalink / raw)
  To: Steven Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 1/10/2024 10:08 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> >> some devices need to know the new userland addresses of the dma mappings.
> >> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> >> of a mapping.  The new uaddr must address the same memory object as
> >> originally mapped.
> >>
> >> The user must suspend the device before the old address is invalidated,
> >> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> >> requirement is not enforced by the API.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vhost/vdpa.c             | 34 ++++++++++++++++++++++++++++++++
> >>  include/uapi/linux/vhost_types.h | 11 ++++++++++-
> >>  2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index faed6471934a..ec5ca20bd47d 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >>
> >>  }
> >>
> >> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> >> +                                         struct vhost_iotlb *iotlb,
> >> +                                         struct vhost_iotlb_msg *msg)
> >> +{
> >> +       struct vdpa_device *vdpa = v->vdpa;
> >> +       const struct vdpa_config_ops *ops = vdpa->config;
> >> +       u32 asid = iotlb_to_asid(iotlb);
> >> +       u64 start = msg->iova;
> >> +       u64 last = start + msg->size - 1;
> >> +       struct vhost_iotlb_map *map;
> >> +       int r = 0;
> >> +
> >> +       if (msg->perm || !msg->size)
> >> +               return -EINVAL;
> >> +
> >> +       map = vhost_iotlb_itree_first(iotlb, start, last);
> >> +       if (!map)
> >> +               return -ENOENT;
> >> +
> >> +       if (map->start != start || map->last != last)
> >> +               return -EINVAL;
> >> +
> >> +       /* batch will finish with remap.  non-batch must do it now. */
> >> +       if (!v->in_batch)
> >> +               r = ops->set_map(vdpa, asid, iotlb);
> >> +       if (!r)
> >> +               map->addr = msg->uaddr;
> >
> > I may miss something, for example for PA mapping,
> >
> > 1) need to convert uaddr into phys addr
> > 2) need to check whether the uaddr is backed by the same page or not?
>
> This code does not verify that the new size@uaddr points to the same physical
> pages as the old size@uaddr.  If the app screws up and they differ, then the app
> may corrupt its own memory, but no-one else's.
>
> It would be expensive for large memories to verify page by page, O(npages), and such
> verification lies on the critical path for virtual machine downtime during live update.
> I could compare the properties of the vma(s) for the old size@uaddr vs the vma for the
> new, but that is more complicated and would be a maintenance headache.  When I submitted
> such code to Alex W when writing the equivalent patches for vfio, he said don't check,
> correctness is the user's responsibility.

Ok, let's document this somewhere.

Thanks

>
> - Steve
>
> >> +
> >> +       return r;
> >> +}
> >> +
> >>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >>                                            struct vhost_iotlb *iotlb,
> >>                                            struct vhost_iotlb_msg *msg)
> >> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> >>                         ops->set_map(vdpa, asid, iotlb);
> >>                 v->in_batch = false;
> >>                 break;
> >> +       case VHOST_IOTLB_REMAP:
> >> +               r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> >> +               break;
> >>         default:
> >>                 r = -EINVAL;
> >>                 break;
> >> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> >> index 9177843951e9..35908315ff55 100644
> >> --- a/include/uapi/linux/vhost_types.h
> >> +++ b/include/uapi/linux/vhost_types.h
> >> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> >>  /*
> >>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> >>   * multiple mappings in one go: beginning with
> >> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> >> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> >>   * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> >>   * When one of these two values is used as the message type, the rest
> >>   * of the fields in the message are ignored. There's no guarantee that
> >> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> >>   */
> >>  #define VHOST_IOTLB_BATCH_BEGIN    5
> >>  #define VHOST_IOTLB_BATCH_END      6
> >> +
> >> +/*
> >> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> >> + * The new uaddr must address the same memory object as originally mapped.
> >> + * Failure to do so will result in user memory corruption and/or device
> >> + * misbehavior.  iova and size must match the arguments used to create the
> >> + * an existing mapping.  Protection is not changed, and perm must be 0.
> >> + */
> >> +#define VHOST_IOTLB_REMAP          7
> >>         __u8 type;
> >>  };
> >>
> >> --
> >> 2.39.3
> >>
> >
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 00/13] vdpa live update
  2024-01-17 20:31   ` Steven Sistare
@ 2024-01-22  4:12     ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2024-01-22  4:12 UTC (permalink / raw)
  To: Steven Sistare
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Si-Wei Liu,
	Eugenio Perez Martin, Xuan Zhuo, Dragos Tatulea, Eli Cohen,
	Xie Yongji

On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 1/10/2024 9:55 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> Live update is a technique wherein an application saves its state, exec's
> >> to an updated version of itself, and restores its state.  Clients of the
> >> application experience a brief suspension of service, on the order of
> >> 100's of milliseconds, but are otherwise unaffected.
> >>
> >> Define and implement interfaces that allow vdpa devices to be preserved
> >> across fork or exec, to support live update for applications such as qemu.
> >> The device must be suspended during the update, but its dma mappings are
> >> preserved, so the suspension is brief.
> >>
> >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >> VHOST_NEW_OWNER is supported.
> >>
> >> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> >> address in the new process.
> >>
> >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> >> require it, because the userland address of each dma mapping is discarded
> >> after being translated to a physical address.
> >>
> >> Here is a pseudo-code sequence for performing live update, based on
> >> suspend + reset because resume is not yet available.  The vdpa device
> >> descriptor, fd, remains open across the exec.
> >>
> >>   ioctl(fd, VHOST_VDPA_SUSPEND)
> >>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >>   exec
> >
> > Is there a userspace implementation as a reference?
>
> I have working patches for qemu that use these ioctl's, but they depend on other
> qemu cpr patches that are a work in progress, and not posted yet.  I'm working on
> that.

Ok.

>
> >>   ioctl(fd, VHOST_NEW_OWNER)
> >>
> >>   issue ioctls to re-create vrings
> >>
> >>   if VHOST_BACKEND_F_IOTLB_REMAP
> >>       foreach dma mapping
> >>           write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >
> > I think I need to understand the advantages of this approach. For
> > example, why it is better than
> >
> > ioctl(VHOST_RESET_OWNER)
> > exec
> >
> > ioctl(VHOST_SET_OWNER)
> >
> > for each dma mapping
> >      ioctl(VHOST_IOTLB_UPDATE)
>
> That is slower.  VHOST_RESET_OWNER unbinds physical pages, and VHOST_IOTLB_UPDATE
> rebinds them.  It costs multiple seconds for large memories, and is incurred during the
> virtual machine's pause time during live update.  For comparison, the total pause time
> for live update with vfio interfaces is ~100 millis.
>
> However, the interaction with userland is so similar that the same code paths can be used.
> In my qemu prototype, after cpr exec's new qemu:
>   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
>   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE
>
> - Steve
>

Ok, let's document this in the changlog.

Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP
  2024-01-16 18:14   ` Eugenio Perez Martin
@ 2024-02-09 15:49     ` Steven Sistare
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Sistare @ 2024-02-09 15:49 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: virtualization, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Si-Wei Liu, Xuan Zhuo, Dragos Tatulea, Eli Cohen, Xie Yongji

On 1/16/2024 1:14 PM, Eugenio Perez Martin wrote:
> On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>> some devices need to know the new userland addresses of the dma mappings.
>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>> of a mapping.  The new uaddr must address the same memory object as
>> originally mapped.
> 
> I think this new ioctl is very interesting, and can be used to
> optimize some parts of live migration with shadow virtqueue if it
> allows to actually replace the uaddr. Would you be interested in that
> capability?

Sure.  Please share your thoughts on how it would be used.  I don't follow,
because with live migration, you are creating a new vdpa instance with new
shadow queues on the destination, versus relocating old shadow queues (which 
we could do during live update which preserves the same vdpa instance).

(Sorry for the late response, I stashed this email and forgot to respond.)

>> The user must suspend the device before the old address is invalidated,
>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>> requirement is not enforced by the API.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vhost/vdpa.c             | 34 ++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vhost_types.h | 11 ++++++++++-
>>  2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index faed6471934a..ec5ca20bd47d 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>
>>  }
>>
>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>> +                                         struct vhost_iotlb *iotlb,
>> +                                         struct vhost_iotlb_msg *msg)
>> +{
>> +       struct vdpa_device *vdpa = v->vdpa;
>> +       const struct vdpa_config_ops *ops = vdpa->config;
>> +       u32 asid = iotlb_to_asid(iotlb);
>> +       u64 start = msg->iova;
>> +       u64 last = start + msg->size - 1;
>> +       struct vhost_iotlb_map *map;
>> +       int r = 0;
>> +
>> +       if (msg->perm || !msg->size)
>> +               return -EINVAL;
>> +
>> +       map = vhost_iotlb_itree_first(iotlb, start, last);
>> +       if (!map)
>> +               return -ENOENT;
>> +
>> +       if (map->start != start || map->last != last)
>> +               return -EINVAL;
>> +
>> +       /* batch will finish with remap.  non-batch must do it now. */
>> +       if (!v->in_batch)
>> +               r = ops->set_map(vdpa, asid, iotlb);
> 
> I'm missing how these cases are handled:
>
> * The device does not expose set_map but dma_map / dma_unmap (you can
> check this case with the simulator)

I chose not to support dma_map, because set_map looks to be more commonly supported,
and batch-mode plus set_map is the most efficient way to support remapping.
I could define a dma_remap entry point if folks think that is important.
Regardless, I will check that ops->set_map != NULL before calling it.

> * The device uses platform iommu (!set_map && !dma_unmap vdpa_ops).

I believe this just works, because iommu_map() never saves userland address, so it 
does not need to be informed when uaddr changes.  That is certainly true for the 
code path:

  vhost_vdpa_pa_map()
    vhost_vdpa_map()
      if !dma_map && !set_map
        iommu_map()

However, this code path confuses me:

  vhost_vdpa_process_iotlb_update()
    if (vdpa->use_va)
      vhost_vdpa_va_map(... uaddr)
        vhost_vdpa_map(... uaddr)
          iommu_map(... uaddr)

AFAICT uaddr is never translated to physical.
Maybe use_va is always false if !dma_map && !set_map  ?

- Steve

>> +       if (!r)
>> +               map->addr = msg->uaddr;
>> +
>> +       return r;
>> +}
>> +
>>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>                                            struct vhost_iotlb *iotlb,
>>                                            struct vhost_iotlb_msg *msg)
>> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>>                         ops->set_map(vdpa, asid, iotlb);
>>                 v->in_batch = false;
>>                 break;
>> +       case VHOST_IOTLB_REMAP:
>> +               r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>> +               break;
>>         default:
>>                 r = -EINVAL;
>>                 break;
>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index 9177843951e9..35908315ff55 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>>  /*
>>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>>   * multiple mappings in one go: beginning with
>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>>   * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>>   * When one of these two values is used as the message type, the rest
>>   * of the fields in the message are ignored. There's no guarantee that
>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>>   */
>>  #define VHOST_IOTLB_BATCH_BEGIN    5
>>  #define VHOST_IOTLB_BATCH_END      6
>> +
>> +/*
>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
>> + * The new uaddr must address the same memory object as originally mapped.
>> + * Failure to do so will result in user memory corruption and/or device
>> + * misbehavior.  iova and size must match the arguments used to create the
>> + * an existing mapping.  Protection is not changed, and perm must be 0.
>> + */
>> +#define VHOST_IOTLB_REMAP          7
>>         __u8 type;
>>  };
>>
>> --
>> 2.39.3
>>
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-02-09 15:50 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-10 20:40 [RFC V1 00/13] vdpa live update Steve Sistare
2024-01-10 20:40 ` [RFC V1 01/13] vhost-vdpa: count pinned memory Steve Sistare
2024-01-10 22:24   ` Michael S. Tsirkin
2024-01-17 20:34     ` Steven Sistare
2024-01-10 20:40 ` [RFC V1 02/13] vhost-vdpa: pass mm to bind Steve Sistare
2024-01-10 20:40 ` [RFC V1 03/13] vhost-vdpa: VHOST_NEW_OWNER Steve Sistare
2024-01-10 20:40 ` [RFC V1 04/13] vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER Steve Sistare
2024-01-10 20:40 ` [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP Steve Sistare
2024-01-11  3:08   ` Jason Wang
2024-01-17 20:31     ` Steven Sistare
2024-01-22  4:05       ` Jason Wang
2024-01-16 18:14   ` Eugenio Perez Martin
2024-02-09 15:49     ` Steven Sistare
2024-01-10 20:40 ` [RFC V1 06/13] vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP Steve Sistare
2024-01-10 20:40 ` [RFC V1 07/13] vhost-vdpa: flush workers on suspend Steve Sistare
2024-01-11  3:09   ` Jason Wang
2024-01-11 16:17     ` Mike Christie
2024-01-12  2:28       ` Jason Wang
2024-01-17 20:30         ` Steven Sistare
2024-01-10 20:40 ` [RFC V1 08/13] vduse: " Steve Sistare
2024-01-11  3:09   ` Jason Wang
2024-01-17 20:31     ` Steven Sistare
2024-01-10 20:40 ` [RFC V1 09/13] vdpa_sim: reset must not run Steve Sistare
2024-01-16 18:33   ` Eugenio Perez Martin
2024-01-10 20:40 ` [RFC V1 10/13] vdpa_sim: flush workers on suspend Steve Sistare
2024-01-16 18:57   ` Eugenio Perez Martin
2024-01-17 20:31     ` Steven Sistare
2024-01-10 20:40 ` [RFC V1 11/13] vdpa/mlx5: new owner capability Steve Sistare
2024-01-10 20:40 ` [RFC V1 12/13] vdpa_sim: " Steve Sistare
2024-01-10 20:40 ` [RFC V1 13/13] vduse: " Steve Sistare
2024-01-11  2:55 ` [RFC V1 00/13] vdpa live update Jason Wang
2024-01-17 20:31   ` Steven Sistare
2024-01-22  4:12     ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).