public inbox for kvm@vger.kernel.org
* [PATCH V1 0/8] vfio virtual address update redo
@ 2022-12-06 21:55 Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Re-architect the interfaces that allow the underlying memory object of an
iova range to be mapped in a new address space.  The old interfaces allow
userland to indefinitely block vfio mediated device kernel threads, and do
not propagate the locked_vm count to a new mm.

Interface changes:
  - disable the VFIO_UPDATE_VADDR extension
  - delete VFIO_DMA_UNMAP_FLAG_VADDR
  - redefine VFIO_DMA_MAP_FLAG_VADDR

New interfaces:
  - VFIO_CHANGE_DMA_OWNER iommu driver ioctl
  - VFIO_DMA_OWNER extension, consisting of VFIO_CHANGE_DMA_OWNER and the
    redefined VFIO_DMA_MAP_FLAG_VADDR.

VFIO_DMA_MAP_FLAG_VADDR changes the base virtual address for a dma mapping.
It is called after exec, after the application remaps the corresponding shared
memory object that was preserved across exec.  However, the change does not
take effect until VFIO_CHANGE_DMA_OWNER is called.  This allows the application
to iterate over all dma mappings, registering a new vaddr for each, and have
the updates take effect atomically.

VFIO_CHANGE_DMA_OWNER changes the task and mm for all dma mappings to that of
the caller, and transfers the locked_vm count from the old to the new mm.  The
vaddr for each mapping must either be the same in the old and new mm (eg after
fork), or must have been updated with VFIO_DMA_MAP_FLAG_VADDR (eg after exec).
Subsequently, the caller is the only task that is allowed to pin pages for dma.
This prevents an application from exceeding the initial task's RLIMIT_MEMLOCK
by fork'ing and pinning in children.

These interfaces can be used to implement live update, in which a process such
as qemu execs an updated version of itself, while preserving its guest and
vfio devices.  The application must preserve the vfio descriptors across fork
and exec, and must not start a step below until the previous step has
finished.

  parent				      child

  1. fork
					      2. ioctl(VFIO_CHANGE_DMA_OWNER)
  3. exec new binary

  4. foreach dma mapping
       va = mmap()
       ioctl(, VFIO_DMA_MAP_FLAG_VADDR, va)

     ioctl(VFIO_CHANGE_DMA_OWNER)

					      5. exit

With this arrangement, the dma mappings are always associated with a valid
mm, and mediated device requests such as vfio_pin_pages and vfio_dma_rw block
only briefly during the ioctls.  Thanks to Jason Gunthorpe for suggesting fork
and the change mm ioctl.

Lastly, if a task exits or execs, and it still owns any dma mappings, they
are unmapped and unpinned.  This guarantees that pages do not remain pinned
indefinitely if a vfio descriptor is leaked to another process, and requires
tasks to explicitly transfer ownership of dma (and hence locked_vm) to a new
task and mm when continued operation is desired.  The vfio driver maps a
special vma so it can detect exit and exec, via the vm_operations_struct
close callback.

Steve Sistare (8):
  vfio: delete interfaces to update vaddr
  vfio/type1: dma owner permission
  vfio: close dma owner
  vfio/type1: close dma owner
  vfio/type1: track locked_vm per dma
  vfio/type1: update vaddr
  vfio: change dma owner
  vfio/type1: change dma owner

 drivers/vfio/container.c        | 169 ++++++++++++++++++-
 drivers/vfio/vfio.h             |   9 +-
 drivers/vfio/vfio_iommu_type1.c | 362 +++++++++++++++++++++++-----------------
 include/uapi/linux/vfio.h       |  54 ++++--
 4 files changed, 412 insertions(+), 182 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-06 23:52   ` Alex Williamson
                     ` (2 more replies)
  2022-12-06 21:55 ` [PATCH V1 2/8] vfio/type1: dma owner permission Steve Sistare
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Delete the interfaces that allow an iova range to be re-mapped in a new
address space.  They allow userland to indefinitely block vfio mediated
device kernel threads, and do not propagate the locked_vm count to a
new mm.

  - disable the VFIO_UPDATE_VADDR extension
  - delete VFIO_DMA_UNMAP_FLAG_VADDR
  - delete most of VFIO_DMA_MAP_FLAG_VADDR (but keep some for use in a
    new implementation in a subsequent patch).

Revert most of the code of these commits:

  441e810 ("vfio: interfaces to update vaddr")
  c3cbab2 ("vfio/type1: implement interfaces to update vaddr")
  898b9ea ("vfio/type1: block on invalid vaddr")

Revert these commits.  They are harmless, but no longer used after the
above are reverted, and this kind of functionality is better handled by
adding new methods to vfio_iommu_driver_ops.

  ec5e329 ("vfio: iommu driver notify callback")
  487ace1 ("vfio/type1: implement notify callback")

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/container.c        |   5 --
 drivers/vfio/vfio.h             |   7 --
 drivers/vfio/vfio_iommu_type1.c | 144 ++--------------------------------------
 include/uapi/linux/vfio.h       |  17 +----
 4 files changed, 8 insertions(+), 165 deletions(-)

diff --git a/drivers/vfio/container.c b/drivers/vfio/container.c
index d74164a..5bfd10d 100644
--- a/drivers/vfio/container.c
+++ b/drivers/vfio/container.c
@@ -382,11 +382,6 @@ static int vfio_fops_open(struct inode *inode, struct file *filep)
 static int vfio_fops_release(struct inode *inode, struct file *filep)
 {
 	struct vfio_container *container = filep->private_data;
-	struct vfio_iommu_driver *driver = container->iommu_driver;
-
-	if (driver && driver->ops->notify)
-		driver->ops->notify(container->iommu_data,
-				    VFIO_IOMMU_CONTAINER_CLOSE);
 
 	filep->private_data = NULL;
 
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index bcad54b..8a439c6 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -62,11 +62,6 @@ struct vfio_group {
 	struct blocking_notifier_head	notifier;
 };
 
-/* events for the backend driver notify callback */
-enum vfio_iommu_notify_type {
-	VFIO_IOMMU_CONTAINER_CLOSE = 0,
-};
-
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
  */
@@ -97,8 +92,6 @@ struct vfio_iommu_driver_ops {
 				  void *data, size_t count, bool write);
 	struct iommu_domain *(*group_iommu_domain)(void *iommu_data,
 						   struct iommu_group *group);
-	void		(*notify)(void *iommu_data,
-				  enum vfio_iommu_notify_type event);
 };
 
 struct vfio_iommu_driver {
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe..02c6ea3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -69,14 +69,11 @@ struct vfio_iommu {
 	struct list_head	device_list;
 	struct mutex		device_list_lock;
 	unsigned int		dma_avail;
-	unsigned int		vaddr_invalid_count;
 	uint64_t		pgsize_bitmap;
 	uint64_t		num_non_pinned_groups;
-	wait_queue_head_t	vaddr_wait;
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
-	bool			container_open;
 	struct list_head	emulated_iommu_groups;
 };
 
@@ -96,7 +93,6 @@ struct vfio_dma {
 	int			prot;		/* IOMMU_READ/WRITE */
 	bool			iommu_mapped;
 	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
-	bool			vaddr_invalid;
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
@@ -152,8 +148,6 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
-#define WAITED 1
-
 static int put_pfn(unsigned long pfn, int prot);
 
 static struct vfio_iommu_group*
@@ -595,60 +589,6 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	return ret;
 }
 
-static int vfio_wait(struct vfio_iommu *iommu)
-{
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
-	mutex_unlock(&iommu->lock);
-	schedule();
-	mutex_lock(&iommu->lock);
-	finish_wait(&iommu->vaddr_wait, &wait);
-	if (kthread_should_stop() || !iommu->container_open ||
-	    fatal_signal_pending(current)) {
-		return -EFAULT;
-	}
-	return WAITED;
-}
-
-/*
- * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
- * if the task waits, but is re-locked on return.  Return result in *dma_p.
- * Return 0 on success with no waiting, WAITED on success if waited, and -errno
- * on error.
- */
-static int vfio_find_dma_valid(struct vfio_iommu *iommu, dma_addr_t start,
-			       size_t size, struct vfio_dma **dma_p)
-{
-	int ret = 0;
-
-	do {
-		*dma_p = vfio_find_dma(iommu, start, size);
-		if (!*dma_p)
-			return -EINVAL;
-		else if (!(*dma_p)->vaddr_invalid)
-			return ret;
-		else
-			ret = vfio_wait(iommu);
-	} while (ret == WAITED);
-
-	return ret;
-}
-
-/*
- * Wait for all vaddr in the dma_list to become valid.  iommu lock is dropped
- * if the task waits, but is re-locked on return.  Return 0 on success with no
- * waiting, WAITED on success if waited, and -errno on error.
- */
-static int vfio_wait_all_valid(struct vfio_iommu *iommu)
-{
-	int ret = 0;
-
-	while (iommu->vaddr_invalid_count && ret >= 0)
-		ret = vfio_wait(iommu);
-
-	return ret;
-}
 
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
@@ -861,22 +801,6 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	/*
-	 * Wait for all necessary vaddr's to be valid so they can be used in
-	 * the main loop without dropping the lock, to avoid racing vs unmap.
-	 */
-again:
-	if (iommu->vaddr_invalid_count) {
-		for (i = 0; i < npage; i++) {
-			iova = user_iova + PAGE_SIZE * i;
-			ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
-			if (ret < 0)
-				goto pin_done;
-			if (ret == WAITED)
-				goto again;
-		}
-	}
-
 	/* Fail if no dma_umap notifier is registered */
 	if (list_empty(&iommu->device_list)) {
 		ret = -EINVAL;
@@ -1175,10 +1099,6 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
 	vfio_dma_bitmap_free(dma);
-	if (dma->vaddr_invalid) {
-		iommu->vaddr_invalid_count--;
-		wake_up_all(&iommu->vaddr_wait);
-	}
 	kfree(dma);
 	iommu->dma_avail++;
 }
@@ -1338,8 +1258,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	dma_addr_t iova = unmap->iova;
 	u64 size = unmap->size;
 	bool unmap_all = unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL;
-	bool invalidate_vaddr = unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR;
-	struct rb_node *n, *first_n;
+	struct rb_node *n;
 
 	mutex_lock(&iommu->lock);
 
@@ -1408,7 +1327,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	}
 
 	ret = 0;
-	n = first_n = vfio_find_dma_first_node(iommu, iova, size);
+	n = vfio_find_dma_first_node(iommu, iova, size);
 
 	while (n) {
 		dma = rb_entry(n, struct vfio_dma, node);
@@ -1418,27 +1337,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (!iommu->v2 && iova > dma->iova)
 			break;
 
-		if (invalidate_vaddr) {
-			if (dma->vaddr_invalid) {
-				struct rb_node *last_n = n;
-
-				for (n = first_n; n != last_n; n = rb_next(n)) {
-					dma = rb_entry(n,
-						       struct vfio_dma, node);
-					dma->vaddr_invalid = false;
-					iommu->vaddr_invalid_count--;
-				}
-				ret = -EINVAL;
-				unmapped = 0;
-				break;
-			}
-			dma->vaddr_invalid = true;
-			iommu->vaddr_invalid_count++;
-			unmapped += dma->size;
-			n = rb_next(n);
-			continue;
-		}
-
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			if (dma_last == dma) {
 				BUG_ON(++retries > 10);
@@ -1611,14 +1509,10 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (set_vaddr) {
 		if (!dma) {
 			ret = -ENOENT;
-		} else if (!dma->vaddr_invalid || dma->iova != iova ||
-			   dma->size != size) {
+		} else if (dma->iova != iova || dma->size != size) {
 			ret = -EINVAL;
 		} else {
 			dma->vaddr = vaddr;
-			dma->vaddr_invalid = false;
-			iommu->vaddr_invalid_count--;
-			wake_up_all(&iommu->vaddr_wait);
 		}
 		goto out_unlock;
 	} else if (dma) {
@@ -1707,10 +1601,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	int ret;
 
-	ret = vfio_wait_all_valid(iommu);
-	if (ret < 0)
-		return ret;
-
 	/* Arbitrarily pick the first domain in the list for lookups */
 	if (!list_empty(&iommu->domain_list))
 		d = list_first_entry(&iommu->domain_list,
@@ -2592,11 +2482,9 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	INIT_LIST_HEAD(&iommu->iova_list);
 	iommu->dma_list = RB_ROOT;
 	iommu->dma_avail = dma_entry_limit;
-	iommu->container_open = true;
 	mutex_init(&iommu->lock);
 	mutex_init(&iommu->device_list_lock);
 	INIT_LIST_HEAD(&iommu->device_list);
-	init_waitqueue_head(&iommu->vaddr_wait);
 	iommu->pgsize_bitmap = PAGE_MASK;
 	INIT_LIST_HEAD(&iommu->emulated_iommu_groups);
 
@@ -2668,7 +2556,6 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
-	case VFIO_UPDATE_VADDR:
 		return 1;
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
@@ -2860,7 +2747,6 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 	struct vfio_iommu_type1_dma_unmap unmap;
 	struct vfio_bitmap bitmap = { 0 };
 	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
-			VFIO_DMA_UNMAP_FLAG_VADDR |
 			VFIO_DMA_UNMAP_FLAG_ALL;
 	unsigned long minsz;
 	int ret;
@@ -2874,8 +2760,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 		return -EINVAL;
 
 	if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
-	    (unmap.flags & (VFIO_DMA_UNMAP_FLAG_ALL |
-			    VFIO_DMA_UNMAP_FLAG_VADDR)))
+	    (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL))
 		return -EINVAL;
 
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
@@ -3078,13 +2963,12 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 	struct vfio_dma *dma;
 	bool kthread = current->mm == NULL;
 	size_t offset;
-	int ret;
 
 	*copied = 0;
 
-	ret = vfio_find_dma_valid(iommu, user_iova, 1, &dma);
-	if (ret < 0)
-		return ret;
+	dma = vfio_find_dma(iommu, user_iova, 1);
+	if (!dma)
+		return -EINVAL;
 
 	if ((write && !(dma->prot & IOMMU_WRITE)) ||
 			!(dma->prot & IOMMU_READ))
@@ -3176,19 +3060,6 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
 	return domain;
 }
 
-static void vfio_iommu_type1_notify(void *iommu_data,
-				    enum vfio_iommu_notify_type event)
-{
-	struct vfio_iommu *iommu = iommu_data;
-
-	if (event != VFIO_IOMMU_CONTAINER_CLOSE)
-		return;
-	mutex_lock(&iommu->lock);
-	iommu->container_open = false;
-	mutex_unlock(&iommu->lock);
-	wake_up_all(&iommu->vaddr_wait);
-}
-
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.name			= "vfio-iommu-type1",
 	.owner			= THIS_MODULE,
@@ -3203,7 +3074,6 @@ static void vfio_iommu_type1_notify(void *iommu_data,
 	.unregister_device	= vfio_iommu_type1_unregister_device,
 	.dma_rw			= vfio_iommu_type1_dma_rw,
 	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
-	.notify			= vfio_iommu_type1_notify,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d7d8e09..5c5cc7e 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -49,7 +49,7 @@
 /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
 #define VFIO_UNMAP_ALL			9
 
-/* Supports the vaddr flag for DMA map and unmap */
+/* Obsolete, not supported by any IOMMU. */
 #define VFIO_UPDATE_VADDR		10
 
 /*
@@ -1214,15 +1214,6 @@ struct vfio_iommu_type1_info_dma_avail {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
- *
- * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
- * unblock translation of host virtual addresses in the iova range.  The vaddr
- * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
- * maintain memory consistency within the user application, the updated vaddr
- * must address the same memory object as originally mapped.  Failure to do so
- * will result in user memory corruption and/or device misbehavior.  iova and
- * size must match those in the original MAP_DMA call.  Protection is not
- * changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
@@ -1265,18 +1256,12 @@ struct vfio_bitmap {
  *
  * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
  * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
- *
- * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
- * virtual addresses in the iova range.  Tasks that attempt to translate an
- * iova's vaddr will block.  DMA to already-mapped pages continues.  This
- * cannot be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
-#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
 	__u8    data[];
-- 
1.8.3.1



* [PATCH V1 2/8] vfio/type1: dma owner permission
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-07 15:28   ` Jason Gunthorpe
  2022-12-06 21:55 ` [PATCH V1 3/8] vfio: close dma owner Steve Sistare
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

The first task to pin any pages becomes the dma owner, and becomes the only
task allowed to pin.  This prevents an application from exceeding the
initial task's RLIMIT_MEMLOCK by fork'ing and pinning in children.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 64 +++++++++++++++++++++++++++++++++--------
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 02c6ea3..4429794 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -75,6 +75,7 @@ struct vfio_iommu {
 	bool			nesting;
 	bool			dirty_page_tracking;
 	struct list_head	emulated_iommu_groups;
+	struct task_struct	*task;
 };
 
 struct vfio_domain {
@@ -93,9 +94,9 @@ struct vfio_dma {
 	int			prot;		/* IOMMU_READ/WRITE */
 	bool			iommu_mapped;
 	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
-	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct vfio_iommu	*iommu;		/* back pointer */
 };
 
 struct vfio_batch {
@@ -408,19 +409,29 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
+	struct task_struct *task = dma->iommu->task;
+	bool kthread = !current->mm;
 	struct mm_struct *mm;
 	int ret;
 
 	if (!npage)
 		return 0;
 
-	mm = async ? get_task_mm(dma->task) : dma->task->mm;
+	/* This is enforced at higher levels, so if it bites, it is a bug. */
+
+	if (!kthread && current->group_leader != task) {
+		WARN_ONCE(1, "%s: caller is pid %d, owner is pid %d\n",
+			  __func__, current->group_leader->pid, task->pid);
+		return -EPERM;
+	}
+
+	mm = async ? get_task_mm(task) : task->mm;
 	if (!mm)
 		return -ESRCH; /* process exited */
 
 	ret = mmap_write_lock_killable(mm);
 	if (!ret) {
-		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
+		ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
 					  dma->lock_cap);
 		mmap_write_unlock(mm);
 	}
@@ -609,6 +620,9 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	if (!mm)
 		return -ENODEV;
 
+	if (dma->iommu->task != current->group_leader)
+		return -EPERM;
+
 	if (batch->size) {
 		/* Leftover pages in batch from an earlier call. */
 		*pfn_base = page_to_pfn(batch->pages[batch->offset]);
@@ -730,11 +744,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 				  unsigned long *pfn_base, bool do_accounting)
 {
+	struct task_struct *task = dma->iommu->task;
 	struct page *pages[1];
 	struct mm_struct *mm;
 	int ret;
 
-	mm = get_task_mm(dma->task);
+	mm = get_task_mm(task);
 	if (!mm)
 		return -ENODEV;
 
@@ -751,8 +766,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 			if (ret == -ENOMEM)
 				pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK "
 					"(%ld) exceeded\n", __func__,
-					dma->task->comm, task_pid_nr(dma->task),
-					task_rlimit(dma->task, RLIMIT_MEMLOCK));
+					task->comm, task_pid_nr(task),
+					task_rlimit(task, RLIMIT_MEMLOCK));
 		}
 	}
 
@@ -784,6 +799,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 				      int npage, int prot,
 				      struct page **pages)
 {
+	bool kthread = !current->mm;
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_iommu_group *group;
 	int i, j, ret;
@@ -807,6 +823,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		goto pin_done;
 	}
 
+	if (!kthread && iommu->task != current->group_leader) {
+		ret = -EPERM;
+		goto pin_done;
+	}
+
 	/*
 	 * If iommu capable domain exist in the container then all pages are
 	 * already pinned and accounted. Accounting should be done if there is no
@@ -1097,7 +1118,6 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
-	put_task_struct(dma->task);
 	vfio_dma_bitmap_free(dma);
 	kfree(dma);
 	iommu->dma_avail++;
@@ -1247,6 +1267,16 @@ static void vfio_notify_dma_unmap(struct vfio_iommu *iommu,
 	mutex_lock(&iommu->lock);
 }
 
+static void vfio_iommu_set_task(struct vfio_iommu *iommu,
+				struct task_struct *task)
+{
+	if (iommu->task)
+		put_task_struct(iommu->task);
+	iommu->task = NULL;
+	if (task)
+		iommu->task = get_task_struct(task);
+}
+
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			     struct vfio_iommu_type1_dma_unmap *unmap,
 			     struct vfio_bitmap *bitmap)
@@ -1362,6 +1392,9 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	}
 
 unlock:
+	if (RB_EMPTY_ROOT(&iommu->dma_list))
+		vfio_iommu_set_task(iommu, NULL);
+
 	mutex_unlock(&iommu->lock);
 
 	/* Report how much was unmapped */
@@ -1537,6 +1570,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	}
 
 	iommu->dma_avail--;
+	dma->iommu = iommu;
 	dma->iova = iova;
 	dma->vaddr = vaddr;
 	dma->prot = prot;
@@ -1566,8 +1600,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	 * externally mapped.  Therefore track CAP_IPC_LOCK in vfio_dma at the
 	 * time of calling MAP_DMA.
 	 */
-	get_task_struct(current->group_leader);
-	dma->task = current->group_leader;
+	if (!iommu->task)
+		vfio_iommu_set_task(iommu, current->group_leader);
 	dma->lock_cap = capable(CAP_IPC_LOCK);
 
 	dma->pfn_list = RB_ROOT;
@@ -2528,6 +2562,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
 
 	vfio_iommu_iova_free(&iommu->iova_list);
 
+	vfio_iommu_set_task(iommu, NULL);
+
 	kfree(iommu);
 }
 
@@ -2963,6 +2999,7 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 	struct vfio_dma *dma;
 	bool kthread = current->mm == NULL;
 	size_t offset;
+	int ret = -EFAULT;
 
 	*copied = 0;
 
@@ -2974,15 +3011,18 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 			!(dma->prot & IOMMU_READ))
 		return -EPERM;
 
-	mm = get_task_mm(dma->task);
+	mm = get_task_mm(iommu->task);
 
 	if (!mm)
 		return -EPERM;
 
 	if (kthread)
 		kthread_use_mm(mm);
-	else if (current->mm != mm)
+	else if (current->mm != mm) {
+		/* Must use matching mm for vaddr translation. */
+		ret = -EPERM;
 		goto out;
+	}
 
 	offset = user_iova - dma->iova;
 
@@ -3011,7 +3051,7 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 		kthread_unuse_mm(mm);
 out:
 	mmput(mm);
-	return *copied ? 0 : -EFAULT;
+	return *copied ? 0 : ret;
 }
 
 static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
-- 
1.8.3.1



* [PATCH V1 3/8] vfio: close dma owner
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 2/8] vfio/type1: dma owner permission Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 4/8] vfio/type1: " Steve Sistare
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Define a new vfio_iommu_driver_ops method named close_dma_owner, called
when a task closes its mm (ie, exit or exec).  This allows the driver to
check if the task owns any dma mappings, and take appropriate action,
such as unpinning pages.  This guarantees that pages do not remain pinned
if the task leaks vfio descriptors to another process and then exits
or execs.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/container.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h      |   1 +
 2 files changed, 147 insertions(+)

diff --git a/drivers/vfio/container.c b/drivers/vfio/container.c
index 5bfd10d..b660adc 100644
--- a/drivers/vfio/container.c
+++ b/drivers/vfio/container.c
@@ -10,6 +10,7 @@
 #include <linux/capability.h>
 #include <linux/iommu.h>
 #include <linux/miscdevice.h>
+#include <linux/mman.h>
 #include <linux/vfio.h>
 #include <uapi/linux/vfio.h>
 
@@ -22,6 +23,13 @@ struct vfio_container {
 	struct vfio_iommu_driver	*iommu_driver;
 	void				*iommu_data;
 	bool				noiommu;
+	struct list_head		task_list;
+	struct mutex			task_lock;
+};
+
+struct vfio_task {
+	struct task_struct		*task;
+	struct list_head		task_next;
 };
 
 static struct vfio {
@@ -330,6 +338,136 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 	return ret;
 }
 
+/*
+ * Maintain a list of tasks that have mapped dma regions.
+ */
+
+static void vfio_add_task(struct vfio_container *container)
+{
+	struct vfio_task *vftask = kzalloc(sizeof(*vftask), GFP_KERNEL);
+
+	vftask->task = get_task_struct(current->group_leader);
+	list_add(&vftask->task_next, &container->task_list);
+}
+
+static bool vfio_has_task(struct vfio_container *container)
+{
+	struct vfio_task *vftask;
+
+	list_for_each_entry(vftask, &container->task_list, task_next) {
+		if (vftask->task == current->group_leader)
+			return true;
+	}
+	return false;
+}
+
+static void vfio_remove_task(struct vfio_container *container)
+{
+	struct task_struct *task = current->group_leader;
+	struct vfio_task *vftask;
+
+	list_for_each_entry(vftask, &container->task_list, task_next) {
+		if (vftask->task == task) {
+			put_task_struct(task);
+			list_del(&vftask->task_next);
+			return;
+		}
+	}
+	WARN_ONCE(1, "%s pid %d not found\n", __func__, task->pid);
+}
+
+static int vfio_canary_create(struct file *filep);
+
+static int vfio_register_dma_task(struct vfio_container *container,
+				  struct file *filep)
+{
+	int ret = 0;
+
+	mutex_lock(&container->task_lock);
+
+	if (vfio_has_task(container))
+		goto out_unlock;
+	ret = vfio_canary_create(filep);
+	if (ret)
+		goto out_unlock;
+
+	vfio_add_task(container);
+
+out_unlock:
+	mutex_unlock(&container->task_lock);
+	return ret;
+}
+
+static void vfio_unregister_dma_task(struct vfio_container *container)
+{
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	mutex_lock(&container->task_lock);
+	vfio_remove_task(container);
+	mutex_unlock(&container->task_lock);
+
+	if (driver && driver->ops->close_dma_owner)
+		driver->ops->close_dma_owner(container->iommu_data);
+}
+
+/*
+ * Create a per-task vma that detects when an address space closes, by getting
+ * a vm_operations_struct close callback.
+ */
+
+static int vfio_canary_create(struct file *filep)
+{
+	unsigned long vaddr = vm_mmap(filep, 0, PAGE_SIZE, 0, MAP_PRIVATE, 0);
+
+	if (!vaddr)
+		return -ENOMEM;
+	else if (IS_ERR_VALUE(vaddr))
+		return (int)vaddr;
+	else
+		return 0;
+}
+
+static void vfio_canary_open(struct vm_area_struct *vma)
+{
+	/*
+	 * This vma is being dup'd after fork.  We don't have the new task yet,
+	 * so not useful.  Ignore it on close.
+	 */
+	vma->vm_private_data = NULL;
+}
+
+static void vfio_canary_close(struct vm_area_struct *vma)
+{
+	struct vfio_container *container = vma->vm_private_data;
+
+	if (container) {
+		vfio_unregister_dma_task(container);
+		vfio_container_put(container);
+	}
+}
+
+static vm_fault_t vfio_canary_fault(struct vm_fault *vmf)
+{
+	/* No need for access to the mapped canary */
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct vfio_canary_mmap_ops = {
+	.open = vfio_canary_open,
+	.close = vfio_canary_close,
+	.fault = vfio_canary_fault,
+};
+
+static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_container *container = filep->private_data;
+
+	vfio_container_get(container);
+	vma->vm_private_data = container;
+	vma->vm_ops = &vfio_canary_mmap_ops;
+	return 0;
+}
+
 static long vfio_fops_unl_ioctl(struct file *filep,
 				unsigned int cmd, unsigned long arg)
 {
@@ -351,6 +489,11 @@ static long vfio_fops_unl_ioctl(struct file *filep,
 	case VFIO_SET_IOMMU:
 		ret = vfio_ioctl_set_iommu(container, arg);
 		break;
+	case VFIO_IOMMU_MAP_DMA:
+		ret = vfio_register_dma_task(container, filep);
+		if (ret)
+			return ret;
+		fallthrough;
 	default:
 		driver = container->iommu_driver;
 		data = container->iommu_data;
@@ -372,6 +515,8 @@ static int vfio_fops_open(struct inode *inode, struct file *filep)
 
 	INIT_LIST_HEAD(&container->group_list);
 	init_rwsem(&container->group_lock);
+	INIT_LIST_HEAD(&container->task_list);
+	mutex_init(&container->task_lock);
 	kref_init(&container->kref);
 
 	filep->private_data = container;
@@ -396,6 +541,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
 	.release	= vfio_fops_release,
 	.unlocked_ioctl	= vfio_fops_unl_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
+	.mmap		= vfio_fops_mmap,
 };
 
 struct vfio_container *vfio_container_from_file(struct file *file)
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 8a439c6..0cf3cfe 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -92,6 +92,7 @@ struct vfio_iommu_driver_ops {
 				  void *data, size_t count, bool write);
 	struct iommu_domain *(*group_iommu_domain)(void *iommu_data,
 						   struct iommu_group *group);
+	void		(*close_dma_owner)(void *iommu_data);
 };
 
 struct vfio_iommu_driver {
-- 
1.8.3.1



* [PATCH V1 4/8] vfio/type1: close dma owner
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
                   ` (2 preceding siblings ...)
  2022-12-06 21:55 ` [PATCH V1 3/8] vfio: close dma owner Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 5/8] vfio/type1: track locked_vm per dma Steve Sistare
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

If a task exits or execs, unmap and unpin all dma mappings owned by the
task.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4429794..51bb687 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2932,6 +2932,16 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+static void vfio_iommu_type1_close_dma_owner(void *iommu_data)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	mutex_lock(&iommu->lock);
+	if (iommu->task == current->group_leader)
+		vfio_iommu_unmap_unpin_all(iommu);
+	mutex_unlock(&iommu->lock);
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -3114,6 +3124,7 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
 	.unregister_device	= vfio_iommu_type1_unregister_device,
 	.dma_rw			= vfio_iommu_type1_dma_rw,
 	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
+	.close_dma_owner	= vfio_iommu_type1_close_dma_owner,
 };
 
 static int __init vfio_iommu_type1_init(void)
-- 
1.8.3.1


* [PATCH V1 5/8] vfio/type1: track locked_vm per dma
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
                   ` (3 preceding siblings ...)
  2022-12-06 21:55 ` [PATCH V1 4/8] vfio/type1: " Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 6/8] vfio/type1: update vaddr Steve Sistare
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Track locked_vm per dma struct, and create a new subroutine, both for use
in a subsequent patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 51bb687..3bd89d5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -97,6 +97,7 @@ struct vfio_dma {
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
 	struct vfio_iommu	*iommu;		/* back pointer */
+	uint64_t		locked_vm;
 };
 
 struct vfio_batch {
@@ -407,6 +408,19 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 	return ret;
 }
 
+static int mm_lock_vm(struct task_struct *task, struct mm_struct *mm,
+		      bool lock_cap, long npage)
+{
+	int ret = mmap_write_lock_killable(mm);
+
+	if (!ret) {
+		ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
+					  lock_cap);
+		mmap_write_unlock(mm);
+	}
+	return ret;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct task_struct *task = dma->iommu->task;
@@ -429,12 +443,9 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 	if (!mm)
 		return -ESRCH; /* process exited */
 
-	ret = mmap_write_lock_killable(mm);
-	if (!ret) {
-		ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
-					  dma->lock_cap);
-		mmap_write_unlock(mm);
-	}
+	ret = mm_lock_vm(task, mm, dma->lock_cap, npage);
+	if (!ret)
+		dma->locked_vm += npage;
 
 	if (async)
 		mmput(mm);
-- 
1.8.3.1


* [PATCH V1 6/8] vfio/type1: update vaddr
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
                   ` (4 preceding siblings ...)
  2022-12-06 21:55 ` [PATCH V1 5/8] vfio/type1: track locked_vm per dma Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-06 21:55 ` [PATCH V1 7/8] vfio: change dma owner Steve Sistare
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Re-define VFIO_DMA_MAP_FLAG_VADDR as follows:

  If flags & VFIO_DMA_MAP_FLAG_VADDR, prepare to change the base virtual
  address for iova to vaddr.  The updated vaddr must address the same
  memory object as originally mapped, with the same access permissions,
  and must be shared.  The change takes effect after the next call to
  VFIO_CHANGE_DMA_OWNER.

VFIO_CHANGE_DMA_OWNER is defined in a subsequent patch.
See vfio.h for more details.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 include/uapi/linux/vfio.h       | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3bd89d5..fbea2b5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -90,6 +90,7 @@ struct vfio_dma {
 	struct rb_node		node;
 	dma_addr_t		iova;		/* Device address */
 	unsigned long		vaddr;		/* Process virtual addr */
+	unsigned long		new_vaddr;
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
 	bool			iommu_mapped;
@@ -1556,7 +1557,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		} else if (dma->iova != iova || dma->size != size) {
 			ret = -EINVAL;
 		} else {
-			dma->vaddr = vaddr;
+			dma->new_vaddr = vaddr;
 		}
 		goto out_unlock;
 	} else if (dma) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5c5cc7e..8b7c1ed 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1214,6 +1214,13 @@ struct vfio_iommu_type1_info_dma_avail {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, prepare to change the base virtual
+ * address for iova to vaddr.  The updated vaddr must address the same memory
+ * object as originally mapped, with the same access permissions, and must be
+ * shared.  The iova and size must match those in the original MAP_DMA call.
+ * Protection is not changed, and the READ & WRITE flags must be 0.  The change
+ * takes effect after the next call to VFIO_CHANGE_DMA_OWNER.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
-- 
1.8.3.1


* [PATCH V1 7/8] vfio: change dma owner
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
                   ` (5 preceding siblings ...)
  2022-12-06 21:55 ` [PATCH V1 6/8] vfio/type1: update vaddr Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-07 16:58   ` Jason Gunthorpe
  2022-12-06 21:55 ` [PATCH V1 8/8] vfio/type1: " Steve Sistare
  2022-12-07 15:23 ` [PATCH V1 0/8] vfio virtual address update redo Jason Gunthorpe
  8 siblings, 1 reply; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Define the VFIO_DMA_OWNER extension.  It indicates support for the
redefined VFIO_DMA_MAP_FLAG_VADDR, and the new VFIO_CHANGE_DMA_OWNER
ioctl, defined as:

 Change ownership of all dma mappings to the calling task, including
 count of locked pages subject to RLIMIT_MEMLOCK.  The new task's address
 space is used to translate virtual to physical addresses for all future
 requests, including those issued by mediated devices. For all mappings,
 the vaddr must be the same in the old and new address space, or can be
 changed in the new address space by using VFIO_DMA_MAP_FLAG_VADDR.

See vfio.h for more details.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/container.c  | 18 ++++++++++++++++++
 drivers/vfio/vfio.h       |  1 +
 include/uapi/linux/vfio.h | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/drivers/vfio/container.c b/drivers/vfio/container.c
index b660adc..7e78593 100644
--- a/drivers/vfio/container.c
+++ b/drivers/vfio/container.c
@@ -468,6 +468,21 @@ static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	return 0;
 }
 
+static int vfio_change_dma_owner(struct vfio_container *container,
+				 struct file *filep)
+{
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+	int ret;
+
+	if (!driver || !driver->ops->change_dma_owner)
+		return -ENOTTY;
+
+	ret = vfio_register_dma_task(container, filep);
+	if (!ret)
+		ret = driver->ops->change_dma_owner(container->iommu_data);
+	return ret;
+}
+
 static long vfio_fops_unl_ioctl(struct file *filep,
 				unsigned int cmd, unsigned long arg)
 {
@@ -489,6 +504,9 @@ static long vfio_fops_unl_ioctl(struct file *filep,
 	case VFIO_SET_IOMMU:
 		ret = vfio_ioctl_set_iommu(container, arg);
 		break;
+	case VFIO_CHANGE_DMA_OWNER:
+		ret = vfio_change_dma_owner(container, filep);
+		break;
 	case VFIO_IOMMU_MAP_DMA:
 		ret = vfio_register_dma_task(container, filep);
 		if (ret)
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 0cf3cfe..999a7b0 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -92,6 +92,7 @@ struct vfio_iommu_driver_ops {
 				  void *data, size_t count, bool write);
 	struct iommu_domain *(*group_iommu_domain)(void *iommu_data,
 						   struct iommu_group *group);
+	int		(*change_dma_owner)(void *iommu_data);
 	void		(*close_dma_owner)(void *iommu_data);
 };
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8b7c1ed..8074d80 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -53,6 +53,11 @@
 #define VFIO_UPDATE_VADDR		10
 
 /*
+ * Supports VFIO_CHANGE_DMA_OWNER and VFIO_DMA_MAP_FLAG_VADDR.
+ */
+#define VFIO_DMA_OWNER			11
+
+/*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
  * kernel and userspace.  We therefore use the _IO() macro for these
@@ -128,6 +133,33 @@ struct vfio_info_cap_header {
  */
 #define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
 
+/**
+ * VFIO_CHANGE_DMA_OWNER		_IO(VFIO_TYPE, VFIO_BASE + 22)
+ *
+ * Change ownership of all dma mappings to the calling task, including
+ * count of locked pages subject to RLIMIT_MEMLOCK.  The new task's address
+ * space is used to translate virtual to physical addresses for all future
+ * requests, including those issued by mediated devices.  For all mappings,
+ * the vaddr must be the same in the old and new address space, or can be
+ * changed in the new address space by using VFIO_DMA_MAP_FLAG_VADDR, but in
+ * both cases the old vaddr and address space must map to the same memory
+ * object as the new vaddr and address space.  Length and access permissions
+ * cannot be changed, and the object must be mapped shared.  Tasks must not
+ * modify the old or new address space over the affected ranges during this
+ * ioctl, else differences might not be detected, and dma may target the wrong
+ * user pages.
+ *
+ * Return:
+ *	      0: success
+ *       -ESRCH: owning task is dead.
+ *	-ENOMEM: Out of memory, or RLIMIT_MEMLOCK is too low.
+ *	 -ENXIO: Memory object does not match or is not shared.
+ *	-EINVAL: a new vaddr was provided for some but not all mappings.
+ *
+ * Availability: with VFIO_DMA_OWNER extension.
+ */
+#define VFIO_CHANGE_DMA_OWNER		_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
 
 /**
-- 
1.8.3.1


* [PATCH V1 8/8] vfio/type1: change dma owner
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
                   ` (6 preceding siblings ...)
  2022-12-06 21:55 ` [PATCH V1 7/8] vfio: change dma owner Steve Sistare
@ 2022-12-06 21:55 ` Steve Sistare
  2022-12-07 17:00   ` Jason Gunthorpe
  2022-12-08  7:22   ` Dan Carpenter
  2022-12-07 15:23 ` [PATCH V1 0/8] vfio virtual address update redo Jason Gunthorpe
  8 siblings, 2 replies; 29+ messages in thread
From: Steve Sistare @ 2022-12-06 21:55 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Implement VFIO_CHANGE_DMA_OWNER in the type1 iommu.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 119 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index fbea2b5..55ba1e7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1509,6 +1509,112 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
 	return list_empty(iova);
 }
 
+/*
+ * Return true if mm1 vaddr1 maps the same memory object as mm2 vaddr2.
+ * This does not prevent other tasks from concurrently modifying mappings
+ * and invalidating this test, but that would be an application bug.
+ */
+static bool same_file_mapping(struct mm_struct *mm1, unsigned long vaddr1,
+			      struct mm_struct *mm2, unsigned long vaddr2)
+{
+	const unsigned long mask = VM_READ | VM_WRITE | VM_EXEC | VM_SHARED;
+	struct inode *inode1 = NULL, *inode2 = NULL;
+	unsigned long pgoff1, len1, flags1;
+	unsigned long pgoff2, len2, flags2;
+	struct vm_area_struct *vma;
+
+	mmap_read_lock(mm1);
+	vma = find_vma(mm1, vaddr1);
+	if (vma && vma->vm_file) {
+		inode1 = vma->vm_file->f_inode;
+		pgoff1 = vma->vm_pgoff;
+		len1 = vma->vm_end - vma->vm_start;
+		flags1 = vma->vm_flags & mask;
+	}
+	mmap_read_unlock(mm1);
+
+	mmap_read_lock(mm2);
+	vma = find_vma(mm2, vaddr2);
+	if (vma && vma->vm_file) {
+		inode2 = vma->vm_file->f_inode;
+		pgoff2 = vma->vm_pgoff;
+		len2 = vma->vm_end - vma->vm_start;
+		flags2 = vma->vm_flags & mask;
+	}
+	mmap_read_unlock(mm2);
+
+	if (!inode1 || (inode1 != inode2) || (pgoff1 != pgoff2) ||
+	    (len1 != len2) || (flags1 != flags2)) {
+		pr_debug("vfio vma mismatch for old va %lx vs new va %lx\n",
+			 vaddr1, vaddr2);
+		return false;
+	} else {
+		return true;
+	}
+}
+
+static int change_dma_owner(struct vfio_iommu *iommu)
+{
+	struct rb_node *p;
+	struct vfio_dma *dma;
+	unsigned long new_vaddr;
+	uint64_t npages = 0;
+	int ret = 0, new_vaddrs = 0, total = 0;
+	bool new_lock_cap = capable(CAP_IPC_LOCK);
+	struct mm_struct *old_mm, *new_mm = current->mm;
+	struct task_struct *old_task = iommu->task;
+	struct task_struct *new_task = current->group_leader;
+
+	if (!old_task)
+		return 0;	/* nothing mapped, nothing to do */
+
+	old_mm = get_task_mm(old_task);
+	if (!old_mm)
+		return -ESRCH;
+
+	for (p = rb_first(&iommu->dma_list); p; p = rb_next(p)) {
+		dma = rb_entry(p, struct vfio_dma, node);
+		npages += dma->locked_vm;
+		total++;
+		new_vaddrs += (dma->new_vaddr != 0);
+
+		new_vaddr = dma->new_vaddr ? dma->new_vaddr : dma->vaddr;
+		if (!same_file_mapping(old_mm, dma->vaddr, new_mm, new_vaddr)) {
+			ret = -ENXIO;
+			goto out;
+		}
+	}
+
+	/* If any vaddrs change, all must change */
+	if (new_vaddrs && (total - new_vaddrs)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (npages) {
+		ret = mm_lock_vm(new_task, new_mm, new_lock_cap, npages);
+		if (ret)
+			goto out;
+		/* should always succeed */
+		mm_lock_vm(old_task, old_mm, dma->lock_cap, -npages);
+	}
+
+	for (p = rb_first(&iommu->dma_list); p; p = rb_next(p)) {
+		dma = rb_entry(p, struct vfio_dma, node);
+		dma->lock_cap = new_lock_cap;
+		if (dma->new_vaddr) {
+			dma->vaddr = dma->new_vaddr;
+			dma->new_vaddr = 0;
+		}
+	}
+
+	vfio_iommu_set_task(iommu, new_task);
+
+out:
+	mmput(old_mm);
+	return ret;
+}
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
@@ -2604,6 +2710,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
+	case VFIO_DMA_OWNER:
 		return 1;
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
@@ -2944,6 +3051,17 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+static int vfio_iommu_type1_change_dma_owner(void *iommu_data)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	int ret;
+
+	mutex_lock(&iommu->lock);
+	ret = change_dma_owner(iommu);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static void vfio_iommu_type1_close_dma_owner(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
@@ -3136,6 +3254,7 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
 	.unregister_device	= vfio_iommu_type1_unregister_device,
 	.dma_rw			= vfio_iommu_type1_dma_rw,
 	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
+	.change_dma_owner	= vfio_iommu_type1_change_dma_owner,
 	.close_dma_owner	= vfio_iommu_type1_close_dma_owner,
 };
 
-- 
1.8.3.1


* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
@ 2022-12-06 23:52   ` Alex Williamson
  2022-12-07 14:26     ` Steven Sistare
  2022-12-07 15:19     ` Jason Gunthorpe
  2022-12-07 23:20   ` Jason Gunthorpe
  2022-12-08 22:18   ` Alex Williamson
  2 siblings, 2 replies; 29+ messages in thread
From: Alex Williamson @ 2022-12-06 23:52 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck

On Tue,  6 Dec 2022 13:55:46 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index d7d8e09..5c5cc7e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
...
> @@ -1265,18 +1256,12 @@ struct vfio_bitmap {
>   *
>   * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
>   * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
> - *
> - * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
> - * virtual addresses in the iova range.  Tasks that attempt to translate an
> - * iova's vaddr will block.  DMA to already-mapped pages continues.  This
> - * cannot be combined with the get-dirty-bitmap flag.
>   */
>  struct vfio_iommu_type1_dma_unmap {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>  #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
> -#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)


This flag should probably be marked reserved.

Should we consider this separately for v6.2?

For the remainder, the long term plan is to move to iommufd, so any new
feature of type1 would need equivalent support in iommufd.  Thanks,

Alex


* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-06 23:52   ` Alex Williamson
@ 2022-12-07 14:26     ` Steven Sistare
  2022-12-07 15:14       ` Alex Williamson
  2023-05-17 16:12       ` Jason Gunthorpe
  2022-12-07 15:19     ` Jason Gunthorpe
  1 sibling, 2 replies; 29+ messages in thread
From: Steven Sistare @ 2022-12-07 14:26 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck

On 12/6/2022 6:52 PM, Alex Williamson wrote:
> On Tue,  6 Dec 2022 13:55:46 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index d7d8e09..5c5cc7e 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
> ...
>> @@ -1265,18 +1256,12 @@ struct vfio_bitmap {
>>   *
>>   * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
>>   * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
>> - *
>> - * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
>> - * virtual addresses in the iova range.  Tasks that attempt to translate an
>> - * iova's vaddr will block.  DMA to already-mapped pages continues.  This
>> - * cannot be combined with the get-dirty-bitmap flag.
>>   */
>>  struct vfio_iommu_type1_dma_unmap {
>>  	__u32	argsz;
>>  	__u32	flags;
>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>>  #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
>> -#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
> 
> 
> This flag should probably be marked reserved.
> 
> Should we consider this separately for v6.2?

Ideally I would like all kernels to support either the old or new vaddr interface.
If iommufd + vfio compat does not make 6.2, then I prefer not to delete the old
interface separately.

> For the remainder, the long term plan is to move to iommufd, so any new
> feature of type1 would need equivalent support in iommufd.  Thanks,

Sure.  I will study iommufd and make a proposal.

Will you review these patches as is to give feedback on the approach?

If I show that iommufd and the vfio compat layer can support these interfaces,
are you open to accepting these in v6.2 if iommufd is still a ways off? I see 
iommufd in qemu-next, but not the compat layer.

- Steve


* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-07 14:26     ` Steven Sistare
@ 2022-12-07 15:14       ` Alex Williamson
  2023-05-17 16:12       ` Jason Gunthorpe
  1 sibling, 0 replies; 29+ messages in thread
From: Alex Williamson @ 2022-12-07 15:14 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Jason Gunthorpe

On Wed, 7 Dec 2022 09:26:33 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 12/6/2022 6:52 PM, Alex Williamson wrote:
> > On Tue,  6 Dec 2022 13:55:46 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:  
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index d7d8e09..5c5cc7e 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h  
> > ...  
> >> @@ -1265,18 +1256,12 @@ struct vfio_bitmap {
> >>   *
> >>   * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
> >>   * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
> >> - *
> >> - * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
> >> - * virtual addresses in the iova range.  Tasks that attempt to translate an
> >> - * iova's vaddr will block.  DMA to already-mapped pages continues.  This
> >> - * cannot be combined with the get-dirty-bitmap flag.
> >>   */
> >>  struct vfio_iommu_type1_dma_unmap {
> >>  	__u32	argsz;
> >>  	__u32	flags;
> >>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >>  #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
> >> -#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)  
> > 
> > 
> > This flag should probably be marked reserved.
> > 
> > Should we consider this separately for v6.2?  
> 
> Ideally I would like all kernels to support either the old or new vaddr interface.
> If iommufd + vfio compat does not make 6.2, then I prefer not to delete the old
> interface separately.

You've identified a couple of vulnerabilities with the current
implementation in the cover letter; userspace can detect the presence
of the feature, so removing it should only remove the capability
without otherwise breaking anything; no upstream userspace seems to
support this; and downstreams that wish to continue support can of
course revert the removal.  So as far as the upstream kernel is
concerned, this is dead, risky code.
 
> > For the remainder, the long term plan is to move to iommufd, so any new
> > feature of type1 would need equivalent support in iommufd.  Thanks,  
> 
> Sure.  I will study iommufd and make a proposal.
> 
> Will you review these patches as is to give feedback on the approach?

Yup, it's in my queue.
 
> If I show that iommufd and the vfio compat layer can support these interfaces,
> are you open to accepting these in v6.2 if iommufd is still a ways off? I see 
> iommufd in qemu-next, but not the compat layer.

It's too late for this sort of feature for v6.2, the merge window is
only about 4 days away.  I think the removal can be done with little
risk though.  Thanks,

Alex


* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-06 23:52   ` Alex Williamson
  2022-12-07 14:26     ` Steven Sistare
@ 2022-12-07 15:19     ` Jason Gunthorpe
  2022-12-08 19:09       ` Steven Sistare
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 15:19 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Steve Sistare, kvm, Cornelia Huck

On Tue, Dec 06, 2022 at 04:52:32PM -0700, Alex Williamson wrote:
> On Tue,  6 Dec 2022 13:55:46 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index d7d8e09..5c5cc7e 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> ...
> > @@ -1265,18 +1256,12 @@ struct vfio_bitmap {
> >   *
> >   * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
> >   * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
> > - *
> > - * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
> > - * virtual addresses in the iova range.  Tasks that attempt to translate an
> > - * iova's vaddr will block.  DMA to already-mapped pages continues.  This
> > - * cannot be combined with the get-dirty-bitmap flag.
> >   */
> >  struct vfio_iommu_type1_dma_unmap {
> >  	__u32	argsz;
> >  	__u32	flags;
> >  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >  #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
> > -#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
> 
> 
> This flag should probably be marked reserved.
> 
> Should we consider this separately for v6.2?

I think we should merge this immediately, given the security problem.

> For the remainder, the long term plan is to move to iommufd, so any new
> feature of type1 would need equivalent support in iommufd.  Thanks,

At a bare minimum nothing should be merged to type1 that doesn't come
along with an iommufd implementation too.

IMHO at this point we should not be changing type1 any more - just do
it iommufd only please. No reason to write and review everything
twice.

Jason


* Re: [PATCH V1 0/8] vfio virtual address update redo
  2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
                   ` (7 preceding siblings ...)
  2022-12-06 21:55 ` [PATCH V1 8/8] vfio/type1: " Steve Sistare
@ 2022-12-07 15:23 ` Jason Gunthorpe
  8 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 15:23 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Tue, Dec 06, 2022 at 01:55:45PM -0800, Steve Sistare wrote:

> Lastly, if a task exits or execs, and it still owns any dma mappings, they
> are unmapped and unpinned.  This guarantees that pages do not remain pinned
> indefinitely if a vfio descriptor is leaked to another process, and requires
> tasks to explicitly transfer ownership of dma (and hence locked_vm) to a new
> task and mm when continued operation is desired.  The vfio driver maps a
> special vma so it can detect exit and exec, via the vm_operations_struct
> close callback.

I don't think any of this is necessary. If a VFIO FD is "leaked" to
another process there are many hostile things that process can do
beyond invoke this new ioctl. Considering the complexity I prefer to
drop this.

Jason


* Re: [PATCH V1 2/8] vfio/type1: dma owner permission
  2022-12-06 21:55 ` [PATCH V1 2/8] vfio/type1: dma owner permission Steve Sistare
@ 2022-12-07 15:28   ` Jason Gunthorpe
  2022-12-08 20:13     ` Steven Sistare
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 15:28 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Tue, Dec 06, 2022 at 01:55:47PM -0800, Steve Sistare wrote:
> The first task to pin any pages becomes the dma owner, and becomes the only
> task allowed to pin.  This prevents an application from exceeding the
> initial task's RLIMIT_MEMLOCK by fork'ing and pinning in children.

We do not need to play games with the RLIMIT here - RLIMIT is
inherently insecure and if fork is available then the process can blow
past the sandbox limit. There is nothing we can do to prevent this in
the kernel, so don't even try.

iommufd offers the user based limit tracking which prevents this
properly.

And we are working on cgroup based limit tracking that is the best
option to solve this problem.

I would rather see us focus on the cgroup stuff than this.

Jason


* Re: [PATCH V1 7/8] vfio: change dma owner
  2022-12-06 21:55 ` [PATCH V1 7/8] vfio: change dma owner Steve Sistare
@ 2022-12-07 16:58   ` Jason Gunthorpe
  2022-12-08 16:48     ` Steven Sistare
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 16:58 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Tue, Dec 06, 2022 at 01:55:52PM -0800, Steve Sistare wrote:

> +/**
> + * VFIO_CHANGE_DMA_OWNER		_IO(VFIO_TYPE, VFIO_BASE + 22)
> + *
> + * Change ownership of all dma mappings to the calling task, including
> + * count of locked pages subject to RLIMIT_MEMLOCK.  The new task's address
> + * space is used to translate virtual to physical addresses for all future
> + * requests, including those issued by mediated devices.  For all mappings,
> + * the vaddr must be the same in the old and new address space, or can be
> + * changed in the new address space by using VFIO_DMA_MAP_FLAG_VADDR, but in
> + * both cases the old vaddr and address space must map to the same memory
> + * object as the new vaddr and address space.  Length and access permissions
> + * cannot be changed, and the object must be mapped shared.  Tasks must not
> + * modify the old or new address space over the affected ranges during this
> + * ioctl, else differences might not be detected, and dma may target the wrong
> + * user pages.
> + *
> + * Return:
> + *	      0: success
> + *       -ESRCH: owning task is dead.
> + *	-ENOMEM: Out of memory, or RLIMIT_MEMLOCK is too low.
> + *	 -ENXIO: Memory object does not match or is not shared.
> + *	-EINVAL: a new vaddr was provided for some but not all mappings.

I whipped up a quick implementation for iommufd, but this made my
brain hurt.

If the change can fail, we can get stuck in a situation where we
cannot revert and the fork cannot be exited; basically, qemu can crash.

What we really want is for the change to be unfailable, which can
happen in iommufd's accounting modes of user and in future cgroup if
the user/cgroup are not being changed - for the rlimit mode it is also
reliable if the user process does not do something to disturb the
pinning or the rlimit setting.

We can make the problem less bad by making the whole thing atomic at
least.

Anyhow, I came up with this thing. Needs a bit of polishing, the
design is a bit odd for performance reasons, and I only compiled it.

diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 31577e9d434f87..b64ea75917fbf4 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -51,7 +51,10 @@ int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 	if (rc)
 		goto out_table;
+
+	down_read(&ucmd->ictx->ioas_creation_lock);
 	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+	up_read(&ucmd->ictx->ioas_creation_lock);
 	return 0;
 
 out_table:
@@ -319,6 +322,213 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static void iommufd_release_all_iova_rwsem(struct iommufd_ctx *ictx,
+				      struct xarray *ioas_list)
+{
+	struct iommufd_ioas *ioas;
+	unsigned long index;
+
+	xa_for_each(ioas_list, index, ioas) {
+		up_write(&ioas->iopt.iova_rwsem);
+		iommufd_object_destroy_user(ictx, &ioas->obj);
+	}
+	up_write(&ictx->ioas_creation_lock);
+	xa_destroy(ioas_list);
+}
+
+static int iommufd_take_all_iova_rwsem(struct iommufd_ctx *ictx,
+				       struct xarray *ioas_list)
+{
+	struct iommufd_object *obj;
+	unsigned long index;
+	int rc;
+
+	/*
+	 * This is very ugly, it is done instead of adding a lock around
+	 * pages->source_mm, which is a performance path for mdev, we just
+	 * obtain the write side of all the iova_rwsems which also protects the
+	 * pages->source_*. Due to copies we can't know which IOAS could read
+	 * from the pages, so we just lock everything. This is the only place
+	 * locks are nested and they are uniformly taken in ID order.
+	 *
+	 * ioas_creation_lock prevents new IOAS from being installed in the
+	 * xarray while we do this, and also prevents more than one thread from
+	 * holding nested locks.
+	 */
+	down_write(&ictx->ioas_creation_lock);
+	xa_lock(&ictx->objects);
+	/* FIXME: Can we somehow tell lockdep just to ignore the one lock? */
+	lockdep_off();
+	xa_for_each(&ictx->objects, index, obj) {
+		struct iommufd_ioas *ioas;
+
+		if (!obj || obj->type != IOMMUFD_OBJ_IOAS)
+			continue;
+
+		if (!refcount_inc_not_zero(&obj->users))
+			continue;
+		xa_unlock(&ictx->objects);
+
+		ioas = container_of(obj, struct iommufd_ioas, obj);
+		down_write(&ioas->iopt.iova_rwsem);
+
+		rc = xa_err(xa_store(ioas_list, index, ioas, GFP_KERNEL));
+		if (rc) {
+			lockdep_on();
+			iommufd_release_all_iova_rwsem(ictx, ioas_list);
+			return rc;
+		}
+
+		xa_lock(&ictx->objects);
+	}
+	lockdep_on();
+	xa_unlock(&ictx->objects);
+	return 0;
+}
+
+static bool need_charge_update(struct iopt_pages *pages)
+{
+	if (pages->source_task == current->group_leader &&
+	    pages->source_mm == current->mm &&
+	    pages->source_user == current_user())
+		return false;
+
+	switch (pages->account_mode) {
+	case IOPT_PAGES_ACCOUNT_NONE:
+		return false;
+	case IOPT_PAGES_ACCOUNT_USER:
+		if (pages->source_user == current_user())
+			return false;
+		break;
+	case IOPT_PAGES_ACCOUNT_MM:
+		if (pages->source_mm == current->mm)
+			return false;
+	}
+	return true;
+}
+
+/* FIXME put me someplace nice */
+#define IOPT_PAGES_ACCOUNT_MODE_NUM 3
+
+/* FIXME this cross call is a bit hacky, but maybe the best */
+struct pfn_reader_user;
+int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
+			    bool inc, struct pfn_reader_user *user);
+
+static int charge_current(unsigned long *npinned)
+{
+	struct iopt_pages tmp = {
+		.source_mm = current->mm,
+		.source_task = current->group_leader,
+		.source_user = current_user(),
+	};
+	unsigned int account_mode;
+	int rc;
+
+	for (account_mode = 0; account_mode != IOPT_PAGES_ACCOUNT_MODE_NUM;
+	     account_mode++) {
+		if (!npinned[account_mode])
+			continue;
+
+		tmp.account_mode = account_mode;
+		rc = do_update_pinned(&tmp, npinned[account_mode], true, NULL);
+		if (rc)
+			goto err_undo;
+	}
+	return 0;
+
+err_undo:
+	while (account_mode != 0) {
+		account_mode--;
+		tmp.account_mode = account_mode;
+		do_update_pinned(&tmp, npinned[account_mode], false, NULL);
+	}
+	return rc;
+}
+
+static void change_mm(struct iopt_pages *pages)
+{
+	struct task_struct *old_task = pages->source_task;
+	struct user_struct *old_user = pages->source_user;
+	struct mm_struct *old_mm = pages->source_mm;
+
+	/* Uncharge the old one */
+	do_update_pinned(pages, pages->npinned, false, NULL);
+
+	pages->source_mm = current->mm;
+	mmgrab(pages->source_mm);
+	mmput(old_mm);
+
+	pages->source_task = current->group_leader;
+	get_task_struct(pages->source_task);
+	put_task_struct(old_task);
+
+	pages->source_user = get_uid(current_user());
+	free_uid(old_user);
+}
+
+int iommufd_ioas_change_process(struct iommufd_ucmd *ucmd)
+{
+	struct iommufd_ctx *ictx = ucmd->ictx;
+	struct iommufd_ioas *ioas;
+	struct xarray ioas_list;
+	unsigned long all_npinned[IOPT_PAGES_ACCOUNT_MODE_NUM] = {};
+	unsigned long index;
+	int rc;
+
+	xa_init(&ioas_list);
+	rc = iommufd_take_all_iova_rwsem(ictx, &ioas_list);
+	if (rc)
+		return rc;
+
+	/*
+	 * Figure out how many pages we need to charge to current so we can
+	 * charge them all at once.
+	 */
+	xa_for_each(&ioas_list, index, ioas) {
+		struct iopt_area *area;
+
+		for (area = iopt_area_iter_first(&ioas->iopt, 0, ULONG_MAX);
+		     area; area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			struct iopt_pages *pages = area->pages;
+
+			if (!need_charge_update(pages))
+				continue;
+
+			all_npinned[pages->account_mode] += pages->last_npinned;
+
+			/*
+			 * Abuse last_npinned to keep track of duplicated pages.
+			 * Since we are under all the locks npinned ==
+			 * last_npinned
+			 */
+			pages->last_npinned = 0;
+		}
+	}
+
+	rc = charge_current(all_npinned);
+
+	xa_for_each(&ioas_list, index, ioas) {
+		struct iopt_area *area;
+
+		for (area = iopt_area_iter_first(&ioas->iopt, 0, ULONG_MAX);
+		     area; area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			struct iopt_pages *pages = area->pages;
+
+			if (!need_charge_update(pages))
+				continue;
+
+			/* Always need to fix last_npinned */
+			pages->last_npinned = pages->npinned;
+			if (!rc)
+				change_mm(pages);
+		}
+	}
+
+	iommufd_release_all_iova_rwsem(ictx, &ioas_list);
+	return rc;
+}
+
 int iommufd_option_rlimit_mode(struct iommu_option *cmd,
 			       struct iommufd_ctx *ictx)
 {
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 222e86591f8ac9..a8bf3badd973d0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -16,6 +16,7 @@ struct iommu_option;
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
+	struct rw_semaphore ioas_creation_lock;
 
 	u8 account_mode;
 	struct iommufd_ioas *vfio_ioas;
@@ -223,6 +224,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj);
 int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_change_process(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 083e6fcbe10ad9..9a006acaa626f0 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -182,6 +182,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
 		pr_info_once("IOMMUFD is providing /dev/vfio/vfio, not VFIO.\n");
 	}
 
+	init_rwsem(&ictx->ioas_creation_lock);
 	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT);
 	ictx->file = filp;
 	filp->private_data = ictx;
@@ -282,6 +283,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 struct iommu_ioas_alloc, out_ioas_id),
 	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
 		 struct iommu_ioas_allow_iovas, allowed_iovas),
+	IOCTL_OP(IOMMUFD_CMD_IOAS_CHANGE_PROCESS, iommufd_ioas_change_process,
+		 struct iommu_ioas_change_process, size),
 	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
 		 src_iova),
 	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index c771772296485f..12b8bda7d88136 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -859,7 +859,7 @@ static int update_mm_locked_vm(struct iopt_pages *pages, unsigned long npages,
 	return rc;
 }
 
-static int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
+int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
 			    bool inc, struct pfn_reader_user *user)
 {
 	int rc = 0;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 98ebba80cfa1fc..8919f108a01897 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -45,6 +45,7 @@ enum {
 	IOMMUFD_CMD_IOAS_UNMAP,
 	IOMMUFD_CMD_OPTION,
 	IOMMUFD_CMD_VFIO_IOAS,
+	IOMMUFD_CMD_IOAS_CHANGE_PROCESS,
 };
 
 /**
@@ -344,4 +345,27 @@ struct iommu_vfio_ioas {
 	__u16 __reserved;
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+
+/**
+ * struct iommu_ioas_change_process - ioctl(IOMMU_IOAS_CHANGE_PROCESS)
+ * @size: sizeof(struct iommu_ioas_change_process)
+ *
+ * This changes the process backing the memory maps for every memory map
+ * created in every IOAS in the context. If it fails then nothing changes.
+ *
+ * This will never fail if IOMMU_OPTION_RLIMIT_MODE is set to user and the two
+ * processes belong to the same user.
+ *
+ * This API is useful to support a re-exec flow where a single process owns all
+ * the mappings and wishes to exec() itself into a new process. The new process
+ * should re-establish the same mappings at the same VA. If the user wishes to
+ * retain the same PID then it should fork a temporary process to hold the
+ * mappings while exec and remap is ongoing in the primary PID.
+ */
+struct iommu_ioas_change_process {
+	__u32 size;
+};
+#define IOMMU_IOAS_CHANGE_PROCESS \
+	_IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_CHANGE_PROCESS)
+
 #endif

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH V1 8/8] vfio/type1: change dma owner
  2022-12-06 21:55 ` [PATCH V1 8/8] vfio/type1: " Steve Sistare
@ 2022-12-07 17:00   ` Jason Gunthorpe
  2022-12-08  7:22   ` Dan Carpenter
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 17:00 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Tue, Dec 06, 2022 at 01:55:53PM -0800, Steve Sistare wrote:
> Implement VFIO_CHANGE_DMA_OWNER in the type1 iommu.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 119 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 119 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index fbea2b5..55ba1e7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1509,6 +1509,112 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>  	return list_empty(iova);
>  }
>  
> +/*
> + * Return true if mm1 vaddr1 maps the same memory object as mm2 vaddr2.
> + * This does not prevent other tasks from concurrently modifying mappings
> + * and invalidating this test, but that would be an application bug.
> + */

And so the only reason to do this is to help 'self test' qemu that it
isn't doing something wrong, because it obviously doesn't protect the
kernel from security issues/etc.

I probably would not bother, but if you do it then it should be an
option to spend these cycles, so a debugged qemu can avoid it.

Jason


* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
  2022-12-06 23:52   ` Alex Williamson
@ 2022-12-07 23:20   ` Jason Gunthorpe
  2022-12-08 22:18   ` Alex Williamson
  2 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 23:20 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Tue, Dec 06, 2022 at 01:55:46PM -0800, Steve Sistare wrote:
> Delete the interfaces that allow an iova range to be re-mapped in a new
> address space.  They allow userland to indefinitely block vfio mediated
> device kernel threads, and do not propagate the locked_vm count to a
> new mm.
> 
>   - disable the VFIO_UPDATE_VADDR extension
>   - delete VFIO_DMA_UNMAP_FLAG_VADDR
>   - delete most of VFIO_DMA_MAP_FLAG_VADDR (but keep some for use in a
>     new implementation in a subsequent patch).
> 
> Revert most of the code of these commits:
> 
>   441e810 ("vfio: interfaces to update vaddr")
>   c3cbab2 ("vfio/type1: implement interfaces to update vaddr")
>   898b9ea ("vfio/type1: block on invalid vaddr")
> 
> Revert these commits.  They are harmless, but no longer used after the
> above are reverted, and this kind of functionality is better handled by
> adding new methods to vfio_iommu_driver_ops.
> 
>   ec5e329 ("vfio: iommu driver notify callback")
>   487ace1 ("vfio/type1: implement notify callback")
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/container.c        |   5 --
>  drivers/vfio/vfio.h             |   7 --
>  drivers/vfio/vfio_iommu_type1.c | 144 ++--------------------------------------
>  include/uapi/linux/vfio.h       |  17 +----
>  4 files changed, 8 insertions(+), 165 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH V1 8/8] vfio/type1: change dma owner
  2022-12-06 21:55 ` [PATCH V1 8/8] vfio/type1: " Steve Sistare
  2022-12-07 17:00   ` Jason Gunthorpe
@ 2022-12-08  7:22   ` Dan Carpenter
  1 sibling, 0 replies; 29+ messages in thread
From: Dan Carpenter @ 2022-12-08  7:22 UTC (permalink / raw)
  To: oe-kbuild, Steve Sistare, kvm
  Cc: lkp, oe-kbuild-all, Alex Williamson, Cornelia Huck, Steve Sistare

Hi Steve,

[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steve-Sistare/vfio-virtual-address-update-redo/20221207-055735
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/1670363753-249738-9-git-send-email-steven.sistare%40oracle.com
patch subject: [PATCH V1 8/8] vfio/type1: change dma owner
config: i386-randconfig-m021
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <error27@gmail.com>

New smatch warnings:
drivers/vfio/vfio_iommu_type1.c:1546 same_file_mapping() error: uninitialized symbol 'pgoff2'.
drivers/vfio/vfio_iommu_type1.c:1547 same_file_mapping() error: uninitialized symbol 'len2'.
drivers/vfio/vfio_iommu_type1.c:1547 same_file_mapping() error: uninitialized symbol 'flags2'.

vim +/pgoff2 +1546 drivers/vfio/vfio_iommu_type1.c

f42f5cc4de6087 Steve Sistare    2022-12-06  1517  static bool same_file_mapping(struct mm_struct *mm1, unsigned long vaddr1,
f42f5cc4de6087 Steve Sistare    2022-12-06  1518  			      struct mm_struct *mm2, unsigned long vaddr2)
f42f5cc4de6087 Steve Sistare    2022-12-06  1519  {
f42f5cc4de6087 Steve Sistare    2022-12-06  1520  	const unsigned long mask = VM_READ | VM_WRITE | VM_EXEC | VM_SHARED;
f42f5cc4de6087 Steve Sistare    2022-12-06  1521  	struct inode *inode1 = NULL, *inode2 = NULL;
f42f5cc4de6087 Steve Sistare    2022-12-06  1522  	unsigned long pgoff1, len1, flags1;
f42f5cc4de6087 Steve Sistare    2022-12-06  1523  	unsigned long pgoff2, len2, flags2;
f42f5cc4de6087 Steve Sistare    2022-12-06  1524  	struct vm_area_struct *vma;
f42f5cc4de6087 Steve Sistare    2022-12-06  1525  
f42f5cc4de6087 Steve Sistare    2022-12-06  1526  	mmap_read_lock(mm1);
f42f5cc4de6087 Steve Sistare    2022-12-06  1527  	vma = find_vma(mm1, vaddr1);
f42f5cc4de6087 Steve Sistare    2022-12-06  1528  	if (vma && vma->vm_file) {
f42f5cc4de6087 Steve Sistare    2022-12-06  1529  		inode1 = vma->vm_file->f_inode;
f42f5cc4de6087 Steve Sistare    2022-12-06  1530  		pgoff1 = vma->vm_pgoff;
f42f5cc4de6087 Steve Sistare    2022-12-06  1531  		len1 = vma->vm_end - vma->vm_start;
f42f5cc4de6087 Steve Sistare    2022-12-06  1532  		flags1 = vma->vm_flags & mask;
f42f5cc4de6087 Steve Sistare    2022-12-06  1533  	}
f42f5cc4de6087 Steve Sistare    2022-12-06  1534  	mmap_read_unlock(mm1);
f42f5cc4de6087 Steve Sistare    2022-12-06  1535  
f42f5cc4de6087 Steve Sistare    2022-12-06  1536  	mmap_read_lock(mm2);
f42f5cc4de6087 Steve Sistare    2022-12-06  1537  	vma = find_vma(mm2, vaddr2);
f42f5cc4de6087 Steve Sistare    2022-12-06  1538  	if (vma && vma->vm_file) {
f42f5cc4de6087 Steve Sistare    2022-12-06  1539  		inode2 = vma->vm_file->f_inode;
f42f5cc4de6087 Steve Sistare    2022-12-06  1540  		pgoff2 = vma->vm_pgoff;
f42f5cc4de6087 Steve Sistare    2022-12-06  1541  		len2 = vma->vm_end - vma->vm_start;
f42f5cc4de6087 Steve Sistare    2022-12-06  1542  		flags2 = vma->vm_flags & mask;
f42f5cc4de6087 Steve Sistare    2022-12-06  1543  	}
f42f5cc4de6087 Steve Sistare    2022-12-06  1544  	mmap_read_unlock(mm2);
f42f5cc4de6087 Steve Sistare    2022-12-06  1545  
f42f5cc4de6087 Steve Sistare    2022-12-06 @1546  	if (!inode1 || (inode1 != inode2) || (pgoff1 != pgoff2) ||

Presumably the combination of checking !inode1 and inode1 != inode2
prevents an uninitialized variable use, but it's not clear.

f42f5cc4de6087 Steve Sistare    2022-12-06 @1547  	    (len1 != len2) || (flags1 != flags2)) {
f42f5cc4de6087 Steve Sistare    2022-12-06  1548  		pr_debug("vfio vma mismatch for old va %lx vs new va %lx\n",
f42f5cc4de6087 Steve Sistare    2022-12-06  1549  			 vaddr1, vaddr2);
f42f5cc4de6087 Steve Sistare    2022-12-06  1550  		return false;
f42f5cc4de6087 Steve Sistare    2022-12-06  1551  	} else {
f42f5cc4de6087 Steve Sistare    2022-12-06  1552  		return true;
f42f5cc4de6087 Steve Sistare    2022-12-06  1553  	}
f42f5cc4de6087 Steve Sistare    2022-12-06  1554  }
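
[Editor's sketch] One way to make the comparison unambiguous for both smatch and human readers (a sketch, not the actual patch) is to gather each vma's identity into a zero-initialized summary struct with an explicit validity flag, so the comparison never touches indeterminate values:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the fields same_file_mapping() compares;
 * always fully initialized, with valid == false when no file-backed
 * vma was found at the address. */
struct vma_summary {
	bool valid;
	const void *inode;       /* stands in for vma->vm_file->f_inode */
	unsigned long pgoff;
	unsigned long len;
	unsigned long flags;
};

/* Mappings match only if both lookups succeeded and all identity
 * fields agree; no field is ever read uninitialized. */
static bool summaries_match(const struct vma_summary *a,
			    const struct vma_summary *b)
{
	return a->valid && b->valid &&
	       a->inode == b->inode && a->pgoff == b->pgoff &&
	       a->len == b->len && a->flags == b->flags;
}
```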

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp



* Re: [PATCH V1 7/8] vfio: change dma owner
  2022-12-07 16:58   ` Jason Gunthorpe
@ 2022-12-08 16:48     ` Steven Sistare
  2022-12-08 17:15       ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Sistare @ 2022-12-08 16:48 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, Alex Williamson, Cornelia Huck

On 12/7/2022 11:58 AM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 01:55:52PM -0800, Steve Sistare wrote:
> 
>> +/**
>> + * VFIO_CHANGE_DMA_OWNER		_IO(VFIO_TYPE, VFIO_BASE + 22)
>> + *
>> + * Change ownership of all dma mappings to the calling task, including
>> + * count of locked pages subject to RLIMIT_MEMLOCK.  The new task's address
>> + * space is used to translate virtual to physical addresses for all future
>> + * requests, including as those issued by mediated devices.  For all mappings,
>> + * the vaddr must be the same in the old and new address space, or can be
>> + * changed in the new address space by using VFIO_DMA_MAP_FLAG_VADDR, but in
>> + * both cases the old vaddr and address space must map to the same memory
>> + * object as the new vaddr and address space.  Length and access permissions
>> + * cannot be changed, and the object must be mapped shared.  Tasks must not
>> + * modify the old or new address space over the affected ranges during this
>> + * ioctl, else differences might not be detected, and dma may target the wrong
>> + * user pages.
>> + *
>> + * Return:
>> + *	      0: success
>> + *       -ESRCH: owning task is dead.
>> + *	-ENOMEM: Out of memory, or RLIMIT_MEMLOCK is too low.
>> + *	 -ENXIO: Memory object does not match or is not shared.
>> + *	-EINVAL: a new vaddr was provided for some but not all mappings.
> 
> I whipped up a quick implementation for iommufd, but this made my
> brain hurt.
> 
> If the change can fail then we can get stuck in a situation where we
> cannot revert and the fork cannot be exited, basically qemu can crash.
> 
> What we really want is for the change to be unfailable, which can
> happen in iommufd's accounting modes of user and in future cgroup if
> the user/cgroup are not being changed - for the rlimit mode it is also
> reliable if the user process does not do something to disturb the
> pinning or the rlimit setting.
> 
> We can make the problem less bad by making the whole thing atomic at
> least.

EINVAL is returned above due to an application error.  The app failed to 
provide a new vaddr for all dma structs.  It is unrelated to limits, and
no changes are made to ownership or limits.  The implementation is atomic;
all are changed, or none are changed.

However, the app is indeed stuck, since it does not know where it screwed up,
and there is no interface to cancel the new vaddrs it has registered, so
an attempt to VFIO_CHANGE_DMA_OWNER back to the original process will fail. This
can be fixed by passing an array of (iova, new_vaddr) to VFIO_CHANGE_DMA_OWNER,
rather than pre-staging the new vaddrs one at a time with VFIO_DMA_MAP_FLAG_VADDR.
I debated whether or not to define that slightly more elaborate interface to 
accommodate a buggy application.
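
[Editor's sketch] Such an interface might look something like the following (struct layout and names are purely illustrative): the caller passes one entry per mapping, and the kernel validates the full set before touching anything, so a botched call leaves all state intact and can simply be retried:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative uapi shape: one entry per dma mapping to relocate,
 * sorted by iova by userspace. */
struct change_owner_entry {
	unsigned long iova;
	unsigned long new_vaddr;
};

/* Validate entries against the sorted list of existing mapping iovas:
 * every mapping must have exactly one entry.  Returns 0 on success,
 * -EINVAL otherwise; never modifies any state. */
static int validate_entries(const struct change_owner_entry *ents,
			    size_t nents,
			    const unsigned long *iovas, size_t niovas)
{
	size_t i;

	if (nents != niovas)
		return -EINVAL;
	for (i = 0; i < nents; i++)
		if (ents[i].iova != iovas[i])
			return -EINVAL;
	return 0;
}
```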

> Anyhow, I came up with this thing. Needs a bit of polishing, the
> design is a bit odd for performance reasons, and I only compiled it.

Thanks, I'll pull an iommufd development environment together and try it.
However, it will also need an interface to change vaddr for each dma region.
In general the vaddr will be different when the memory object is re-mapped 
after exec.
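
[Editor's sketch] That point is easy to demonstrate in plain userspace, with no vfio involved: remapping a shared memory object (memfd_create below stands in for the object preserved across exec; Linux-specific) typically yields a different vaddr, yet both addresses reach the same pages:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the same shared object twice, as a re-exec'd process would after
 * reopening its preserved memory object; both mappings share pages but
 * the kernel places them at distinct addresses. */
static void map_twice(int fd, size_t len, char **a, char **b)
{
	*a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	*b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```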

- Steve

> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 31577e9d434f87..b64ea75917fbf4 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -51,7 +51,10 @@ int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
>  	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
>  	if (rc)
>  		goto out_table;
> +
> +	down_read(&ucmd->ictx->ioas_creation_lock);
>  	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
> +	up_read(&ucmd->ictx->ioas_creation_lock);
>  	return 0;
>  
>  out_table:
> @@ -319,6 +322,213 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
>  	return rc;
>  }
>  
> +static void iommufd_release_all_iova_rwsem(struct iommufd_ctx *ictx,
> +				      struct xarray *ioas_list)
> +{
> +	struct iommufd_ioas *ioas;
> +	unsigned long index;
> +
> +	xa_for_each(ioas_list, index, ioas) {
> +		up_write(&ioas->iopt.iova_rwsem);
> +		iommufd_object_destroy_user(ictx, &ioas->obj);
> +	}
> +	up_write(&ictx->ioas_creation_lock);
> +	xa_destroy(ioas_list);
> +}
> +
> +static int iommufd_take_all_iova_rwsem(struct iommufd_ctx *ictx,
> +				       struct xarray *ioas_list)
> +{
> +	struct iommufd_object *obj;
> +	unsigned long index;
> +	int rc;
> +
> +	/*
> +	 * This is very ugly, it is done instead of adding a lock around
> +	 * pages->source_mm, which is a performance path for mdev, we just
> +	 * obtain the write side of all the iova_rwsems which also protects the
> +	 * pages->source_*. Due to copies we can't know which IOAS could read
> +	 * from the pages, so we just lock everything. This is the only place
> +	 * locks are nested and they are uniformly taken in ID order.
> +	 *
> +	 * ioas_creation_lock prevents new IOAS from being installed in the
> +	 * xarray while we do this, and also prevents more than one thread from
> +	 * holding nested locks.
> +	 */
> +	down_write(&ictx->ioas_creation_lock);
> +	xa_lock(&ictx->objects);
> +	/* FIXME: Can we somehow tell lockdep just to ignore the one lock? */
> +	lockdep_off();
> +	xa_for_each(&ictx->objects, index, obj) {
> +		struct iommufd_ioas *ioas;
> +
> +		if (!obj || obj->type != IOMMUFD_OBJ_IOAS)
> +			continue;
> +
> +		if (!refcount_inc_not_zero(&obj->users))
> +			continue;
> +		xa_unlock(&ictx->objects);
> +
> +		ioas = container_of(obj, struct iommufd_ioas, obj);
> +		down_write(&ioas->iopt.iova_rwsem);
> +
> +		rc = xa_err(xa_store(ioas_list, index, ioas, GFP_KERNEL));
> +		if (rc) {
> +			lockdep_on();
> +			iommufd_release_all_iova_rwsem(ictx, ioas_list);
> +			return rc;
> +		}
> +
> +		xa_lock(&ictx->objects);
> +	}
> +	lockdep_on();
> +	xa_unlock(&ictx->objects);
> +	return 0;
> +}
> +
> +static bool need_charge_update(struct iopt_pages *pages)
> +{
> +	if (pages->source_task == current->group_leader &&
> +	    pages->source_mm == current->mm &&
> +	    pages->source_user == current_user())
> +		return false;
> +
> +	switch (pages->account_mode) {
> +	case IOPT_PAGES_ACCOUNT_NONE:
> +		return false;
> +	case IOPT_PAGES_ACCOUNT_USER:
> +		if (pages->source_user == current_user())
> +			return false;
> +		break;
> +	case IOPT_PAGES_ACCOUNT_MM:
> +		if (pages->source_mm == current->mm)
> +			return false;
> +	}
> +	return true;
> +}
> +
> +/* FIXME put me someplace nice */
> +#define IOPT_PAGES_ACCOUNT_MODE_NUM 3
> +
> +/* FIXME this cross call is a bit hacky, but maybe the best */
> +struct pfn_reader_user;
> +int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
> +			    bool inc, struct pfn_reader_user *user);
> +
> +static int charge_current(unsigned long *npinned)
> +{
> +	struct iopt_pages tmp = {
> +		.source_mm = current->mm,
> +		.source_task = current->group_leader,
> +		.source_user = current_user(),
> +	};
> +	unsigned int account_mode;
> +	int rc;
> +
> +	for (account_mode = 0; account_mode != IOPT_PAGES_ACCOUNT_MODE_NUM;
> +	     account_mode++) {
> +		if (!npinned[account_mode])
> +			continue;
> +
> +		tmp.account_mode = account_mode;
> +		rc = do_update_pinned(&tmp, npinned[account_mode], true, NULL);
> +		if (rc)
> +			goto err_undo;
> +	}
> +	return 0;
> +
> +err_undo:
> +	while (account_mode != 0) {
> +		account_mode--;
> +		tmp.account_mode = account_mode;
> +		do_update_pinned(&tmp, npinned[account_mode], false, NULL);
> +	}
> +	return rc;
> +}
> +
> +static void change_mm(struct iopt_pages *pages)
> +{
> +	struct task_struct *old_task = pages->source_task;
> +	struct user_struct *old_user = pages->source_user;
> +	struct mm_struct *old_mm = pages->source_mm;
> +
> +	/* Uncharge the old one */
> +	do_update_pinned(pages, pages->npinned, false, NULL);
> +
> +	pages->source_mm = current->mm;
> +	mmgrab(pages->source_mm);
> +	mmput(old_mm);
> +
> +	pages->source_task = current->group_leader;
> +	get_task_struct(pages->source_task);
> +	put_task_struct(old_task);
> +
> +	pages->source_user = get_uid(current_user());
> +	free_uid(old_user);
> +}
> +
> +int iommufd_ioas_change_process(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommufd_ctx *ictx = ucmd->ictx;
> +	struct iommufd_ioas *ioas;
> +	struct xarray ioas_list;
> +	unsigned long all_npinned[IOPT_PAGES_ACCOUNT_MODE_NUM] = {};
> +	unsigned long index;
> +	int rc;
> +
> +	xa_init(&ioas_list);
> +	rc = iommufd_take_all_iova_rwsem(ictx, &ioas_list);
> +	if (rc)
> +		return rc;
> +
> +	/*
> +	 * Figure out how many pages we need to charge to current so we can
> +	 * charge them all at once.
> +	 */
> +	xa_for_each(&ioas_list, index, ioas) {
> +		struct iopt_area *area;
> +
> +		for (area = iopt_area_iter_first(&ioas->iopt, 0, ULONG_MAX);
> +		     area; area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +			struct iopt_pages *pages = area->pages;
> +
> +			if (!need_charge_update(pages))
> +				continue;
> +
> +			all_npinned[pages->account_mode] += pages->last_npinned;
> +
> +			/*
> +			 * Abuse last_npinned to keep track of duplicated pages.
> +			 * Since we are under all the locks npinned ==
> +			 * last_npinned
> +			 */
> +			pages->last_npinned = 0;
> +		}
> +	}
> +
> +	rc = charge_current(all_npinned);
> +
> +	xa_for_each(&ioas_list, index, ioas) {
> +		struct iopt_area *area;
> +
> +		for (area = iopt_area_iter_first(&ioas->iopt, 0, ULONG_MAX);
> +		     area; area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +			struct iopt_pages *pages = area->pages;
> +
> +			if (!need_charge_update(pages))
> +				continue;
> +
> +			/* Always need to fix last_npinned */
> +			pages->last_npinned = pages->npinned;
> +			if (!rc)
> +				change_mm(pages);
> +		}
> +	}
> +
> +	iommufd_release_all_iova_rwsem(ictx, &ioas_list);
> +	return rc;
> +}
> +
>  int iommufd_option_rlimit_mode(struct iommu_option *cmd,
>  			       struct iommufd_ctx *ictx)
>  {
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 222e86591f8ac9..a8bf3badd973d0 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -16,6 +16,7 @@ struct iommu_option;
>  struct iommufd_ctx {
>  	struct file *file;
>  	struct xarray objects;
> +	struct rw_semaphore ioas_creation_lock;
>  
>  	u8 account_mode;
>  	struct iommufd_ioas *vfio_ioas;
> @@ -223,6 +224,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj);
>  int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
>  int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
>  int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_change_process(struct iommufd_ucmd *ucmd);
>  int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
>  int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
>  int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 083e6fcbe10ad9..9a006acaa626f0 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -182,6 +182,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
>  		pr_info_once("IOMMUFD is providing /dev/vfio/vfio, not VFIO.\n");
>  	}
>  
> +	init_rwsem(&ictx->ioas_creation_lock);
>  	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT);
>  	ictx->file = filp;
>  	filp->private_data = ictx;
> @@ -282,6 +283,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>  		 struct iommu_ioas_alloc, out_ioas_id),
>  	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
>  		 struct iommu_ioas_allow_iovas, allowed_iovas),
> +	IOCTL_OP(IOMMUFD_CMD_IOAS_CHANGE_PROCESS, iommufd_ioas_change_process,
> +		 struct iommu_ioas_change_process, size),
>  	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
>  		 src_iova),
>  	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
> index c771772296485f..12b8bda7d88136 100644
> --- a/drivers/iommu/iommufd/pages.c
> +++ b/drivers/iommu/iommufd/pages.c
> @@ -859,7 +859,7 @@ static int update_mm_locked_vm(struct iopt_pages *pages, unsigned long npages,
>  	return rc;
>  }
>  
> -static int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
> +int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
>  			    bool inc, struct pfn_reader_user *user)
>  {
>  	int rc = 0;
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 98ebba80cfa1fc..8919f108a01897 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -45,6 +45,7 @@ enum {
>  	IOMMUFD_CMD_IOAS_UNMAP,
>  	IOMMUFD_CMD_OPTION,
>  	IOMMUFD_CMD_VFIO_IOAS,
> +	IOMMUFD_CMD_IOAS_CHANGE_PROCESS,
>  };
>  
>  /**
> @@ -344,4 +345,27 @@ struct iommu_vfio_ioas {
>  	__u16 __reserved;
>  };
>  #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
> +
> +/**
> + * struct iommu_ioas_change_process - ioctl(IOMMU_IOAS_CHANGE_PROCESS)
> + * @size: sizeof(struct iommu_ioas_change_process)
> + *
> + * This changes the process backing the memory maps for every memory map
> + * created in every IOAS in the context. If it fails then nothing changes.
> + *
> + * This will never fail if IOMMU_OPTION_RLIMIT_MODE is set to user and the two
> + * processes belong to the same user.
> + *
> + * This API is useful to support a re-exec flow where a single process owns all
> + * the mappings and wishes to exec() itself into a new process. The new process
> + * should re-establish the same mappings at the same VA. If the user wishes to
> + * retain the same PID then it should fork a temporary process to hold the
> + * mappings while exec and remap is ongoing in the primary PID.
> + */
> +struct iommu_ioas_change_process {
> +	__u32 size;
> +};
> +#define IOMMU_IOAS_CHANGE_PROCESS \
> +	_IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_CHANGE_PROCESS)
> +
>  #endif


* Re: [PATCH V1 7/8] vfio: change dma owner
  2022-12-08 16:48     ` Steven Sistare
@ 2022-12-08 17:15       ` Jason Gunthorpe
  2022-12-08 17:39         ` Steven Sistare
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-08 17:15 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Thu, Dec 08, 2022 at 11:48:08AM -0500, Steven Sistare wrote:

> > Anyhow, I came up with this thing. Needs a bit of polishing, the
> > design is a bit odd for performance reasons, and I only compiled it.
> 
> Thanks, I'll pull an iommufd development environment together and try it.
> However, it will also need an interface to change vaddr for each dma region.
> In general the vaddr will be different when the memory object is re-mapped 
> after exec.

Ahh that is yucky :\

So I still like the one-shot approach because it has nice error
handling properties, and it lets us use the hacky, very expensive "stop
the world" locking to avoid slowing the fast paths.

Passing in a sorted list of old_vaddr,new_vaddr is possibly fine, the
kernel can bsearch it as it goes through all the pages objects.
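
A minimal sketch of that lookup, assuming a user-supplied table of pairs sorted
by old_vaddr (all names here are invented for illustration, and libc's bsearch
stands in for whatever search the kernel would actually use):

```c
/* Illustrative sketch (not kernel code): given an array of
 * (old_vaddr, new_vaddr) pairs sorted by old_vaddr, resolve the new
 * vaddr for a pages object in O(log n). */
#include <stdint.h>
#include <stdlib.h>

struct vaddr_remap {
	uint64_t old_vaddr;
	uint64_t new_vaddr;
};

static int remap_cmp(const void *key, const void *elem)
{
	uint64_t k = *(const uint64_t *)key;
	const struct vaddr_remap *r = elem;

	if (k < r->old_vaddr)
		return -1;
	if (k > r->old_vaddr)
		return 1;
	return 0;
}

/* Returns 1 and fills *out on a hit, 0 if @old is not in the table. */
static int lookup_new_vaddr(const struct vaddr_remap *tbl, size_t n,
			    uint64_t old, uint64_t *out)
{
	const struct vaddr_remap *hit =
		bsearch(&old, tbl, n, sizeof(*tbl), remap_cmp);

	if (!hit)
		return 0;
	*out = hit->new_vaddr;
	return 1;
}
```

Because the table is sorted, each pages object resolves its new vaddr in
O(log n), so a single pass over all pages objects stays cheap.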

Due to the way iommufd works, especially with copy, you end up with
the 'pages' handle that holds the vaddr that many different IOVAs may
refer to. So it is kind of weird to ask to change a single IOVA's
mapping, it must always change all the mappings that have been copied
that share vaddr, pin accounting and so forth.

This is another reason why I liked the one-shot global everything
approach, as narrowing the objects to target cannot be done by IOVA -
at best you could target a specific mm and vaddr range.

FWIW, there is a nice selftest in iommufd in
tools/testing/selftests/iommu/iommufd.c and the way to develop
something like this is to add a simple selftest to exercise your
scenario and get everything sorted like that before going to qemu.

Using the vfio compat you can keep the existing qemu vfio type1 and
just hack in a call to the IOMMUFD ioctl in the right spot. No need to
jump to the iommufd version of qemu for testing.

Jason

* Re: [PATCH V1 7/8] vfio: change dma owner
  2022-12-08 17:15       ` Jason Gunthorpe
@ 2022-12-08 17:39         ` Steven Sistare
  2022-12-08 17:46           ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Sistare @ 2022-12-08 17:39 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, Alex Williamson, Cornelia Huck

On 12/8/2022 12:15 PM, Jason Gunthorpe wrote:
> On Thu, Dec 08, 2022 at 11:48:08AM -0500, Steven Sistare wrote:
> 
>>> Anyhow, I came up with this thing. Needs a bit of polishing, the
>>> design is a bit odd for performance reasons, and I only compiled it.
>>
> >> Thanks, I'll pull an iommufd development environment together and try it.
>> However, it will also need an interface to change vaddr for each dma region.
>> In general the vaddr will be different when the memory object is re-mapped 
>> after exec.
> 
> Ahh that is yucky :\
> 
> So I still like the one-shot approach because it has nice error
> handling properties, and it lets us use the hacky, very expensive "stop
> the world" locking to avoid slowing the fast paths.
> 
> Passing in a sorted list of old_vaddr,new_vaddr is possibly fine, the
> kernel can bsearch it as it goes through all the pages objects.

Sorry, I was imprecise. I meant to say, "it will also need an interface to 
change all vaddrs".  Passing an array or list of new vaddrs in a one-shot
ioctl.
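
One possible shape for such a one-shot call, purely as a strawman (none of
these names or fields exist in any kernel; only the size-first convention
follows the existing iommufd uapi style):

```c
/* Strawman uapi for a one-shot "change owner + update all vaddrs"
 * ioctl.  Everything here is invented for discussion. */
#include <stdint.h>

struct vaddr_update {
	uint64_t old_vaddr;	/* vaddr as recorded in the old mm */
	uint64_t new_vaddr;	/* vaddr in the calling process */
};

struct iommu_ioas_change_process_vaddrs {
	uint32_t size;		/* sizeof(this struct) */
	uint32_t nr_updates;	/* entries in the array at @updates */
	uint64_t updates;	/* user pointer to struct vaddr_update[],
				 * sorted ascending by old_vaddr so the
				 * kernel can bsearch it per pages object */
};
```

A pages object whose vaddr is absent from the table would keep its existing
vaddr (the fork case), matching the "same or updated" rule from the cover
letter.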

I hope the "very expensive path" is not horribly slow, as it is in the
critical path for guest pause time.  Right now the pause time is less than
100 millisecs.

- Steve

> Due to the way iommufd works, especially with copy, you end up with
> the 'pages' handle that holds the vaddr that many different IOVAs may
> refer to. So it is kind of weird to ask to change a single IOVA's
> mapping, it must always change all the mappings that have been copied
> that share vaddr, pin accounting and so forth.
> 
> This is another reason why I liked the one-shot global everything
> approach, as narrowing the objects to target cannot be done by IOVA -
> at best you could target a specific mm and vaddr range.
> 
> FWIW, there is a nice selftest in iommufd in
> tools/testing/selftests/iommu/iommufd.c and the way to develop
> something like this is to add a simple selftest to exercise your
> scenario and get everything sorted like that before going to qemu.
> 
> Using the vfio compat you can keep the existing qemu vfio type1 and
> just hack in a call to the IOMMUFD ioctl in the right spot. No need to
> jump to the iommufd version of qemu for testing.
> 
> Jason

* Re: [PATCH V1 7/8] vfio: change dma owner
  2022-12-08 17:39         ` Steven Sistare
@ 2022-12-08 17:46           ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2022-12-08 17:46 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck

On Thu, Dec 08, 2022 at 12:39:19PM -0500, Steven Sistare wrote:
> On 12/8/2022 12:15 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 08, 2022 at 11:48:08AM -0500, Steven Sistare wrote:
> > 
> >>> Anyhow, I came up with this thing. Needs a bit of polishing, the
> >>> design is a bit odd for performance reasons, and I only compiled it.
> >>
> >> Thanks, I'll pull an iommufd development environment together and try it.
> >> However, it will also need an interface to change vaddr for each dma region.
> >> In general the vaddr will be different when the memory object is re-mapped 
> >> after exec.
> > 
> > Ahh that is yucky :\
> > 
> > So I still like the one-shot approach because it has nice error
> > handling properties, and it lets us use the hacky, very expensive "stop
> > the world" locking to avoid slowing the fast paths.
> > 
> > Passing in a sorted list of old_vaddr,new_vaddr is possibly fine, the
> > kernel can bsearch it as it goes through all the pages objects.
> 
> Sorry, I was imprecise. I meant to say, "it will also need an interface to 
> change all vaddrs".  Passing an array or list of new vaddrs in a one-shot
> ioctl.
> 
> I hope the "very expensive path" is not horribly slow, as it is in the
> critical path for guest pause time.  Right now the pause time is less than
> 100 millisecs.

I think compared to 1 ioctl per map it is probably notably faster, it
is just "horribly expensive" to do the ioctl preamble - eg you would
not want to iterate it.

Jason

* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-07 15:19     ` Jason Gunthorpe
@ 2022-12-08 19:09       ` Steven Sistare
  2022-12-08 19:44         ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Sistare @ 2022-12-08 19:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson; +Cc: kvm, Cornelia Huck

On 12/7/2022 10:19 AM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 04:52:32PM -0700, Alex Williamson wrote:
>> On Tue,  6 Dec 2022 13:55:46 -0800
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index d7d8e09..5c5cc7e 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>> ...
>>> @@ -1265,18 +1256,12 @@ struct vfio_bitmap {
>>>   *
>>>   * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
>>>   * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
>>> - *
>>> - * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
>>> - * virtual addresses in the iova range.  Tasks that attempt to translate an
>>> - * iova's vaddr will block.  DMA to already-mapped pages continues.  This
>>> - * cannot be combined with the get-dirty-bitmap flag.
>>>   */
>>>  struct vfio_iommu_type1_dma_unmap {
>>>  	__u32	argsz;
>>>  	__u32	flags;
>>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>>>  #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
>>> -#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
>>
>> This flag should probably be marked reserved.
>>
>> Should we consider this separately for v6.2?
> 
> I think we should merge this immediately, given the security problem.
> 
>> For the remainder, the long term plan is to move to iommufd, so any new
>> feature of type1 would need equivalent support in iommufd.  Thanks,
> 
> At a bare minimum nothing should be merged to type1 that doesn't come
> along with an iommufd implementation too.
> 
> IMHO at this point we should not be changing type1 any more - just do
> it iommufd only please. No reason to write and review everything
> twice.

Alex, your opinion?  Implement in iommufd only, or also in type1?  The latter
makes it more feasible to port to stable kernels and allows qemu with live
update to run on them.  I imagine porting iommufd to a stable kernel would be
a heavy lift, and would be considered too risky.

- Steve

* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-08 19:09       ` Steven Sistare
@ 2022-12-08 19:44         ` Alex Williamson
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Williamson @ 2022-12-08 19:44 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Jason Gunthorpe, kvm, Cornelia Huck

On Thu, 8 Dec 2022 14:09:44 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 12/7/2022 10:19 AM, Jason Gunthorpe wrote:
> > On Tue, Dec 06, 2022 at 04:52:32PM -0700, Alex Williamson wrote:  
> >> On Tue,  6 Dec 2022 13:55:46 -0800
> >> Steve Sistare <steven.sistare@oracle.com> wrote:  
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index d7d8e09..5c5cc7e 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h  
> >> ...  
> >>> @@ -1265,18 +1256,12 @@ struct vfio_bitmap {
> >>>   *
> >>>   * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
> >>>   * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
> >>> - *
> >>> - * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
> >>> - * virtual addresses in the iova range.  Tasks that attempt to translate an
> >>> - * iova's vaddr will block.  DMA to already-mapped pages continues.  This
> >>> - * cannot be combined with the get-dirty-bitmap flag.
> >>>   */
> >>>  struct vfio_iommu_type1_dma_unmap {
> >>>  	__u32	argsz;
> >>>  	__u32	flags;
> >>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >>>  #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
> >>> -#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)  
> >>
> >> This flag should probably be marked reserved.
> >>
> >> Should we consider this separately for v6.2?  
> > 
> > I think we should merge this immediately, given the security problem.
> >   
> >> For the remainder, the long term plan is to move to iommufd, so any new
> >> feature of type1 would need equivalent support in iommufd.  Thanks,  
> > 
> > At a bare minimum nothing should be merged to type1 that doesn't come
> > along with an iommufd implementation too.
> > 
> > IMHO at this point we should not be changing type1 any more - just do
> > it iommufd only please. No reason to write and review everything
> > twice.  
> 
> Alex, your opinion?  Implement in iommufd only, or also in type1?  The latter
> makes it more feasible to port to stable kernels and allows qemu with live
> update to run on them.  I imagine porting iommufd to a stable kernel would be
> a heavy lift, and would be considered too risky.

I understand your concerns, but this isn't really an upstream stable
kernel discussion.  The only thing relevant to an upstream stable
kernel is the removal of the old, vulnerable interface, which I'm
preparing to queue for v6.2.  The new re-implementation isn't eligible
for upstream stable backports, imo.

So I suspect the only stable kernel relevant to the new implementation
is a downstream stable kernel.  While I agree that backporting iommufd
to get this feature is a heavy lift, the alternative is asking upstream
QEMU and kernel to accept and maintain a separate interface in a
backend slated for deprecation.  That's a lot.

I expect you'll be in good company pushing for downstream support of
iommufd given the various improvements and features it offers.  No
offense, but QEMU live update might not even be the primary reason that
a downstream ought to be interested in iommufd.  Thanks,

Alex


* Re: [PATCH V1 2/8] vfio/type1: dma owner permission
  2022-12-07 15:28   ` Jason Gunthorpe
@ 2022-12-08 20:13     ` Steven Sistare
  0 siblings, 0 replies; 29+ messages in thread
From: Steven Sistare @ 2022-12-08 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kvm, Alex Williamson, Cornelia Huck

On 12/7/2022 10:28 AM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 01:55:47PM -0800, Steve Sistare wrote:
>> The first task to pin any pages becomes the dma owner, and becomes the only
>> task allowed to pin.  This prevents an application from exceeding the
>> initial task's RLIMIT_MEMLOCK by fork'ing and pinning in children.
> 
> We do not need to play games with the RLIMIT here - RLIMIT is
> inherently insecure and if fork is available then the process can blow
> past the sandbox limit. There is nothing we can do to prevent this in
> the kernel, so don't even try.
> 
> iommufd offers the user based limit tracking which prevents this
> properly.
> 
> And we are working on cgroup based limit tracking that is the best
> option to solve this problem.
> 
> I would rather see us focus on the cgroup stuff than this.

Yes, this is N/A for an iommufd framework that can enforce aggregate
limits across a group of processes.

- Steve

* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
  2022-12-06 23:52   ` Alex Williamson
  2022-12-07 23:20   ` Jason Gunthorpe
@ 2022-12-08 22:18   ` Alex Williamson
  2 siblings, 0 replies; 29+ messages in thread
From: Alex Williamson @ 2022-12-08 22:18 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Jason Gunthorpe, Tian, Kevin

On Tue,  6 Dec 2022 13:55:46 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Delete the interfaces that allow an iova range to be re-mapped in a new
> address space.  They allow userland to indefinitely block vfio mediated
> device kernel threads, and do not propagate the locked_vm count to a
> new mm.
> 
>   - disable the VFIO_UPDATE_VADDR extension
>   - delete VFIO_DMA_UNMAP_FLAG_VADDR
>   - delete most of VFIO_DMA_MAP_FLAG_VADDR (but keep some for use in a
>     new implementation in a subsequent patch).
> 
> Revert most of the code of these commits:
> 
>   441e810 ("vfio: interfaces to update vaddr")
>   c3cbab2 ("vfio/type1: implement interfaces to update vaddr")
>   898b9ea ("vfio/type1: block on invalid vaddr")
> 
> Revert these commits.  They are harmless, but no longer used after the
> above are reverted, and this kind of functionality is better handled by
> adding new methods to vfio_iommu_driver_ops.
> 
>   ec5e329 ("vfio: iommu driver notify callback")
>   487ace1 ("vfio/type1: implement notify callback")
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/container.c        |   5 --
>  drivers/vfio/vfio.h             |   7 --
>  drivers/vfio/vfio_iommu_type1.c | 144 ++--------------------------------------
>  include/uapi/linux/vfio.h       |  17 +----
>  4 files changed, 8 insertions(+), 165 deletions(-)

Picked just this patch and applied to the vfio next branch w/ my
follow-up patch to clean out the remainder of the VADDR code and mark
feature and flags as deprecated.  Added stable cc for both.  Thanks,

Alex


* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2022-12-07 14:26     ` Steven Sistare
  2022-12-07 15:14       ` Alex Williamson
@ 2023-05-17 16:12       ` Jason Gunthorpe
  2023-05-17 17:35         ` Steven Sistare
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2023-05-17 16:12 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Alex Williamson, kvm, Cornelia Huck

On Wed, Dec 07, 2022 at 09:26:33AM -0500, Steven Sistare wrote:
> > This flag should probably be marked reserved.
> > 
> > Should we consider this separately for v6.2?
> 
> Ideally I would like all kernels to support either the old or new vaddr interface.
> If iommufd + vfio compat does not make 6.2, then I prefer not to delete the old
> interface separately.
> 
> > For the remainder, the long term plan is to move to iommufd, so any new
> > feature of type1 would need equivalent support in iommufd.  Thanks,
> 
> Sure.  I will study iommufd and make a proposal.
> 
> Will you review these patches as is to give feedback on the approach?
> 
> If I show that iommufd and the vfio compat layer can support these interfaces,
> are you open to accepting these in v6.2 if iommufd is still a ways off? I see 
> iommufd in qemu-next, but not the compat layer.

What happened to this? I still haven't seen iommufd support for this?

Jason

* Re: [PATCH V1 1/8] vfio: delete interfaces to update vaddr
  2023-05-17 16:12       ` Jason Gunthorpe
@ 2023-05-17 17:35         ` Steven Sistare
  0 siblings, 0 replies; 29+ messages in thread
From: Steven Sistare @ 2023-05-17 17:35 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, kvm, Cornelia Huck

On 5/17/2023 12:12 PM, Jason Gunthorpe wrote:
> On Wed, Dec 07, 2022 at 09:26:33AM -0500, Steven Sistare wrote:
>>> This flag should probably be marked reserved.
>>>
>>> Should we consider this separately for v6.2?
>>
>> Ideally I would like all kernels to support either the old or new vaddr interface.
>> If iommufd + vfio compat does not make 6.2, then I prefer not to delete the old
>> interface separately.
>>
>>> For the remainder, the long term plan is to move to iommufd, so any new
>>> feature of type1 would need equivalent support in iommufd.  Thanks,
>>
>> Sure.  I will study iommufd and make a proposal.
>>
>> Will you review these patches as is to give feedback on the approach?
>>
>> If I show that iommufd and the vfio compat layer can support these interfaces,
>> are you open to accepting these in v6.2 if iommufd is still a ways off? I see 
>> iommufd in qemu-next, but not the compat layer.
> 
> What happened to this? I still haven't seen iommufd support for this?

Hi Jason, other work has kept me busy, but it's on my todo list.
I will get to it soon. 

- Steve 

end of thread, other threads:[~2023-05-17 17:36 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-06 21:55 [PATCH V1 0/8] vfio virtual address update redo Steve Sistare
2022-12-06 21:55 ` [PATCH V1 1/8] vfio: delete interfaces to update vaddr Steve Sistare
2022-12-06 23:52   ` Alex Williamson
2022-12-07 14:26     ` Steven Sistare
2022-12-07 15:14       ` Alex Williamson
2023-05-17 16:12       ` Jason Gunthorpe
2023-05-17 17:35         ` Steven Sistare
2022-12-07 15:19     ` Jason Gunthorpe
2022-12-08 19:09       ` Steven Sistare
2022-12-08 19:44         ` Alex Williamson
2022-12-07 23:20   ` Jason Gunthorpe
2022-12-08 22:18   ` Alex Williamson
2022-12-06 21:55 ` [PATCH V1 2/8] vfio/type1: dma owner permission Steve Sistare
2022-12-07 15:28   ` Jason Gunthorpe
2022-12-08 20:13     ` Steven Sistare
2022-12-06 21:55 ` [PATCH V1 3/8] vfio: close dma owner Steve Sistare
2022-12-06 21:55 ` [PATCH V1 4/8] vfio/type1: " Steve Sistare
2022-12-06 21:55 ` [PATCH V1 5/8] vfio/type1: track locked_vm per dma Steve Sistare
2022-12-06 21:55 ` [PATCH V1 6/8] vfio/type1: update vaddr Steve Sistare
2022-12-06 21:55 ` [PATCH V1 7/8] vfio: change dma owner Steve Sistare
2022-12-07 16:58   ` Jason Gunthorpe
2022-12-08 16:48     ` Steven Sistare
2022-12-08 17:15       ` Jason Gunthorpe
2022-12-08 17:39         ` Steven Sistare
2022-12-08 17:46           ` Jason Gunthorpe
2022-12-06 21:55 ` [PATCH V1 8/8] vfio/type1: " Steve Sistare
2022-12-07 17:00   ` Jason Gunthorpe
2022-12-08  7:22   ` Dan Carpenter
2022-12-07 15:23 ` [PATCH V1 0/8] vfio virtual address update redo Jason Gunthorpe
