* [PATCH V7 0/7] fixes for virtual address update
@ 2022-12-20 20:39 Steve Sistare
2022-12-20 20:39 ` [PATCH V7 1/7] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
Fix bugs in the interfaces that allow the underlying memory object of an
iova range to be mapped in a new address space. The bugs allow userland to
indefinitely block vfio mediated device kernel threads, and fail to
propagate the locked_vm count to a new mm. Also fix a pre-existing bug
that allows locked_vm underflow.
The fixes impose restrictions that eliminate waiting conditions, so
revert the dead code:
commit 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
commit 487ace134053 ("vfio/type1: implement notify callback")
commit ec5e32940cc9 ("vfio: iommu driver notify callback")
Changes in V2 (thanks Alex):
* do not allow group attach while vaddrs are invalid
* add patches to delete dead code
* add WARN_ON for never-should-happen conditions
* check for changed mm in unmap.
* check for vfio_lock_acct failure in remap
Changes in V3 (ditto!):
* return errno at WARN_ON sites, and make it unique
* correctly check for dma task mm change
* change dma owner to current when vaddr is updated
* add Fixes to commit messages
* refactored new code in vfio_dma_do_map
Changes in V4:
* misc cosmetic changes
Changes in V5 (thanks Jason and Kevin):
* grab mm and use it for locked_vm accounting
* separate patches for underflow and restoring locked_vm
* account for reserved pages
* improve error messages
Changes in V6:
* drop "count reserved pages" patch
* add "track locked_vm" patch
* grab current->mm not group_leader->mm
* simplify vfio_change_dma_owner
* fix commit messages
Changes in V7:
* compare current->mm not group_leader->mm (missed one)
* misc cosmetic changes
Steve Sistare (7):
vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
vfio/type1: prevent underflow of locked_vm via exec()
vfio/type1: track locked_vm per dma
vfio/type1: restore locked_vm
vfio/type1: revert "block on invalid vaddr"
vfio/type1: revert "implement notify callback"
vfio: revert "iommu driver notify callback"
drivers/vfio/container.c | 5 -
drivers/vfio/vfio.h | 7 --
drivers/vfio/vfio_iommu_type1.c | 226 ++++++++++++++++++----------------------
include/uapi/linux/vfio.h | 15 +--
4 files changed, 111 insertions(+), 142 deletions(-)
--
1.8.3.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH V7 1/7] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
2022-12-20 20:39 ` [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec() Steve Sistare
` (5 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
Their kernel threads could be blocked indefinitely by a misbehaving
userland if they try to pin/unpin pages while vaddrs are being updated.
Do not allow groups to be added to the container while vaddrs are invalid,
so we never need to block user threads from pinning, and can delete the
vaddr-waiting code in a subsequent patch.
Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/vfio/vfio_iommu_type1.c | 44 +++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/vfio.h | 15 ++++++++------
2 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe..144f5bb 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -861,6 +861,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
mutex_lock(&iommu->lock);
+ if (WARN_ONCE(iommu->vaddr_invalid_count,
+ "vfio_pin_pages not allowed with VFIO_UPDATE_VADDR\n")) {
+ ret = -EBUSY;
+ goto pin_done;
+ }
+
/*
* Wait for all necessary vaddr's to be valid so they can be used in
* the main loop without dropping the lock, to avoid racing vs unmap.
@@ -1343,6 +1349,12 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
mutex_lock(&iommu->lock);
+ /* Cannot update vaddr if mdev is present. */
+ if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups)) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
pgshift = __ffs(iommu->pgsize_bitmap);
pgsize = (size_t)1 << pgshift;
@@ -2185,11 +2197,16 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_domain_geometry *geo;
LIST_HEAD(iova_copy);
LIST_HEAD(group_resv_regions);
- int ret = -EINVAL;
+ int ret = -EBUSY;
mutex_lock(&iommu->lock);
+ /* Attach could require pinning, so disallow while vaddr is invalid. */
+ if (iommu->vaddr_invalid_count)
+ goto out_unlock;
+
/* Check for duplicates */
+ ret = -EINVAL;
if (vfio_iommu_find_iommu_group(iommu, iommu_group))
goto out_unlock;
@@ -2660,6 +2677,16 @@ static int vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
return ret;
}
+static bool vfio_iommu_has_emulated(struct vfio_iommu *iommu)
+{
+ bool ret;
+
+ mutex_lock(&iommu->lock);
+ ret = !list_empty(&iommu->emulated_iommu_groups);
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
unsigned long arg)
{
@@ -2668,8 +2695,13 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
case VFIO_TYPE1v2_IOMMU:
case VFIO_TYPE1_NESTING_IOMMU:
case VFIO_UNMAP_ALL:
- case VFIO_UPDATE_VADDR:
return 1;
+ case VFIO_UPDATE_VADDR:
+ /*
+ * Disable this feature if mdevs are present. They cannot
+ * safely pin/unpin/rw while vaddrs are being updated.
+ */
+ return iommu && !vfio_iommu_has_emulated(iommu);
case VFIO_DMA_CC_IOMMU:
if (!iommu)
return 0;
@@ -3138,6 +3170,13 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
size_t done;
mutex_lock(&iommu->lock);
+
+ if (WARN_ONCE(iommu->vaddr_invalid_count,
+ "vfio_dma_rw not allowed with VFIO_UPDATE_VADDR\n")) {
+ ret = -EBUSY;
+ goto out;
+ }
+
while (count > 0) {
ret = vfio_iommu_type1_dma_rw_chunk(iommu, user_iova, data,
count, write, &done);
@@ -3149,6 +3188,7 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
user_iova += done;
}
+out:
mutex_unlock(&iommu->lock);
return ret;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d7d8e09..4e8d344 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -49,7 +49,11 @@
/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
#define VFIO_UNMAP_ALL 9
-/* Supports the vaddr flag for DMA map and unmap */
+/*
+ * Supports the vaddr flag for DMA map and unmap. Not supported for mediated
+ * devices, so this capability is subject to change as groups are added or
+ * removed.
+ */
#define VFIO_UPDATE_VADDR 10
/*
@@ -1215,8 +1219,7 @@ struct vfio_iommu_type1_info_dma_avail {
* Map process virtual addresses to IO virtual addresses using the
* provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
*
- * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
- * unblock translation of host virtual addresses in the iova range. The vaddr
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova. The vaddr
* must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR. To
* maintain memory consistency within the user application, the updated vaddr
* must address the same memory object as originally mapped. Failure to do so
@@ -1267,9 +1270,9 @@ struct vfio_bitmap {
* must be 0. This cannot be combined with the get-dirty-bitmap flag.
*
* If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
- * virtual addresses in the iova range. Tasks that attempt to translate an
- * iova's vaddr will block. DMA to already-mapped pages continues. This
- * cannot be combined with the get-dirty-bitmap flag.
+ * virtual addresses in the iova range. DMA to already-mapped pages continues.
+ * Groups may not be added to the container while any addresses are invalid.
+ * This cannot be combined with the get-dirty-bitmap flag.
*/
struct vfio_iommu_type1_dma_unmap {
__u32 argsz;
--
1.8.3.1
* [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
2022-12-20 20:39 ` [PATCH V7 1/7] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
2023-01-03 15:20 ` Jason Gunthorpe
2022-12-20 20:39 ` [PATCH V7 3/7] vfio/type1: track locked_vm per dma Steve Sistare
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
When a vfio container is preserved across exec, the task does not change,
but it gets a new mm with locked_vm=0, and loses the count from existing
dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
to a large unsigned value, and a subsequent dma map request fails with
ENOMEM in __account_locked_vm.
To avoid underflow, grab and save the mm at the time a dma is mapped.
Use that mm when adjusting locked_vm, rather than re-acquiring the saved
task's mm, which may have changed. If the saved mm is dead, do nothing.
locked_vm is incremented for existing mappings in a subsequent patch.
Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 144f5bb..71f980b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -100,6 +100,7 @@ struct vfio_dma {
struct task_struct *task;
struct rb_root pfn_list; /* Ex-user pinned pfn list */
unsigned long *bitmap;
+ struct mm_struct *mm;
};
struct vfio_batch {
@@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
if (!npage)
return 0;
- mm = async ? get_task_mm(dma->task) : dma->task->mm;
- if (!mm)
+ mm = dma->mm;
+ if (async && !mmget_not_zero(mm))
return -ESRCH; /* process exited */
ret = mmap_write_lock_killable(mm);
@@ -794,8 +795,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
struct mm_struct *mm;
int ret;
- mm = get_task_mm(dma->task);
- if (!mm)
+ mm = dma->mm;
+ if (!mmget_not_zero(mm))
return -ENODEV;
ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
@@ -1180,6 +1181,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
vfio_unmap_unpin(iommu, dma, true);
vfio_unlink_dma(iommu, dma);
put_task_struct(dma->task);
+ mmdrop(dma->mm);
vfio_dma_bitmap_free(dma);
if (dma->vaddr_invalid) {
iommu->vaddr_invalid_count--;
@@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
* against the locked memory limit and we need to be able to do both
* outside of this call path as pinning can be asynchronous via the
* external interfaces for mdev devices. RLIMIT_MEMLOCK requires a
- * task_struct and VM locked pages requires an mm_struct, however
- * holding an indefinite mm reference is not recommended, therefore we
- * only hold a reference to a task. We could hold a reference to
- * current, however QEMU uses this call path through vCPU threads,
- * which can be killed resulting in a NULL mm and failure in the unmap
- * path when called via a different thread. Avoid this problem by
- * using the group_leader as threads within the same group require
- * both CLONE_THREAD and CLONE_VM and will therefore use the same
- * mm_struct.
+ * task_struct and VM locked pages requires an mm_struct.
*
* Previously we also used the task for testing CAP_IPC_LOCK at the
* time of pinning and accounting, however has_capability() makes use
@@ -1687,6 +1681,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
get_task_struct(current->group_leader);
dma->task = current->group_leader;
dma->lock_cap = capable(CAP_IPC_LOCK);
+ dma->mm = current->mm;
+ mmgrab(dma->mm);
dma->pfn_list = RB_ROOT;
@@ -3122,9 +3118,8 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
!(dma->prot & IOMMU_READ))
return -EPERM;
- mm = get_task_mm(dma->task);
-
- if (!mm)
+ mm = dma->mm;
+ if (!mmget_not_zero(mm))
return -EPERM;
if (kthread)
--
1.8.3.1
* [PATCH V7 3/7] vfio/type1: track locked_vm per dma
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
2022-12-20 20:39 ` [PATCH V7 1/7] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
2022-12-20 20:39 ` [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec() Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
2023-01-03 15:21 ` Jason Gunthorpe
2022-12-20 20:39 ` [PATCH V7 4/7] vfio/type1: restore locked_vm Steve Sistare
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
Track locked_vm per dma struct, and create a new subroutine, both for use
in a subsequent patch. No functional change.
Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/vfio/vfio_iommu_type1.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 71f980b..588d690 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -101,6 +101,7 @@ struct vfio_dma {
struct rb_root pfn_list; /* Ex-user pinned pfn list */
unsigned long *bitmap;
struct mm_struct *mm;
+ long locked_vm;
};
struct vfio_batch {
@@ -413,22 +414,21 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
return ret;
}
-static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
+static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
+ bool lock_cap, long npage, bool async)
{
- struct mm_struct *mm;
int ret;
if (!npage)
return 0;
- mm = dma->mm;
if (async && !mmget_not_zero(mm))
return -ESRCH; /* process exited */
ret = mmap_write_lock_killable(mm);
if (!ret) {
- ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
- dma->lock_cap);
+ ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
+ lock_cap);
mmap_write_unlock(mm);
}
@@ -438,6 +438,16 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
return ret;
}
+static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
+{
+ int ret;
+
+ ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage, async);
+ if (!ret)
+ dma->locked_vm += npage;
+ return ret;
+}
+
/*
* Some mappings aren't backed by a struct page, for example an mmap'd
* MMIO range for our own or another device. These use a different
--
1.8.3.1
* [PATCH V7 4/7] vfio/type1: restore locked_vm
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
` (2 preceding siblings ...)
2022-12-20 20:39 ` [PATCH V7 3/7] vfio/type1: track locked_vm per dma Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
2023-01-03 15:22 ` Jason Gunthorpe
2022-12-20 20:39 ` [PATCH V7 5/7] vfio/type1: revert "block on invalid vaddr" Steve Sistare
` (2 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
When a vfio container is preserved across exec or fork-exec, the new
task's mm has a locked_vm count of 0. After a dma vaddr is updated using
VFIO_DMA_MAP_FLAG_VADDR, locked_vm remains 0, and the pinned memory does
not count against the task's RLIMIT_MEMLOCK.
To restore the correct locked_vm count, when VFIO_DMA_MAP_FLAG_VADDR is
used and the dma's mm has changed, add the dma's locked_vm count to
the new mm->locked_vm, subject to the rlimit.
Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/vfio/vfio_iommu_type1.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 588d690..1036736 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1590,6 +1590,35 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
return list_empty(iova);
}
+static int vfio_change_dma_owner(struct vfio_dma *dma)
+{
+ int ret = 0;
+ struct task_struct *task = current->group_leader;
+ struct mm_struct *mm = current->mm;
+
+ if (mm != dma->mm) {
+ long npage = dma->locked_vm;
+ bool lock_cap = capable(CAP_IPC_LOCK);
+
+ ret = mm_lock_acct(task, mm, lock_cap, npage, false);
+ if (ret)
+ return ret;
+
+ mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
+
+ if (dma->task != task) {
+ put_task_struct(dma->task);
+ dma->task = get_task_struct(task);
+ }
+ mmdrop(dma->mm);
+ dma->mm = mm;
+ mmgrab(dma->mm);
+ dma->lock_cap = lock_cap;
+ }
+
+ return ret;
+}
+
static int vfio_dma_do_map(struct vfio_iommu *iommu,
struct vfio_iommu_type1_dma_map *map)
{
@@ -1639,6 +1668,9 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
dma->size != size) {
ret = -EINVAL;
} else {
+ ret = vfio_change_dma_owner(dma);
+ if (ret)
+ goto out_unlock;
dma->vaddr = vaddr;
dma->vaddr_invalid = false;
iommu->vaddr_invalid_count--;
--
1.8.3.1
* [PATCH V7 5/7] vfio/type1: revert "block on invalid vaddr"
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
` (3 preceding siblings ...)
2022-12-20 20:39 ` [PATCH V7 4/7] vfio/type1: restore locked_vm Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
2022-12-20 20:39 ` [PATCH V7 6/7] vfio/type1: revert "implement notify callback" Steve Sistare
2022-12-20 20:39 ` [PATCH V7 7/7] vfio: revert "iommu driver " Steve Sistare
6 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
Revert this dead code:
commit 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/vfio/vfio_iommu_type1.c | 94 +++--------------------------------------
1 file changed, 5 insertions(+), 89 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1036736..eb2d243 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -72,7 +72,6 @@ struct vfio_iommu {
unsigned int vaddr_invalid_count;
uint64_t pgsize_bitmap;
uint64_t num_non_pinned_groups;
- wait_queue_head_t vaddr_wait;
bool v2;
bool nesting;
bool dirty_page_tracking;
@@ -154,8 +153,6 @@ struct vfio_regions {
#define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX)
#define DIRTY_BITMAP_SIZE_MAX DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
-#define WAITED 1
-
static int put_pfn(unsigned long pfn, int prot);
static struct vfio_iommu_group*
@@ -606,61 +603,6 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
return ret;
}
-static int vfio_wait(struct vfio_iommu *iommu)
-{
- DEFINE_WAIT(wait);
-
- prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
- mutex_unlock(&iommu->lock);
- schedule();
- mutex_lock(&iommu->lock);
- finish_wait(&iommu->vaddr_wait, &wait);
- if (kthread_should_stop() || !iommu->container_open ||
- fatal_signal_pending(current)) {
- return -EFAULT;
- }
- return WAITED;
-}
-
-/*
- * Find dma struct and wait for its vaddr to be valid. iommu lock is dropped
- * if the task waits, but is re-locked on return. Return result in *dma_p.
- * Return 0 on success with no waiting, WAITED on success if waited, and -errno
- * on error.
- */
-static int vfio_find_dma_valid(struct vfio_iommu *iommu, dma_addr_t start,
- size_t size, struct vfio_dma **dma_p)
-{
- int ret = 0;
-
- do {
- *dma_p = vfio_find_dma(iommu, start, size);
- if (!*dma_p)
- return -EINVAL;
- else if (!(*dma_p)->vaddr_invalid)
- return ret;
- else
- ret = vfio_wait(iommu);
- } while (ret == WAITED);
-
- return ret;
-}
-
-/*
- * Wait for all vaddr in the dma_list to become valid. iommu lock is dropped
- * if the task waits, but is re-locked on return. Return 0 on success with no
- * waiting, WAITED on success if waited, and -errno on error.
- */
-static int vfio_wait_all_valid(struct vfio_iommu *iommu)
-{
- int ret = 0;
-
- while (iommu->vaddr_invalid_count && ret >= 0)
- ret = vfio_wait(iommu);
-
- return ret;
-}
-
/*
* Attempt to pin pages. We really don't want to track all the pfns and
* the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -861,7 +803,6 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
unsigned long remote_vaddr;
struct vfio_dma *dma;
bool do_accounting;
- dma_addr_t iova;
if (!iommu || !pages)
return -EINVAL;
@@ -878,22 +819,6 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
goto pin_done;
}
- /*
- * Wait for all necessary vaddr's to be valid so they can be used in
- * the main loop without dropping the lock, to avoid racing vs unmap.
- */
-again:
- if (iommu->vaddr_invalid_count) {
- for (i = 0; i < npage; i++) {
- iova = user_iova + PAGE_SIZE * i;
- ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
- if (ret < 0)
- goto pin_done;
- if (ret == WAITED)
- goto again;
- }
- }
-
/* Fail if no dma_umap notifier is registered */
if (list_empty(&iommu->device_list)) {
ret = -EINVAL;
@@ -909,6 +834,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
for (i = 0; i < npage; i++) {
unsigned long phys_pfn;
+ dma_addr_t iova;
struct vfio_pfn *vpfn;
iova = user_iova + PAGE_SIZE * i;
@@ -1193,10 +1119,8 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
put_task_struct(dma->task);
mmdrop(dma->mm);
vfio_dma_bitmap_free(dma);
- if (dma->vaddr_invalid) {
+ if (dma->vaddr_invalid)
iommu->vaddr_invalid_count--;
- wake_up_all(&iommu->vaddr_wait);
- }
kfree(dma);
iommu->dma_avail++;
}
@@ -1674,7 +1598,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
dma->vaddr = vaddr;
dma->vaddr_invalid = false;
iommu->vaddr_invalid_count--;
- wake_up_all(&iommu->vaddr_wait);
}
goto out_unlock;
} else if (dma) {
@@ -1757,10 +1680,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
int ret;
- ret = vfio_wait_all_valid(iommu);
- if (ret < 0)
- return ret;
-
/* Arbitrarily pick the first domain in the list for lookups */
if (!list_empty(&iommu->domain_list))
d = list_first_entry(&iommu->domain_list,
@@ -2651,7 +2570,6 @@ static void *vfio_iommu_type1_open(unsigned long arg)
mutex_init(&iommu->lock);
mutex_init(&iommu->device_list_lock);
INIT_LIST_HEAD(&iommu->device_list);
- init_waitqueue_head(&iommu->vaddr_wait);
iommu->pgsize_bitmap = PAGE_MASK;
INIT_LIST_HEAD(&iommu->emulated_iommu_groups);
@@ -3148,13 +3066,12 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
struct vfio_dma *dma;
bool kthread = current->mm == NULL;
size_t offset;
- int ret;
*copied = 0;
- ret = vfio_find_dma_valid(iommu, user_iova, 1, &dma);
- if (ret < 0)
- return ret;
+ dma = vfio_find_dma(iommu, user_iova, 1);
+ if (!dma)
+ return -EINVAL;
if ((write && !(dma->prot & IOMMU_WRITE)) ||
!(dma->prot & IOMMU_READ))
@@ -3263,7 +3180,6 @@ static void vfio_iommu_type1_notify(void *iommu_data,
mutex_lock(&iommu->lock);
iommu->container_open = false;
mutex_unlock(&iommu->lock);
- wake_up_all(&iommu->vaddr_wait);
}
static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
--
1.8.3.1
* [PATCH V7 6/7] vfio/type1: revert "implement notify callback"
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
` (4 preceding siblings ...)
2022-12-20 20:39 ` [PATCH V7 5/7] vfio/type1: revert "block on invalid vaddr" Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
2022-12-20 20:39 ` [PATCH V7 7/7] vfio: revert "iommu driver " Steve Sistare
6 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
Revert this dead code:
commit 487ace134053 ("vfio/type1: implement notify callback")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/vfio/vfio_iommu_type1.c | 15 ---------------
1 file changed, 15 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index eb2d243..a009e1b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -75,7 +75,6 @@ struct vfio_iommu {
bool v2;
bool nesting;
bool dirty_page_tracking;
- bool container_open;
struct list_head emulated_iommu_groups;
};
@@ -2566,7 +2565,6 @@ static void *vfio_iommu_type1_open(unsigned long arg)
INIT_LIST_HEAD(&iommu->iova_list);
iommu->dma_list = RB_ROOT;
iommu->dma_avail = dma_entry_limit;
- iommu->container_open = true;
mutex_init(&iommu->lock);
mutex_init(&iommu->device_list_lock);
INIT_LIST_HEAD(&iommu->device_list);
@@ -3170,18 +3168,6 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
return domain;
}
-static void vfio_iommu_type1_notify(void *iommu_data,
- enum vfio_iommu_notify_type event)
-{
- struct vfio_iommu *iommu = iommu_data;
-
- if (event != VFIO_IOMMU_CONTAINER_CLOSE)
- return;
- mutex_lock(&iommu->lock);
- iommu->container_open = false;
- mutex_unlock(&iommu->lock);
-}
-
static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.name = "vfio-iommu-type1",
.owner = THIS_MODULE,
@@ -3196,7 +3182,6 @@ static void vfio_iommu_type1_notify(void *iommu_data,
.unregister_device = vfio_iommu_type1_unregister_device,
.dma_rw = vfio_iommu_type1_dma_rw,
.group_iommu_domain = vfio_iommu_type1_group_iommu_domain,
- .notify = vfio_iommu_type1_notify,
};
static int __init vfio_iommu_type1_init(void)
--
1.8.3.1
* [PATCH V7 7/7] vfio: revert "iommu driver notify callback"
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
` (5 preceding siblings ...)
2022-12-20 20:39 ` [PATCH V7 6/7] vfio/type1: revert "implement notify callback" Steve Sistare
@ 2022-12-20 20:39 ` Steve Sistare
6 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2022-12-20 20:39 UTC (permalink / raw)
To: kvm
Cc: Alex Williamson, Cornelia Huck, Jason Gunthorpe, Kevin Tian,
Steve Sistare
Revert this dead code:
commit ec5e32940cc9 ("vfio: iommu driver notify callback")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/vfio/container.c | 5 -----
drivers/vfio/vfio.h | 7 -------
2 files changed, 12 deletions(-)
diff --git a/drivers/vfio/container.c b/drivers/vfio/container.c
index d74164a..5bfd10d 100644
--- a/drivers/vfio/container.c
+++ b/drivers/vfio/container.c
@@ -382,11 +382,6 @@ static int vfio_fops_open(struct inode *inode, struct file *filep)
static int vfio_fops_release(struct inode *inode, struct file *filep)
{
struct vfio_container *container = filep->private_data;
- struct vfio_iommu_driver *driver = container->iommu_driver;
-
- if (driver && driver->ops->notify)
- driver->ops->notify(container->iommu_data,
- VFIO_IOMMU_CONTAINER_CLOSE);
filep->private_data = NULL;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index bcad54b..8a439c6 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -62,11 +62,6 @@ struct vfio_group {
struct blocking_notifier_head notifier;
};
-/* events for the backend driver notify callback */
-enum vfio_iommu_notify_type {
- VFIO_IOMMU_CONTAINER_CLOSE = 0,
-};
-
/**
* struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
*/
@@ -97,8 +92,6 @@ struct vfio_iommu_driver_ops {
void *data, size_t count, bool write);
struct iommu_domain *(*group_iommu_domain)(void *iommu_data,
struct iommu_group *group);
- void (*notify)(void *iommu_data,
- enum vfio_iommu_notify_type event);
};
struct vfio_iommu_driver {
--
1.8.3.1
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2022-12-20 20:39 ` [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec() Steve Sistare
@ 2023-01-03 15:20 ` Jason Gunthorpe
2023-01-03 18:12 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-03 15:20 UTC (permalink / raw)
To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
> When a vfio container is preserved across exec, the task does not change,
> but it gets a new mm with locked_vm=0, and loses the count from existing
> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
> to a large unsigned value, and a subsequent dma map request fails with
> ENOMEM in __account_locked_vm.
>
> To avoid underflow, grab and save the mm at the time a dma is mapped.
> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> task's mm, which may have changed. If the saved mm is dead, do nothing.
>
> locked_vm is incremented for existing mappings in a subsequent patch.
>
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
> 1 file changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 144f5bb..71f980b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -100,6 +100,7 @@ struct vfio_dma {
> struct task_struct *task;
> struct rb_root pfn_list; /* Ex-user pinned pfn list */
> unsigned long *bitmap;
> + struct mm_struct *mm;
> };
>
> struct vfio_batch {
> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> if (!npage)
> return 0;
>
> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
> - if (!mm)
> + mm = dma->mm;
> + if (async && !mmget_not_zero(mm))
> return -ESRCH; /* process exited */
Just delete the async; the lock_acct always acts on the dma, which
always has a singular mm.
Fix the few callers that need it to do the mmget_not_zero() before
calling in.
> @@ -794,8 +795,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> struct mm_struct *mm;
> int ret;
>
> - mm = get_task_mm(dma->task);
> - if (!mm)
> + mm = dma->mm;
> + if (!mmget_not_zero(mm))
> return -ENODEV;
e.g. things get all confused here, where we have the mmget_not_zero()
but then go ahead and call vfio_lock_acct() with async == true.
Jason
* Re: [PATCH V7 3/7] vfio/type1: track locked_vm per dma
2022-12-20 20:39 ` [PATCH V7 3/7] vfio/type1: track locked_vm per dma Steve Sistare
@ 2023-01-03 15:21 ` Jason Gunthorpe
2023-01-03 18:13 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-03 15:21 UTC (permalink / raw)
To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On Tue, Dec 20, 2022 at 12:39:21PM -0800, Steve Sistare wrote:
> Track locked_vm per dma struct, and create a new subroutine, both for use
> in a subsequent patch. No functional change.
>
> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
> Cc: stable@vger.kernel.org
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 20 +++++++++++++++-----
> 1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 71f980b..588d690 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -101,6 +101,7 @@ struct vfio_dma {
> struct rb_root pfn_list; /* Ex-user pinned pfn list */
> unsigned long *bitmap;
> struct mm_struct *mm;
> + long locked_vm;
Why is it long? Can it be negative?
> };
>
> struct vfio_batch {
> @@ -413,22 +414,21 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> return ret;
> }
>
> -static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> +static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
> + bool lock_cap, long npage, bool async)
> {
Now async is even more confusing, the caller really should have a
valid handle on the mm before using it as an argument like this.
Jason
* Re: [PATCH V7 4/7] vfio/type1: restore locked_vm
2022-12-20 20:39 ` [PATCH V7 4/7] vfio/type1: restore locked_vm Steve Sistare
@ 2023-01-03 15:22 ` Jason Gunthorpe
2023-01-03 18:12 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-03 15:22 UTC (permalink / raw)
To: Steve Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On Tue, Dec 20, 2022 at 12:39:22PM -0800, Steve Sistare wrote:
> When a vfio container is preserved across exec or fork-exec, the new
> task's mm has a locked_vm count of 0. After a dma vaddr is updated using
> VFIO_DMA_MAP_FLAG_VADDR, locked_vm remains 0, and the pinned memory does
> not count against the task's RLIMIT_MEMLOCK.
>
> To restore the correct locked_vm count, when VFIO_DMA_MAP_FLAG_VADDR is
> used and the dma's mm has changed, add the dma's locked_vm count to
> the new mm->locked_vm, subject to the rlimit.
>
> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
> Cc: stable@vger.kernel.org
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
But you should subtract it from the old one as well?
Jason
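The fix both messages converge on is a transfer: when VFIO_DMA_MAP_FLAG_VADDR hands the dma to a new mm, charge the new mm (subject to the rlimit) and uncharge the old one. A hypothetical userspace sketch of just that accounting step (toy_* names assumed; not the driver's actual vfio_change_dma_owner()):

```c
#include <assert.h>

/* Toy model of per-mm locked_vm accounting. */
struct toy_mm {
	unsigned long locked_vm;
};

/* Move a dma's locked_vm charge from the old mm to the new one. */
static int toy_change_owner(struct toy_mm *old_mm, struct toy_mm *new_mm,
			    unsigned long npage, unsigned long rlimit_pages)
{
	if (new_mm->locked_vm + npage > rlimit_pages)
		return -1;		/* -ENOMEM: over RLIMIT_MEMLOCK */
	new_mm->locked_vm += npage;	/* charge the new owner */
	old_mm->locked_vm -= npage;	/* and uncharge the old one */
	return 0;
}
```

Without the subtract, the old mm would keep a stale charge (or underflow if it unmaps later), which is the point of Jason's question; Steve confirms below that the implementation does both sides.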
* Re: [PATCH V7 4/7] vfio/type1: restore locked_vm
2023-01-03 15:22 ` Jason Gunthorpe
@ 2023-01-03 18:12 ` Steven Sistare
0 siblings, 0 replies; 22+ messages in thread
From: Steven Sistare @ 2023-01-03 18:12 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On 1/3/2023 10:22 AM, Jason Gunthorpe wrote:
> On Tue, Dec 20, 2022 at 12:39:22PM -0800, Steve Sistare wrote:
>> When a vfio container is preserved across exec or fork-exec, the new
>> task's mm has a locked_vm count of 0. After a dma vaddr is updated using
>> VFIO_DMA_MAP_FLAG_VADDR, locked_vm remains 0, and the pinned memory does
>> not count against the task's RLIMIT_MEMLOCK.
>>
>> To restore the correct locked_vm count, when VFIO_DMA_MAP_FLAG_VADDR is
>> used and the dma's mm has changed, add the dma's locked_vm count to
>> the new mm->locked_vm, subject to the rlimit.
>>
>> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 32 ++++++++++++++++++++++++++++++++
>> 1 file changed, 32 insertions(+)
>
> But you should subtract it from the old one as well?
Yes, as implemented. I'll add that to the commit message.
- Steve
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-03 15:20 ` Jason Gunthorpe
@ 2023-01-03 18:12 ` Steven Sistare
2023-01-03 19:20 ` Jason Gunthorpe
0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2023-01-03 18:12 UTC (permalink / raw)
To: Jason Gunthorpe, Alex Williamson; +Cc: kvm, Cornelia Huck, Kevin Tian
On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>> When a vfio container is preserved across exec, the task does not change,
>> but it gets a new mm with locked_vm=0, and loses the count from existing
>> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
>> to a large unsigned value, and a subsequent dma map request fails with
>> ENOMEM in __account_locked_vm.
>>
>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>> task's mm, which may have changed. If the saved mm is dead, do nothing.
>>
>> locked_vm is incremented for existing mappings in a subsequent patch.
>>
>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>> 1 file changed, 11 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 144f5bb..71f980b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -100,6 +100,7 @@ struct vfio_dma {
>> struct task_struct *task;
>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
>> unsigned long *bitmap;
>> + struct mm_struct *mm;
>> };
>>
>> struct vfio_batch {
>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>> if (!npage)
>> return 0;
>>
>> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
>> - if (!mm)
>> + mm = dma->mm;
>> + if (async && !mmget_not_zero(mm))
>> return -ESRCH; /* process exited */
>
> Just delete the async, the lock_acct always acts on the dma which
> always has a singular mm.
>
> FIx the few callers that need it to do the mmget_no_zero() before
> calling in.
Most of the callers pass async=true:
ret = vfio_lock_acct(dma, lock_acct, false);
vfio_lock_acct(dma, locked - unlocked, true);
ret = vfio_lock_acct(dma, 1, true);
vfio_lock_acct(dma, -unlocked, true);
vfio_lock_acct(dma, -1, true);
vfio_lock_acct(dma, -unlocked, true);
ret = mm_lock_acct(task, mm, lock_cap, npage, false);
mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
vfio_lock_acct(dma, locked - unlocked, true);
Hoisting mmget_not_zero and mmput will make many call sites messier.
Alex, what is your opinion?
>> @@ -794,8 +795,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>> struct mm_struct *mm;
>> int ret;
>>
>> - mm = get_task_mm(dma->task);
>> - if (!mm)
>> + mm = dma->mm;
>> + if (!mmget_not_zero(mm))
>> return -ENODEV;
>
> eg things get all confused here where we have the mmget_not_zero
>
> But then we go ahead and call vfio_lock_acct() with true
Yes, I should pass false at this call site to optimize.
- Steve
* Re: [PATCH V7 3/7] vfio/type1: track locked_vm per dma
2023-01-03 15:21 ` Jason Gunthorpe
@ 2023-01-03 18:13 ` Steven Sistare
2023-01-09 21:24 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2023-01-03 18:13 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On 1/3/2023 10:21 AM, Jason Gunthorpe wrote:
> On Tue, Dec 20, 2022 at 12:39:21PM -0800, Steve Sistare wrote:
>> Track locked_vm per dma struct, and create a new subroutine, both for use
>> in a subsequent patch. No functional change.
>>
>> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 20 +++++++++++++++-----
>> 1 file changed, 15 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 71f980b..588d690 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -101,6 +101,7 @@ struct vfio_dma {
>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
>> unsigned long *bitmap;
>> struct mm_struct *mm;
>> + long locked_vm;
>
> Why is it long? Can it be negative?
The existing code uses both long and uint64_t for page counts, and I picked one.
I'll use size_t instead to match vfio_dma size.
>> };
>>
>> struct vfio_batch {
>> @@ -413,22 +414,21 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>> return ret;
>> }
>>
>> -static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>> +static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
>> + bool lock_cap, long npage, bool async)
>> {
>
> Now async is even more confusing, the caller really should have a
> valid handle on the mm before using it as an argument like this.
The caller holds a grab reference on mm, and mm_lock_acct does mmget_not_zero to
validate the mm. IMO this is a close analog of the original vfio_lock_acct code
where the caller holds a get reference on task, and does get_task_mm to validate
the mm.
However, I can hoist the mmget_not_zero from mm_lock_acct to its callsites in
vfio_lock_acct and vfio_change_dma_owner.
- Steve
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-03 18:12 ` Steven Sistare
@ 2023-01-03 19:20 ` Jason Gunthorpe
2023-01-06 15:14 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-03 19:20 UTC (permalink / raw)
To: Steven Sistare; +Cc: Alex Williamson, kvm, Cornelia Huck, Kevin Tian
On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
> > On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
> >> When a vfio container is preserved across exec, the task does not change,
> >> but it gets a new mm with locked_vm=0, and loses the count from existing
> >> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
> >> to a large unsigned value, and a subsequent dma map request fails with
> >> ENOMEM in __account_locked_vm.
> >>
> >> To avoid underflow, grab and save the mm at the time a dma is mapped.
> >> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> >> task's mm, which may have changed. If the saved mm is dead, do nothing.
> >>
> >> locked_vm is incremented for existing mappings in a subsequent patch.
> >>
> >> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> >> ---
> >> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
> >> 1 file changed, 11 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 144f5bb..71f980b 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -100,6 +100,7 @@ struct vfio_dma {
> >> struct task_struct *task;
> >> struct rb_root pfn_list; /* Ex-user pinned pfn list */
> >> unsigned long *bitmap;
> >> + struct mm_struct *mm;
> >> };
> >>
> >> struct vfio_batch {
> >> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >> if (!npage)
> >> return 0;
> >>
> >> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
> >> - if (!mm)
> >> + mm = dma->mm;
> >> + if (async && !mmget_not_zero(mm))
> >> return -ESRCH; /* process exited */
> >
> > Just delete the async, the lock_acct always acts on the dma which
> > always has a singular mm.
> >
> > FIx the few callers that need it to do the mmget_no_zero() before
> > calling in.
>
> Most of the callers pass async=true:
> ret = vfio_lock_acct(dma, lock_acct, false);
> vfio_lock_acct(dma, locked - unlocked, true);
> ret = vfio_lock_acct(dma, 1, true);
> vfio_lock_acct(dma, -unlocked, true);
> vfio_lock_acct(dma, -1, true);
> vfio_lock_acct(dma, -unlocked, true);
> ret = mm_lock_acct(task, mm, lock_cap, npage, false);
> mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
> vfio_lock_acct(dma, locked - unlocked, true);
Seems like if you make a lock_sub_acct() function that does the -1*
and does the mmget it will be OK?
Jason
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-03 19:20 ` Jason Gunthorpe
@ 2023-01-06 15:14 ` Steven Sistare
2023-01-09 13:52 ` Jason Gunthorpe
0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2023-01-06 15:14 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Alex Williamson, kvm, Cornelia Huck, Kevin Tian
On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>> When a vfio container is preserved across exec, the task does not change,
>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>> ENOMEM in __account_locked_vm.
>>>>
>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>> task's mm, which may have changed. If the saved mm is dead, do nothing.
>>>>
>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>
>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>> ---
>>>> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>> 1 file changed, 11 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 144f5bb..71f980b 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>> struct task_struct *task;
>>>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
>>>> unsigned long *bitmap;
>>>> + struct mm_struct *mm;
>>>> };
>>>>
>>>> struct vfio_batch {
>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>> if (!npage)
>>>> return 0;
>>>>
>>>> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>> - if (!mm)
>>>> + mm = dma->mm;
>>>> + if (async && !mmget_not_zero(mm))
>>>> return -ESRCH; /* process exited */
>>>
>>> Just delete the async, the lock_acct always acts on the dma which
>>> always has a singular mm.
>>>
>>> FIx the few callers that need it to do the mmget_no_zero() before
>>> calling in.
>>
>> Most of the callers pass async=true:
>> ret = vfio_lock_acct(dma, lock_acct, false);
>> vfio_lock_acct(dma, locked - unlocked, true);
>> ret = vfio_lock_acct(dma, 1, true);
>> vfio_lock_acct(dma, -unlocked, true);
>> vfio_lock_acct(dma, -1, true);
>> vfio_lock_acct(dma, -unlocked, true);
>> ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>> mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>> vfio_lock_acct(dma, locked - unlocked, true);
>
> Seems like if you make a lock_sub_acct() function that does the -1*
> and does the mmget it will be OK?
Do you mean, provide two versions of vfio_lock_acct? Simplified:
vfio_lock_acct()
{
mm_lock_acct()
dma->locked_vm += npage;
}
vfio_lock_acct_async()
{
mmget_not_zero(dma->mm)
mm_lock_acct()
dma->locked_vm += npage;
mmput(dma->mm);
}
If so, I will factor this into a separate patch, since it is cosmetic, so stable
can omit it.
- Steve
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-06 15:14 ` Steven Sistare
@ 2023-01-09 13:52 ` Jason Gunthorpe
2023-01-09 15:54 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-09 13:52 UTC (permalink / raw)
To: Steven Sistare; +Cc: Alex Williamson, kvm, Cornelia Huck, Kevin Tian
On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
> > On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
> >> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
> >>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
> >>>> When a vfio container is preserved across exec, the task does not change,
> >>>> but it gets a new mm with locked_vm=0, and loses the count from existing
> >>>> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
> >>>> to a large unsigned value, and a subsequent dma map request fails with
> >>>> ENOMEM in __account_locked_vm.
> >>>>
> >>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
> >>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> >>>> task's mm, which may have changed. If the saved mm is dead, do nothing.
> >>>>
> >>>> locked_vm is incremented for existing mappings in a subsequent patch.
> >>>>
> >>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> >>>> Cc: stable@vger.kernel.org
> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> >>>> ---
> >>>> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
> >>>> 1 file changed, 11 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>>> index 144f5bb..71f980b 100644
> >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>> @@ -100,6 +100,7 @@ struct vfio_dma {
> >>>> struct task_struct *task;
> >>>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
> >>>> unsigned long *bitmap;
> >>>> + struct mm_struct *mm;
> >>>> };
> >>>>
> >>>> struct vfio_batch {
> >>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>>> if (!npage)
> >>>> return 0;
> >>>>
> >>>> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
> >>>> - if (!mm)
> >>>> + mm = dma->mm;
> >>>> + if (async && !mmget_not_zero(mm))
> >>>> return -ESRCH; /* process exited */
> >>>
> >>> Just delete the async, the lock_acct always acts on the dma which
> >>> always has a singular mm.
> >>>
> >>> FIx the few callers that need it to do the mmget_no_zero() before
> >>> calling in.
> >>
> >> Most of the callers pass async=true:
> >> ret = vfio_lock_acct(dma, lock_acct, false);
> >> vfio_lock_acct(dma, locked - unlocked, true);
> >> ret = vfio_lock_acct(dma, 1, true);
> >> vfio_lock_acct(dma, -unlocked, true);
> >> vfio_lock_acct(dma, -1, true);
> >> vfio_lock_acct(dma, -unlocked, true);
> >> ret = mm_lock_acct(task, mm, lock_cap, npage, false);
> >> mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
> >> vfio_lock_acct(dma, locked - unlocked, true);
> >
> > Seems like if you make a lock_sub_acct() function that does the -1*
> > and does the mmget it will be OK?
>
> Do you mean, provide two versions of vfio_lock_acct? Simplified:
>
> vfio_lock_acct()
> {
> mm_lock_acct()
> dma->locked_vm += npage;
> }
>
> vfio_lock_acct_async()
> {
> mmget_not_zero(dma->mm)
>
> mm_lock_acct()
> dma->locked_vm += npage;
>
> mmput(dma->mm);
> }
I was thinking more like
vfio_lock_acct_subtract()
mmget_not_zero(dma->mm)
mm->locked_vm -= npage
Jason
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-09 13:52 ` Jason Gunthorpe
@ 2023-01-09 15:54 ` Steven Sistare
2023-01-09 21:16 ` Steven Sistare
0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2023-01-09 15:54 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Alex Williamson, kvm, Cornelia Huck, Kevin Tian
On 1/9/2023 8:52 AM, Jason Gunthorpe wrote:
> On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
>> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
>>> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>>>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>>>> When a vfio container is preserved across exec, the task does not change,
>>>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>>>> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
>>>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>>>> ENOMEM in __account_locked_vm.
>>>>>>
>>>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>>>> task's mm, which may have changed. If the saved mm is dead, do nothing.
>>>>>>
>>>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>>>
>>>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>>>> ---
>>>>>> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>>>> 1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>> index 144f5bb..71f980b 100644
>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>> struct task_struct *task;
>>>>>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
>>>>>> unsigned long *bitmap;
>>>>>> + struct mm_struct *mm;
>>>>>> };
>>>>>>
>>>>>> struct vfio_batch {
>>>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>> if (!npage)
>>>>>> return 0;
>>>>>>
>>>>>> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>>>> - if (!mm)
>>>>>> + mm = dma->mm;
>>>>>> + if (async && !mmget_not_zero(mm))
>>>>>> return -ESRCH; /* process exited */
>>>>>
>>>>> Just delete the async, the lock_acct always acts on the dma which
>>>>> always has a singular mm.
>>>>>
>>>>> FIx the few callers that need it to do the mmget_no_zero() before
>>>>> calling in.
>>>>
>>>> Most of the callers pass async=true:
>>>> ret = vfio_lock_acct(dma, lock_acct, false);
>>>> vfio_lock_acct(dma, locked - unlocked, true);
>>>> ret = vfio_lock_acct(dma, 1, true);
>>>> vfio_lock_acct(dma, -unlocked, true);
>>>> vfio_lock_acct(dma, -1, true);
>>>> vfio_lock_acct(dma, -unlocked, true);
>>>> ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>>>> mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>>>> vfio_lock_acct(dma, locked - unlocked, true);
>>>
>>> Seems like if you make a lock_sub_acct() function that does the -1*
>>> and does the mmget it will be OK?
>>
>> Do you mean, provide two versions of vfio_lock_acct? Simplified:
>>
>> vfio_lock_acct()
>> {
>> mm_lock_acct()
>> dma->locked_vm += npage;
>> }
>>
>> vfio_lock_acct_async()
>> {
>> mmget_not_zero(dma->mm)
>>
>> mm_lock_acct()
>> dma->locked_vm += npage;
>>
>> mmput(dma->mm);
>> }
>
> I was thinking more like
>
> vfio_lock_acct_subtract()
> mmget_not_zero(dma->mm)
> mm->locked_vm -= npage
^^^^^^
Is this shorthand for open coding __account_locked_vm? If so, we are
essentially saying the same thing. My function vfio_lock_acct_async calls
mm_lock_acct which calls __account_locked_vm.
But, your vfio_lock_acct_subtract does not call mmput, so maybe I still don't
grok your suggestion.
FWIW here are my functions with all error checking:
static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
bool lock_cap, long npage)
{
int ret = mmap_write_lock_killable(mm);
if (!ret) {
ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
lock_cap);
mmap_write_unlock(mm);
}
return ret;
}
static int vfio_lock_acct(struct vfio_dma *dma, long npage)
{
int ret;
if (!npage)
return 0;
ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
if (!ret)
dma->locked_vm += npage;
return ret;
}
static int vfio_lock_acct_async(struct vfio_dma *dma, long npage)
{
int ret;
if (!npage)
return 0;
if (!mmget_not_zero(dma->mm))
return -ESRCH; /* process exited */
ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
if (!ret)
dma->locked_vm += npage;
mmput(dma->mm);
return ret;
}
- Steve
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-09 15:54 ` Steven Sistare
@ 2023-01-09 21:16 ` Steven Sistare
2023-01-10 15:02 ` Jason Gunthorpe
0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2023-01-09 21:16 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Alex Williamson, kvm, Cornelia Huck, Kevin Tian
On 1/9/2023 10:54 AM, Steven Sistare wrote:
> On 1/9/2023 8:52 AM, Jason Gunthorpe wrote:
>> On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
>>> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
>>>> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>>>>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>>>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>>>>> When a vfio container is preserved across exec, the task does not change,
>>>>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>>>>> dma mappings. If the user later unmaps a dma mapping, locked_vm underflows
>>>>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>>>>> ENOMEM in __account_locked_vm.
>>>>>>>
>>>>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>>>>> task's mm, which may have changed. If the saved mm is dead, do nothing.
>>>>>>>
>>>>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>>>>
>>>>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>>>>> ---
>>>>>>> drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>>>>> 1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>>> index 144f5bb..71f980b 100644
>>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>>> struct task_struct *task;
>>>>>>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
>>>>>>> unsigned long *bitmap;
>>>>>>> + struct mm_struct *mm;
>>>>>>> };
>>>>>>>
>>>>>>> struct vfio_batch {
>>>>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>>> if (!npage)
>>>>>>> return 0;
>>>>>>>
>>>>>>> - mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>>>>> - if (!mm)
>>>>>>> + mm = dma->mm;
>>>>>>> + if (async && !mmget_not_zero(mm))
>>>>>>> return -ESRCH; /* process exited */
>>>>>>
>>>>>> Just delete the async, the lock_acct always acts on the dma which
>>>>>> always has a singular mm.
>>>>>>
>>>>>> FIx the few callers that need it to do the mmget_no_zero() before
>>>>>> calling in.
>>>>>
>>>>> Most of the callers pass async=true:
>>>>> ret = vfio_lock_acct(dma, lock_acct, false);
>>>>> vfio_lock_acct(dma, locked - unlocked, true);
>>>>> ret = vfio_lock_acct(dma, 1, true);
>>>>> vfio_lock_acct(dma, -unlocked, true);
>>>>> vfio_lock_acct(dma, -1, true);
>>>>> vfio_lock_acct(dma, -unlocked, true);
>>>>> ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>>>>> mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>>>>> vfio_lock_acct(dma, locked - unlocked, true);
>>>>
>>>> Seems like if you make a lock_sub_acct() function that does the -1*
>>>> and does the mmget it will be OK?
>>>
>>> Do you mean, provide two versions of vfio_lock_acct? Simplified:
>>>
>>> vfio_lock_acct()
>>> {
>>> mm_lock_acct()
>>> dma->locked_vm += npage;
>>> }
>>>
>>> vfio_lock_acct_async()
>>> {
>>> mmget_not_zero(dma->mm)
>>>
>>> mm_lock_acct()
>>> dma->locked_vm += npage;
>>>
>>> mmput(dma->mm);
>>> }
>>
>> I was thinking more like
>>
>> vfio_lock_acct_subtract()
>> mmget_not_zero(dma->mm)
>> mm->locked_vm -= npage
> ^^^^^^
> Is this shorthand for open coding __account_locked_vm? If so, we are
> essentially saying the same thing. My function vfio_lock_acct_async calls
> mm_lock_acct which calls __account_locked_vm.
>
> But, your vfio_lock_acct_subtract does not call mmput, so maybe I still don't
> grok your suggestion.
>
> FWIW here are my functions with all error checking:
>
> static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
> bool lock_cap, long npage)
> {
> int ret = mmap_write_lock_killable(mm);
>
> if (!ret) {
> ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
> lock_cap);
> mmap_write_unlock(mm);
> }
>
> return ret;
> }
>
> static int vfio_lock_acct(struct vfio_dma *dma, long npage)
> {
> int ret;
>
> if (!npage)
> return 0;
>
> ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
> if (!ret)
> dma->locked_vm += npage;
>
> return ret;
> }
>
> static int vfio_lock_acct_async(struct vfio_dma *dma, long npage)
> {
> int ret;
>
> if (!npage)
> return 0;
>
> if (!mmget_not_zero(dma->mm))
> return -ESRCH; /* process exited */
>
> ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
> if (!ret)
> dma->locked_vm += npage;
>
> mmput(dma->mm);
>
> return ret;
> }
Let's leave the async arg and vfio_lock_acct as is. We are over-polishing a small
style issue, in pre-existing code, in a soon-to-be dead-end code base.
- Steve
* Re: [PATCH V7 3/7] vfio/type1: track locked_vm per dma
2023-01-03 18:13 ` Steven Sistare
@ 2023-01-09 21:24 ` Steven Sistare
2023-01-10 0:31 ` Jason Gunthorpe
0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2023-01-09 21:24 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On 1/3/2023 1:13 PM, Steven Sistare wrote:
> On 1/3/2023 10:21 AM, Jason Gunthorpe wrote:
>> On Tue, Dec 20, 2022 at 12:39:21PM -0800, Steve Sistare wrote:
>>> Track locked_vm per dma struct, and create a new subroutine, both for use
>>> in a subsequent patch. No functional change.
>>>
>>> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>> ---
>>> drivers/vfio/vfio_iommu_type1.c | 20 +++++++++++++++-----
>>> 1 file changed, 15 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 71f980b..588d690 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -101,6 +101,7 @@ struct vfio_dma {
>>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
>>> unsigned long *bitmap;
>>> struct mm_struct *mm;
>>> + long locked_vm;
>>
>> Why is it long? Can it be negative?
>
> The existing code uses both long and uint64_t for page counts, and I picked one.
> I'll use size_t instead to match vfio_dma size.
>
>>> };
>>>
>>> struct vfio_batch {
>>> @@ -413,22 +414,21 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>>> return ret;
>>> }
>>>
>>> -static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>> +static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
>>> + bool lock_cap, long npage, bool async)
>>> {
>>
>> Now async is even more confusing, the caller really should have a
>> valid handle on the mm before using it as an argument like this.
>
> The caller holds a grab reference on mm, and mm_lock_acct does mmget_not_zero to
> validate the mm. IMO this is a close analog of the original vfio_lock_acct code
> where the caller holds a get reference on task, and does get_task_mm to validate
> the mm.
>
> However, I can hoist the mmget_not_zero from mm_lock_acct to its callsites in
> vfio_lock_acct and vfio_change_dma_owner.
Yielding:
static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
                        bool lock_cap, long npage)
{
        int ret = mmap_write_lock_killable(mm);

        if (!ret) {
                ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
                                          lock_cap);
                mmap_write_unlock(mm);
        }

        return ret;
}

static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
{
        struct mm_struct *mm = dma->mm;
        int ret;

        if (!npage)
                return 0;

        if (async && !mmget_not_zero(mm))
                return -ESRCH; /* process exited */

        ret = mm_lock_acct(dma->task, mm, dma->lock_cap, npage);
        if (!ret)
                dma->locked_vm += npage;

        if (async)
                mmput(mm);

        return ret;
}

static int vfio_change_dma_owner(struct vfio_dma *dma)
{
        ...
        ret = mm_lock_acct(task, mm, lock_cap, npage);
        if (ret)
                return ret;

        if (mmget_not_zero(dma->mm)) {
                mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage);
                mmput(dma->mm);
        }
        ...
- Steve
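[Editor's note: the grab-vs-get distinction Steve relies on above (mmgrab() pins the struct mm_struct itself via mm_count, while mmget_not_zero() succeeds only while mm_users is nonzero, i.e. while the address space is still live) can be illustrated with a toy model. This is plain userspace C, not kernel code; all sim_* names are invented for the sketch:]

```c
#include <stdbool.h>

/* Toy model of the two mm refcounts: mm_users gates use of the address
 * space; mm_count keeps the struct itself from being freed. */
struct sim_mm {
        int mm_users;   /* live users of the address space */
        int mm_count;   /* pins on the structure itself */
};

/* Like mmgrab()/mmdrop(): pin/unpin the structure only. */
static void sim_mmgrab(struct sim_mm *mm) { mm->mm_count++; }
static void sim_mmdrop(struct sim_mm *mm) { mm->mm_count--; }

/* Like mmget_not_zero(): succeed only while the address space is live. */
static bool sim_mmget_not_zero(struct sim_mm *mm)
{
        if (mm->mm_users == 0)
                return false;   /* process exited; struct may still be pinned */
        mm->mm_users++;
        return true;
}

static void sim_mmput(struct sim_mm *mm)
{
        mm->mm_users--;
        /* at zero, the kernel tears down the address space here */
}
```

In this model the grab reference held by the vfio_dma keeps the struct valid to pass around, but every accounting path must still take (and check) a get reference before touching the address space — which is exactly the `-ESRCH /* process exited */` case in the code above.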
* Re: [PATCH V7 3/7] vfio/type1: track locked_vm per dma
2023-01-09 21:24 ` Steven Sistare
@ 2023-01-10 0:31 ` Jason Gunthorpe
0 siblings, 0 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 0:31 UTC (permalink / raw)
To: Steven Sistare; +Cc: kvm, Alex Williamson, Cornelia Huck, Kevin Tian
On Mon, Jan 09, 2023 at 04:24:03PM -0500, Steven Sistare wrote:
> On 1/3/2023 1:13 PM, Steven Sistare wrote:
> > On 1/3/2023 10:21 AM, Jason Gunthorpe wrote:
> >> On Tue, Dec 20, 2022 at 12:39:21PM -0800, Steve Sistare wrote:
> >>> Track locked_vm per dma struct, and create a new subroutine, both for use
> >>> in a subsequent patch. No functional change.
> >>>
> >>> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
> >>> Cc: stable@vger.kernel.org
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> >>> ---
> >>> drivers/vfio/vfio_iommu_type1.c | 20 +++++++++++++++-----
> >>> 1 file changed, 15 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index 71f980b..588d690 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -101,6 +101,7 @@ struct vfio_dma {
> >>> struct rb_root pfn_list; /* Ex-user pinned pfn list */
> >>> unsigned long *bitmap;
> >>> struct mm_struct *mm;
> >>> + long locked_vm;
> >>
> >> Why is it long? Can it be negative?
> >
> > The existing code uses both long and uint64_t for page counts, and I picked one.
> > I'll use size_t instead to match vfio_dma size.
> >
> >>> };
> >>>
> >>> struct vfio_batch {
> >>> @@ -413,22 +414,21 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >>> return ret;
> >>> }
> >>>
> >>> -static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>> +static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
> >>> + bool lock_cap, long npage, bool async)
> >>> {
> >>
> >> Now async is even more confusing, the caller really should have a
> >> valid handle on the mm before using it as an argument like this.
> >
> > The caller holds a grab reference on mm, and mm_lock_acct does mmget_not_zero to
> > validate the mm. IMO this is a close analog of the original vfio_lock_acct code
> > where the caller holds a get reference on task, and does get_task_mm to validate
> > the mm.
> >
> > However, I can hoist the mmget_not_zero from mm_lock_acct to its callsites in
> > vfio_lock_acct and vfio_change_dma_owner.
>
> Yielding:
>
> static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
>                         bool lock_cap, long npage)
> {
>         int ret = mmap_write_lock_killable(mm);
>
>         if (!ret) {
Please don't write in the 'single return' style; that is not kernel
code.
'Success oriented flow' means you have early returns and goto-error
paths, so a straight-line read of the function tells you what success
looks like.
Jason
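[Editor's note: the two styles Jason contrasts can be sketched in plain C. The fake_* helpers below are hypothetical stand-ins for mmap_write_lock_killable() and __account_locked_vm(), which are kernel-only; both functions compute the same result, but the second reads top-to-bottom as the success path:]

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel helpers; 0 means success. */
static int fake_lock(int should_fail) { return should_fail ? -EINTR : 0; }
static void fake_unlock(void)         { }
static int fake_account(long npage)   { return npage < 0 ? -ENOMEM : 0; }

/* Single-return style: the success path is nested inside the check. */
static int acct_nested(int lock_fails, long npage)
{
        int ret = fake_lock(lock_fails);

        if (!ret) {
                ret = fake_account(npage);
                fake_unlock();
        }

        return ret;
}

/* Success-oriented flow: bail out early on error, so a straight-line
 * read of the function body is the success case. */
static int acct_early_return(int lock_fails, long npage)
{
        int ret = fake_lock(lock_fails);

        if (ret)
                return ret;

        ret = fake_account(npage);
        fake_unlock();
        return ret;
}
```

The early-return form also scales better when more failure points are added: each new check is one `if (ret) return ret;` (or `goto err;` when cleanup is needed) rather than another level of nesting.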
* Re: [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec()
2023-01-09 21:16 ` Steven Sistare
@ 2023-01-10 15:02 ` Jason Gunthorpe
0 siblings, 0 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 15:02 UTC (permalink / raw)
To: Steven Sistare; +Cc: Alex Williamson, kvm, Cornelia Huck, Kevin Tian
On Mon, Jan 09, 2023 at 04:16:18PM -0500, Steven Sistare wrote:
> Let's leave the async arg and vfio_lock_acct as is. We are over-polishing a small
> style issue, in pre-existing code, in a soon-to-be dead-end code base.
fine
Jason
end of thread, other threads: [~2023-01-10 15:02 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-20 20:39 [PATCH V7 0/7] fixes for virtual address update Steve Sistare
2022-12-20 20:39 ` [PATCH V7 1/7] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
2022-12-20 20:39 ` [PATCH V7 2/7] vfio/type1: prevent underflow of locked_vm via exec() Steve Sistare
2023-01-03 15:20 ` Jason Gunthorpe
2023-01-03 18:12 ` Steven Sistare
2023-01-03 19:20 ` Jason Gunthorpe
2023-01-06 15:14 ` Steven Sistare
2023-01-09 13:52 ` Jason Gunthorpe
2023-01-09 15:54 ` Steven Sistare
2023-01-09 21:16 ` Steven Sistare
2023-01-10 15:02 ` Jason Gunthorpe
2022-12-20 20:39 ` [PATCH V7 3/7] vfio/type1: track locked_vm per dma Steve Sistare
2023-01-03 15:21 ` Jason Gunthorpe
2023-01-03 18:13 ` Steven Sistare
2023-01-09 21:24 ` Steven Sistare
2023-01-10 0:31 ` Jason Gunthorpe
2022-12-20 20:39 ` [PATCH V7 4/7] vfio/type1: restore locked_vm Steve Sistare
2023-01-03 15:22 ` Jason Gunthorpe
2023-01-03 18:12 ` Steven Sistare
2022-12-20 20:39 ` [PATCH V7 5/7] vfio/type1: revert "block on invalid vaddr" Steve Sistare
2022-12-20 20:39 ` [PATCH V7 6/7] vfio/type1: revert "implement notify callback" Steve Sistare
2022-12-20 20:39 ` [PATCH V7 7/7] vfio: revert "iommu driver " Steve Sistare