public inbox for kvm@vger.kernel.org
* [PATCH V1 0/2] fixes for virtual address update
@ 2022-12-13 15:46 Steve Sistare
  2022-12-13 15:46 ` [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
  2022-12-13 15:46 ` [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow Steve Sistare
  0 siblings, 2 replies; 12+ messages in thread
From: Steve Sistare @ 2022-12-13 15:46 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Fix bugs in the interfaces that allow the underlying memory object of an
iova range to be mapped in a new address space.  These interfaces currently
allow userland to block vfio mediated device kernel threads indefinitely,
and they fail to propagate the locked_vm count to the new mm.

Steve Sistare (2):
  vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
  vfio/type1: prevent locked_vm underflow

 drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |  6 +++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
  2022-12-13 15:46 [PATCH V1 0/2] fixes for virtual address update Steve Sistare
@ 2022-12-13 15:46 ` Steve Sistare
  2022-12-13 16:26   ` Alex Williamson
  2022-12-13 15:46 ` [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow Steve Sistare
  1 sibling, 1 reply; 12+ messages in thread
From: Steve Sistare @ 2022-12-13 15:46 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
Their kernel threads could be blocked indefinitely by a misbehaving
userland while trying to pin/unpin pages while vaddrs are being updated.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 25 ++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |  6 +++++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe..f81e925 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1343,6 +1343,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 	mutex_lock(&iommu->lock);
 
+	/* Cannot update vaddr if mdev is present. */
+	if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups))
+		goto unlock;
+
 	pgshift = __ffs(iommu->pgsize_bitmap);
 	pgsize = (size_t)1 << pgshift;
 
@@ -2189,6 +2193,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
+	/* Prevent an mdev from sneaking in while vaddr flags are used. */
+	if (iommu->vaddr_invalid_count && type == VFIO_EMULATED_IOMMU)
+		goto out_unlock;
+
 	/* Check for duplicates */
 	if (vfio_iommu_find_iommu_group(iommu, iommu_group))
 		goto out_unlock;
@@ -2660,6 +2668,20 @@ static int vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
 	return ret;
 }
 
+/*
+ * Disable this feature if mdevs are present.  They cannot safely pin/unpin
+ * while vaddrs are being updated.
+ */
+static int vfio_iommu_can_update_vaddr(struct vfio_iommu *iommu)
+{
+	int ret;
+
+	mutex_lock(&iommu->lock);
+	ret = list_empty(&iommu->emulated_iommu_groups);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 					    unsigned long arg)
 {
@@ -2668,8 +2690,9 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
-	case VFIO_UPDATE_VADDR:
 		return 1;
+	case VFIO_UPDATE_VADDR:
+		return iommu && vfio_iommu_can_update_vaddr(iommu);
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
 			return 0;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d7d8e09..6d36b84 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -49,7 +49,11 @@
 /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
 #define VFIO_UNMAP_ALL			9
 
-/* Supports the vaddr flag for DMA map and unmap */
+/*
+ * Supports the vaddr flag for DMA map and unmap.  Not supported for mediated
+ * devices, so this capability is subject to change as groups are added or
+ * removed.
+ */
 #define VFIO_UPDATE_VADDR		10
 
 /*
-- 
1.8.3.1



* [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow
  2022-12-13 15:46 [PATCH V1 0/2] fixes for virtual address update Steve Sistare
  2022-12-13 15:46 ` [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
@ 2022-12-13 15:46 ` Steve Sistare
  2022-12-13 18:02   ` Alex Williamson
  1 sibling, 1 reply; 12+ messages in thread
From: Steve Sistare @ 2022-12-13 15:46 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Steve Sistare

When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
dma mapping, locked_vm underflows to a large unsigned value, and a
subsequent dma map request fails with ENOMEM in __account_locked_vm.

To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
add the mapping's pinned page count to the new mm->locked_vm, subject to
the rlimit.  Now that mediated devices are excluded when using
VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
mapping.

Underflow will not occur when all dma mappings are invalidated before exec.
An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
will fail with EINVAL because the mapping is in the vaddr_invalid state.
Underflow may still occur in a buggy application that fails to invalidate
all before exec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f81e925..e5a02f8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -100,6 +100,7 @@ struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct mm_struct	*mm;
 };
 
 struct vfio_batch {
@@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
+	mmdrop(dma->mm);
 	vfio_dma_bitmap_free(dma);
 	if (dma->vaddr_invalid) {
 		iommu->vaddr_invalid_count--;
@@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			dma->vaddr = vaddr;
 			dma->vaddr_invalid = false;
 			iommu->vaddr_invalid_count--;
+			if (current->mm != dma->mm) {
+				mmdrop(dma->mm);
+				dma->mm = current->mm;
+				mmgrab(dma->mm);
+				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
+						     0);
+			}
 			wake_up_all(&iommu->vaddr_wait);
 		}
 		goto out_unlock;
@@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	get_task_struct(current->group_leader);
 	dma->task = current->group_leader;
 	dma->lock_cap = capable(CAP_IPC_LOCK);
+	dma->mm = dma->task->mm;
+	mmgrab(dma->mm);
 
 	dma->pfn_list = RB_ROOT;
 
-- 
1.8.3.1



* Re: [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
  2022-12-13 15:46 ` [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
@ 2022-12-13 16:26   ` Alex Williamson
  2022-12-13 16:54     ` Steven Sistare
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-12-13 16:26 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Jason Gunthorpe

On Tue, 13 Dec 2022 07:46:55 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
> Their kernel threads could be blocked indefinitely by a misbehaving
> userland while trying to pin/unpin pages while vaddrs are being updated.

Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 25 ++++++++++++++++++++++++-
>  include/uapi/linux/vfio.h       |  6 +++++-
>  2 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 23c24fe..f81e925 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1343,6 +1343,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  
>  	mutex_lock(&iommu->lock);
>  
> +	/* Cannot update vaddr if mdev is present. */
> +	if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups))
> +		goto unlock;
> +
>  	pgshift = __ffs(iommu->pgsize_bitmap);
>  	pgsize = (size_t)1 << pgshift;
>  
> @@ -2189,6 +2193,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> +	/* Prevent an mdev from sneaking in while vaddr flags are used. */
> +	if (iommu->vaddr_invalid_count && type == VFIO_EMULATED_IOMMU)
> +		goto out_unlock;

Why only mdev devices?  If we restrict that the user cannot attach a
group while there are invalid vaddrs, and the pin/unpin pages and
dma_rw interfaces are restricted to cases where vaddr_invalid_count is
zero, then we can get rid of all the code to handle waiting for vaddrs.
ie. we could still revert:

898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
487ace134053 ("vfio/type1: implement notify callback")
ec5e32940cc9 ("vfio: iommu driver notify callback")

It appears to me it might be easiest to lead with a clean revert of
these, then follow-up imposing the usage restrictions, and I'd go ahead
and add WARN_ON error paths to the pin/unpin/dma_rw paths to make sure
nobody enters those paths with an elevated invalid count.  Thanks,

Alex

> +
>  	/* Check for duplicates */
>  	if (vfio_iommu_find_iommu_group(iommu, iommu_group))
>  		goto out_unlock;
> @@ -2660,6 +2668,20 @@ static int vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
>  	return ret;
>  }
>  
> +/*
> + * Disable this feature if mdevs are present.  They cannot safely pin/unpin
> + * while vaddrs are being updated.
> + */
> +static int vfio_iommu_can_update_vaddr(struct vfio_iommu *iommu)
> +{
> +	int ret;
> +
> +	mutex_lock(&iommu->lock);
> +	ret = list_empty(&iommu->emulated_iommu_groups);
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  					    unsigned long arg)
>  {
> @@ -2668,8 +2690,9 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  	case VFIO_TYPE1v2_IOMMU:
>  	case VFIO_TYPE1_NESTING_IOMMU:
>  	case VFIO_UNMAP_ALL:
> -	case VFIO_UPDATE_VADDR:
>  		return 1;
> +	case VFIO_UPDATE_VADDR:
> +		return iommu && vfio_iommu_can_update_vaddr(iommu);
>  	case VFIO_DMA_CC_IOMMU:
>  		if (!iommu)
>  			return 0;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index d7d8e09..6d36b84 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -49,7 +49,11 @@
>  /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
>  #define VFIO_UNMAP_ALL			9
>  
> -/* Supports the vaddr flag for DMA map and unmap */
> +/*
> + * Supports the vaddr flag for DMA map and unmap.  Not supported for mediated
> + * devices, so this capability is subject to change as groups are added or
> + * removed.
> + */
>  #define VFIO_UPDATE_VADDR		10
>  
>  /*



* Re: [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
  2022-12-13 16:26   ` Alex Williamson
@ 2022-12-13 16:54     ` Steven Sistare
  2022-12-13 17:31       ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Steven Sistare @ 2022-12-13 16:54 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Jason Gunthorpe

On 12/13/2022 11:26 AM, Alex Williamson wrote:
> On Tue, 13 Dec 2022 07:46:55 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
>> Their kernel threads could be blocked indefinitely by a misbehaving
>> userland while trying to pin/unpin pages while vaddrs are being updated.
> 
> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
> 
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 25 ++++++++++++++++++++++++-
>>  include/uapi/linux/vfio.h       |  6 +++++-
>>  2 files changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 23c24fe..f81e925 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1343,6 +1343,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> +	/* Cannot update vaddr if mdev is present. */
>> +	if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups))
>> +		goto unlock;
>> +
>>  	pgshift = __ffs(iommu->pgsize_bitmap);
>>  	pgsize = (size_t)1 << pgshift;
>>  
>> @@ -2189,6 +2193,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> +	/* Prevent an mdev from sneaking in while vaddr flags are used. */
>> +	if (iommu->vaddr_invalid_count && type == VFIO_EMULATED_IOMMU)
>> +		goto out_unlock;
> 
> Why only mdev devices?  If we restrict that the user cannot attach a
> group while there are invalid vaddrs, and the pin/unpin pages and
> dma_rw interfaces are restricted to cases where vaddr_invalid_count is
> zero, then we can get rid of all the code to handle waiting for vaddrs.
> ie. we could still revert:
> 
> 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
> 487ace134053 ("vfio/type1: implement notify callback")
> ec5e32940cc9 ("vfio: iommu driver notify callback")
> 
> It appears to me it might be easiest to lead with a clean revert of
> these, then follow-up imposing the usage restrictions, and I'd go ahead
> and add WARN_ON error paths to the pin/unpin/dma_rw paths to make sure
> nobody enters those paths with an elevated invalid count.  Thanks,

Will do.  I think I will put the revert at the end, though, as dead code 
clean up.  That patch will be larger, and if it is judged to be too large
for stable, it can be omitted from stable.

- Steve

>> +
>>  	/* Check for duplicates */
>>  	if (vfio_iommu_find_iommu_group(iommu, iommu_group))
>>  		goto out_unlock;
>> @@ -2660,6 +2668,20 @@ static int vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
>>  	return ret;
>>  }
>>  
>> +/*
>> + * Disable this feature if mdevs are present.  They cannot safely pin/unpin
>> + * while vaddrs are being updated.
>> + */
>> +static int vfio_iommu_can_update_vaddr(struct vfio_iommu *iommu)
>> +{
>> +	int ret;
>> +
>> +	mutex_lock(&iommu->lock);
>> +	ret = list_empty(&iommu->emulated_iommu_groups);
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>>  static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>  					    unsigned long arg)
>>  {
>> @@ -2668,8 +2690,9 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>  	case VFIO_TYPE1v2_IOMMU:
>>  	case VFIO_TYPE1_NESTING_IOMMU:
>>  	case VFIO_UNMAP_ALL:
>> -	case VFIO_UPDATE_VADDR:
>>  		return 1;
>> +	case VFIO_UPDATE_VADDR:
>> +		return iommu && vfio_iommu_can_update_vaddr(iommu);
>>  	case VFIO_DMA_CC_IOMMU:
>>  		if (!iommu)
>>  			return 0;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index d7d8e09..6d36b84 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -49,7 +49,11 @@
>>  /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
>>  #define VFIO_UNMAP_ALL			9
>>  
>> -/* Supports the vaddr flag for DMA map and unmap */
>> +/*
>> + * Supports the vaddr flag for DMA map and unmap.  Not supported for mediated
>> + * devices, so this capability is subject to change as groups are added or
>> + * removed.
>> + */
>>  #define VFIO_UPDATE_VADDR		10
>>  
>>  /*
> 


* Re: [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
  2022-12-13 16:54     ` Steven Sistare
@ 2022-12-13 17:31       ` Alex Williamson
  2022-12-13 17:42         ` Steven Sistare
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-12-13 17:31 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Jason Gunthorpe

On Tue, 13 Dec 2022 11:54:33 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 12/13/2022 11:26 AM, Alex Williamson wrote:
> > On Tue, 13 Dec 2022 07:46:55 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
> >> Their kernel threads could be blocked indefinitely by a misbehaving
> >> userland while trying to pin/unpin pages while vaddrs are being updated.  
> > 
> > Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
> >   
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 25 ++++++++++++++++++++++++-
> >>  include/uapi/linux/vfio.h       |  6 +++++-
> >>  2 files changed, 29 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 23c24fe..f81e925 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -1343,6 +1343,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> +	/* Cannot update vaddr if mdev is present. */
> >> +	if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups))
> >> +		goto unlock;
> >> +
> >>  	pgshift = __ffs(iommu->pgsize_bitmap);
> >>  	pgsize = (size_t)1 << pgshift;
> >>  
> >> @@ -2189,6 +2193,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> +	/* Prevent an mdev from sneaking in while vaddr flags are used. */
> >> +	if (iommu->vaddr_invalid_count && type == VFIO_EMULATED_IOMMU)
> >> +		goto out_unlock;  
> > 
> > Why only mdev devices?  If we restrict that the user cannot attach a
> > group while there are invalid vaddrs, and the pin/unpin pages and
> > dma_rw interfaces are restricted to cases where vaddr_invalid_count is
> > zero, then we can get rid of all the code to handle waiting for vaddrs.
> > ie. we could still revert:
> > 
> > 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
> > 487ace134053 ("vfio/type1: implement notify callback")
> > ec5e32940cc9 ("vfio: iommu driver notify callback")
> > 
> > It appears to me it might be easiest to lead with a clean revert of
> > these, then follow-up imposing the usage restrictions, and I'd go ahead
> > and add WARN_ON error paths to the pin/unpin/dma_rw paths to make sure
> > nobody enters those paths with an elevated invalid count.  Thanks,  
> 
> Will do.  I think I will put the revert at the end, though, as dead code 
> clean up.  That patch will be larger, and if it is judged to be too large
> for stable, it can be omitted from stable.
> 
> - Steve
> 
> >> +
> >>  	/* Check for duplicates */
> >>  	if (vfio_iommu_find_iommu_group(iommu, iommu_group))
> >>  		goto out_unlock;
> >> @@ -2660,6 +2668,20 @@ static int vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
> >>  	return ret;
> >>  }
> >>  
> >> +/*
> >> + * Disable this feature if mdevs are present.  They cannot safely pin/unpin
> >> + * while vaddrs are being updated.
> >> + */
> >> +static int vfio_iommu_can_update_vaddr(struct vfio_iommu *iommu)
> >> +{
> >> +	int ret;
> >> +
> >> +	mutex_lock(&iommu->lock);
> >> +	ret = list_empty(&iommu->emulated_iommu_groups);
> >> +	mutex_unlock(&iommu->lock);
> >> +	return ret;
> >> +}

I'd also keep this generic to what it's actually doing, ie. simply
reporting if emulated_iommu_groups are present, so it could be
something like:

static bool vfio_iommu_has_emulated(struct vfio_iommu *iommu)

OTOH, I'm not sure it actually makes sense to dynamically change the
reported value; the IOMMU backend supports vaddr update, but there are
usage restrictions, and there's no way this test can avoid being racy w/
other user actions.  Does it have specific utility in userspace to test
for this immediately before a live-update, or would QEMU enable support
for the feature based only on some initial condition?  Thanks,

Alex


> >> +
> >>  static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> >>  					    unsigned long arg)
> >>  {
> >> @@ -2668,8 +2690,9 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> >>  	case VFIO_TYPE1v2_IOMMU:
> >>  	case VFIO_TYPE1_NESTING_IOMMU:
> >>  	case VFIO_UNMAP_ALL:
> >> -	case VFIO_UPDATE_VADDR:
> >>  		return 1;
> >> +	case VFIO_UPDATE_VADDR:
> >> +		return iommu && vfio_iommu_can_update_vaddr(iommu);
> >>  	case VFIO_DMA_CC_IOMMU:
> >>  		if (!iommu)
> >>  			return 0;
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index d7d8e09..6d36b84 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -49,7 +49,11 @@
> >>  /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
> >>  #define VFIO_UNMAP_ALL			9
> >>  
> >> -/* Supports the vaddr flag for DMA map and unmap */
> >> +/*
> >> + * Supports the vaddr flag for DMA map and unmap.  Not supported for mediated
> >> + * devices, so this capability is subject to change as groups are added or
> >> + * removed.
> >> + */
> >>  #define VFIO_UPDATE_VADDR		10
> >>  
> >>  /*  
> >   
> 



* Re: [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
  2022-12-13 17:31       ` Alex Williamson
@ 2022-12-13 17:42         ` Steven Sistare
  0 siblings, 0 replies; 12+ messages in thread
From: Steven Sistare @ 2022-12-13 17:42 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Jason Gunthorpe

On 12/13/2022 12:31 PM, Alex Williamson wrote:
> On Tue, 13 Dec 2022 11:54:33 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 12/13/2022 11:26 AM, Alex Williamson wrote:
>>> On Tue, 13 Dec 2022 07:46:55 -0800
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>   
>>>> Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
>>>> Their kernel threads could be blocked indefinitely by a misbehaving
>>>> userland while trying to pin/unpin pages while vaddrs are being updated.  
>>>
>>> Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
>>>   
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 25 ++++++++++++++++++++++++-
>>>>  include/uapi/linux/vfio.h       |  6 +++++-
>>>>  2 files changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 23c24fe..f81e925 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -1343,6 +1343,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>  
>>>>  	mutex_lock(&iommu->lock);
>>>>  
>>>> +	/* Cannot update vaddr if mdev is present. */
>>>> +	if (invalidate_vaddr && !list_empty(&iommu->emulated_iommu_groups))
>>>> +		goto unlock;
>>>> +
>>>>  	pgshift = __ffs(iommu->pgsize_bitmap);
>>>>  	pgsize = (size_t)1 << pgshift;
>>>>  
>>>> @@ -2189,6 +2193,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>>  
>>>>  	mutex_lock(&iommu->lock);
>>>>  
>>>> +	/* Prevent an mdev from sneaking in while vaddr flags are used. */
>>>> +	if (iommu->vaddr_invalid_count && type == VFIO_EMULATED_IOMMU)
>>>> +		goto out_unlock;  
>>>
>>> Why only mdev devices?  If we restrict that the user cannot attach a
>>> group while there are invalid vaddrs, and the pin/unpin pages and
>>> dma_rw interfaces are restricted to cases where vaddr_invalid_count is
>>> zero, then we can get rid of all the code to handle waiting for vaddrs.
>>> ie. we could still revert:
>>>
>>> 898b9eaeb3fe ("vfio/type1: block on invalid vaddr")
>>> 487ace134053 ("vfio/type1: implement notify callback")
>>> ec5e32940cc9 ("vfio: iommu driver notify callback")
>>>
>>> It appears to me it might be easiest to lead with a clean revert of
>>> these, then follow-up imposing the usage restrictions, and I'd go ahead
>>> and add WARN_ON error paths to the pin/unpin/dma_rw paths to make sure
>>> nobody enters those paths with an elevated invalid count.  Thanks,  
>>
>> Will do.  I think I will put the revert at the end, though, as dead code 
>> clean up.  That patch will be larger, and if it is judged to be too large
>> for stable, it can be omitted from stable.
>>
>> - Steve
>>
>>>> +
>>>>  	/* Check for duplicates */
>>>>  	if (vfio_iommu_find_iommu_group(iommu, iommu_group))
>>>>  		goto out_unlock;
>>>> @@ -2660,6 +2668,20 @@ static int vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Disable this feature if mdevs are present.  They cannot safely pin/unpin
>>>> + * while vaddrs are being updated.
>>>> + */
>>>> +static int vfio_iommu_can_update_vaddr(struct vfio_iommu *iommu)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	mutex_lock(&iommu->lock);
>>>> +	ret = list_empty(&iommu->emulated_iommu_groups);
>>>> +	mutex_unlock(&iommu->lock);
>>>> +	return ret;
>>>> +}
> 
> I'd also keep this generic to what it's actually doing, ie. simply
> reporting if emulated_iommu_groups are present, so it could be
> something like:
> 
> static bool vfio_iommu_has_emulated(struct vfio_iommu *iommu)

OK.

> OTOH, I'm not sure it actually makes sense to dynamically change the
> reported value; the IOMMU backend supports vaddr update, but there are
> usage restrictions, and there's no way this test can avoid being racy w/
> other user actions.  

Yes, but VFIO_DMA_CC_IOMMU has the same issue and the same comment,
"this capability is subject to change as groups are added or removed."

> Does it have specific utility in userspace to test
> for this immediately before a live-update, or would QEMU enable support
> for the feature based only on some initial condition?  Thanks,

We always check immediately before live update.  We also check at startup
if only-cpr-capable is specified.  Now the latter check will be done later
in startup, after groups are added.

- Steve

>>>> +
>>>>  static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>>>  					    unsigned long arg)
>>>>  {
>>>> @@ -2668,8 +2690,9 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>>>  	case VFIO_TYPE1v2_IOMMU:
>>>>  	case VFIO_TYPE1_NESTING_IOMMU:
>>>>  	case VFIO_UNMAP_ALL:
>>>> -	case VFIO_UPDATE_VADDR:
>>>>  		return 1;
>>>> +	case VFIO_UPDATE_VADDR:
>>>> +		return iommu && vfio_iommu_can_update_vaddr(iommu);
>>>>  	case VFIO_DMA_CC_IOMMU:
>>>>  		if (!iommu)
>>>>  			return 0;
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index d7d8e09..6d36b84 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -49,7 +49,11 @@
>>>>  /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
>>>>  #define VFIO_UNMAP_ALL			9
>>>>  
>>>> -/* Supports the vaddr flag for DMA map and unmap */
>>>> +/*
>>>> + * Supports the vaddr flag for DMA map and unmap.  Not supported for mediated
>>>> + * devices, so this capability is subject to change as groups are added or
>>>> + * removed.
>>>> + */
>>>>  #define VFIO_UPDATE_VADDR		10
>>>>  
>>>>  /*  
>>>   
>>
> 


* Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow
  2022-12-13 15:46 ` [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow Steve Sistare
@ 2022-12-13 18:02   ` Alex Williamson
  2022-12-13 18:17     ` Steven Sistare
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-12-13 18:02 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck

On Tue, 13 Dec 2022 07:46:56 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
> dma mapping, locked_vm underflows to a large unsigned value, and a
> subsequent dma map request fails with ENOMEM in __account_locked_vm.
> 
> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
> add the mapping's pinned page count to the new mm->locked_vm, subject to
> the rlimit.  Now that mediated devices are excluded when using
> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
> mapping.
> 
> Underflow will not occur when all dma mappings are invalidated before exec.
> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
> will fail with EINVAL because the mapping is in the vaddr_invalid state.

Where is this enforced?

> Underflow may still occur in a buggy application that fails to invalidate
> all before exec.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index f81e925..e5a02f8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -100,6 +100,7 @@ struct vfio_dma {
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>  	unsigned long		*bitmap;
> +	struct mm_struct	*mm;
>  };
>  
>  struct vfio_batch {
> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  	vfio_unmap_unpin(iommu, dma, true);
>  	vfio_unlink_dma(iommu, dma);
>  	put_task_struct(dma->task);
> +	mmdrop(dma->mm);
>  	vfio_dma_bitmap_free(dma);
>  	if (dma->vaddr_invalid) {
>  		iommu->vaddr_invalid_count--;
> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  			dma->vaddr = vaddr;
>  			dma->vaddr_invalid = false;
>  			iommu->vaddr_invalid_count--;
> +			if (current->mm != dma->mm) {
> +				mmdrop(dma->mm);
> +				dma->mm = current->mm;
> +				mmgrab(dma->mm);
> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
> +						     0);

What does it actually mean if this fails?  The pages are still pinned.
locked_vm doesn't get updated.  Underflow can still occur.  Thanks,

Alex

> +			}
>  			wake_up_all(&iommu->vaddr_wait);
>  		}
>  		goto out_unlock;
> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	get_task_struct(current->group_leader);
>  	dma->task = current->group_leader;
>  	dma->lock_cap = capable(CAP_IPC_LOCK);
> +	dma->mm = dma->task->mm;
> +	mmgrab(dma->mm);
>  
>  	dma->pfn_list = RB_ROOT;
>  



* Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow
  2022-12-13 18:02   ` Alex Williamson
@ 2022-12-13 18:17     ` Steven Sistare
  2022-12-13 18:21       ` Steven Sistare
  0 siblings, 1 reply; 12+ messages in thread
From: Steven Sistare @ 2022-12-13 18:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck

On 12/13/2022 1:02 PM, Alex Williamson wrote:
> On Tue, 13 Dec 2022 07:46:56 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
>> dma mapping, locked_vm underflows to a large unsigned value, and a
>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
>>
>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
>> add the mapping's pinned page count to the new mm->locked_vm, subject to
>> the rlimit.  Now that mediated devices are excluded when using
>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
>> mapping.
>>
>> Underflow will not occur when all dma mappings are invalidated before exec.
>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
>> will fail with EINVAL because the mapping is in the vaddr_invalid state.
> 
> Where is this enforced?

In vfio_dma_do_unmap:
        if (invalidate_vaddr) {
                if (dma->vaddr_invalid) {
                        ...
                        ret = -EINVAL;

>> Underflow may still occur in a buggy application that fails to invalidate
>> all before exec.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index f81e925..e5a02f8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>  	struct task_struct	*task;
>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>  	unsigned long		*bitmap;
>> +	struct mm_struct	*mm;
>>  };
>>  
>>  struct vfio_batch {
>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  	vfio_unmap_unpin(iommu, dma, true);
>>  	vfio_unlink_dma(iommu, dma);
>>  	put_task_struct(dma->task);
>> +	mmdrop(dma->mm);
>>  	vfio_dma_bitmap_free(dma);
>>  	if (dma->vaddr_invalid) {
>>  		iommu->vaddr_invalid_count--;
>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			dma->vaddr = vaddr;
>>  			dma->vaddr_invalid = false;
>>  			iommu->vaddr_invalid_count--;
>> +			if (current->mm != dma->mm) {
>> +				mmdrop(dma->mm);
>> +				dma->mm = current->mm;
>> +				mmgrab(dma->mm);
>> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
>> +						     0);
> 
> What does it actually mean if this fails?  The pages are still pinned.
> lock_vm doesn't get updated.  Underflow can still occur.  Thanks,

If this fails, the user has locked additional memory after exec and before making
this call -- more than was locked before exec -- and the rlimit is exceeded.
A misbehaving application, which will only hurt itself.

However, I should reorder these, and check ret before changing the other state.
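
A minimal userspace model of that reordering (all names and structures below are illustrative stand-ins, not the real kernel code): attempt the locked_vm accounting first, and commit the state change only if it succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for struct vfio_dma; not the kernel structure. */
struct dma_model {
	bool vaddr_invalid;
	unsigned long locked_vm;	/* pages accounted to the mm */
	unsigned long rlimit;		/* RLIMIT_MEMLOCK in pages */
};

/* Models vfio_lock_acct(): fail with -ENOMEM if the rlimit would be
 * exceeded, otherwise charge the pages. */
static int lock_acct(struct dma_model *d, unsigned long npages)
{
	if (d->locked_vm + npages > d->rlimit)
		return -ENOMEM;
	d->locked_vm += npages;
	return 0;
}

/* Reordered update: account first, change state only on success, so a
 * failed rlimit check leaves the mapping in the vaddr_invalid state. */
static int update_vaddr(struct dma_model *d, unsigned long npages)
{
	int ret = lock_acct(d, npages);

	if (ret)
		return ret;		/* no state was touched */
	d->vaddr_invalid = false;	/* commit only after accounting */
	return 0;
}
```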

- Steve

>> +			}
>>  			wake_up_all(&iommu->vaddr_wait);
>>  		}
>>  		goto out_unlock;
>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	get_task_struct(current->group_leader);
>>  	dma->task = current->group_leader;
>>  	dma->lock_cap = capable(CAP_IPC_LOCK);
>> +	dma->mm = dma->task->mm;
>> +	mmgrab(dma->mm);
>>  
>>  	dma->pfn_list = RB_ROOT;
>>  
> 


* Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow
  2022-12-13 18:17     ` Steven Sistare
@ 2022-12-13 18:21       ` Steven Sistare
  2022-12-13 19:29         ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Steven Sistare @ 2022-12-13 18:21 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck

On 12/13/2022 1:17 PM, Steven Sistare wrote:
> On 12/13/2022 1:02 PM, Alex Williamson wrote:
>> On Tue, 13 Dec 2022 07:46:56 -0800
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
>>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
>>> dma mapping, locked_vm underflows to a large unsigned value, and a
>>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
>>>
>>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
>>> add the mapping's pinned page count to the new mm->locked_vm, subject to
>>> the rlimit.  Now that mediated devices are excluded when using
>>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
>>> mapping.
>>>
>>> Underflow will not occur when all dma mappings are invalidated before exec.
>>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
>>> will fail with EINVAL because the mapping is in the vaddr_invalid state.
>>
>> Where is this enforced?
> 
> In vfio_dma_do_unmap:
>         if (invalidate_vaddr) {
>                 if (dma->vaddr_invalid) {
>                         ...
>                         ret = -EINVAL;

My bad, this is a different case, and my comment in the commit message is
incorrect.  I should test mm != dma->mm during unmap as well, and suppress
the locked_vm deduction there.
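
A rough sketch of that unmap-side guard, using a plain integer as a stand-in for the mm pointer (illustrative only, not the real vfio_dma_do_unmap):

```c
#include <assert.h>

/* Illustrative model: mm identities are plain integers rather than
 * struct mm_struct pointers. */
struct dma_model {
	int mm;			/* mm that was charged at map time */
	unsigned long locked_vm;
};

/* Models the unmap path: deduct locked_vm only if the caller's mm is
 * the one that was originally charged; otherwise suppress the
 * deduction so a new, never-charged mm cannot underflow. */
static void do_unmap(struct dma_model *d, int current_mm,
		     unsigned long npages)
{
	if (current_mm == d->mm)
		d->locked_vm -= npages;
}
```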

- Steve

>>> Underflow may still occur in a buggy application that fails to invalidate
>>> all before exec.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index f81e925..e5a02f8 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>  	struct task_struct	*task;
>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>  	unsigned long		*bitmap;
>>> +	struct mm_struct	*mm;
>>>  };
>>>  
>>>  struct vfio_batch {
>>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>  	vfio_unmap_unpin(iommu, dma, true);
>>>  	vfio_unlink_dma(iommu, dma);
>>>  	put_task_struct(dma->task);
>>> +	mmdrop(dma->mm);
>>>  	vfio_dma_bitmap_free(dma);
>>>  	if (dma->vaddr_invalid) {
>>>  		iommu->vaddr_invalid_count--;
>>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>  			dma->vaddr = vaddr;
>>>  			dma->vaddr_invalid = false;
>>>  			iommu->vaddr_invalid_count--;
>>> +			if (current->mm != dma->mm) {
>>> +				mmdrop(dma->mm);
>>> +				dma->mm = current->mm;
>>> +				mmgrab(dma->mm);
>>> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
>>> +						     0);
>>
>> What does it actually mean if this fails?  The pages are still pinned.
>> lock_vm doesn't get updated.  Underflow can still occur.  Thanks,
> 
> If this fails, the user has locked additional memory after exec and before making
> this call -- more than was locked before exec -- and the rlimit is exceeded.
> A misbehaving application, which will only hurt itself.
> 
> However, I should reorder these, and check ret before changing the other state.
> 
> - Steve
> 
>>> +			}
>>>  			wake_up_all(&iommu->vaddr_wait);
>>>  		}
>>>  		goto out_unlock;
>>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>  	get_task_struct(current->group_leader);
>>>  	dma->task = current->group_leader;
>>>  	dma->lock_cap = capable(CAP_IPC_LOCK);
>>> +	dma->mm = dma->task->mm;
>>> +	mmgrab(dma->mm);
>>>  
>>>  	dma->pfn_list = RB_ROOT;
>>>  
>>


* Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow
  2022-12-13 18:21       ` Steven Sistare
@ 2022-12-13 19:29         ` Alex Williamson
  2022-12-13 19:40           ` Steven Sistare
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-12-13 19:29 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck

On Tue, 13 Dec 2022 13:21:15 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 12/13/2022 1:17 PM, Steven Sistare wrote:
> > On 12/13/2022 1:02 PM, Alex Williamson wrote:  
> >> On Tue, 13 Dec 2022 07:46:56 -0800
> >> Steve Sistare <steven.sistare@oracle.com> wrote:
> >>  
> >>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
> >>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
> >>> dma mapping, locked_vm underflows to a large unsigned value, and a
> >>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
> >>>
> >>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
> >>> add the mapping's pinned page count to the new mm->locked_vm, subject to
> >>> the rlimit.  Now that mediated devices are excluded when using
> >>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
> >>> mapping.
> >>>
> >>> Underflow will not occur when all dma mappings are invalidated before exec.
> >>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
> >>> will fail with EINVAL because the mapping is in the vaddr_invalid state.  
> >>
> >> Where is this enforced?  
> > 
> > In vfio_dma_do_unmap:
> >         if (invalidate_vaddr) {
> >                 if (dma->vaddr_invalid) {
> >                         ...
> >                         ret = -EINVAL;  
> 
> My bad, this is a different case, and my comment in the commit message is
> incorrect.  I should test mm != dma->mm during unmap as well, and suppress
> the locked_vm deduction there.

I'm getting confused how this patch actually does anything.  We grab
the mm of the task doing mappings, and we swap that grab when updating
the vaddr, but vfio_lock_acct() uses the original dma->task mm for
accounting.  Therefore how can an underflow occur?  It seems we're
simply failing to adjust locked_vm for the new mm at all.

> >>> Underflow may still occur in a buggy application that fails to invalidate
> >>> all before exec.
> >>>
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> ---
> >>>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
> >>>  1 file changed, 11 insertions(+)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index f81e925..e5a02f8 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -100,6 +100,7 @@ struct vfio_dma {
> >>>  	struct task_struct	*task;
> >>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>>  	unsigned long		*bitmap;
> >>> +	struct mm_struct	*mm;
> >>>  };
> >>>  
> >>>  struct vfio_batch {
> >>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>>  	vfio_unmap_unpin(iommu, dma, true);
> >>>  	vfio_unlink_dma(iommu, dma);
> >>>  	put_task_struct(dma->task);
> >>> +	mmdrop(dma->mm);
> >>>  	vfio_dma_bitmap_free(dma);
> >>>  	if (dma->vaddr_invalid) {
> >>>  		iommu->vaddr_invalid_count--;
> >>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>>  			dma->vaddr = vaddr;
> >>>  			dma->vaddr_invalid = false;
> >>>  			iommu->vaddr_invalid_count--;
> >>> +			if (current->mm != dma->mm) {
> >>> +				mmdrop(dma->mm);
> >>> +				dma->mm = current->mm;
> >>> +				mmgrab(dma->mm);
> >>> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
> >>> +						     0);  
> >>
> >> What does it actually mean if this fails?  The pages are still pinned.
> >> lock_vm doesn't get updated.  Underflow can still occur.  Thanks,  
> > 
> > If this fails, the user has locked additional memory after exec and before making
> > this call -- more than was locked before exec -- and the rlimit is exceeded.
> > A misbehaving application, which will only hurt itself.
> > 
> > However, I should reorder these, and check ret before changing the other state.

The result would then be that the mapping remains with vaddr_invalid on
error?  Thanks,

Alex

> >>> +			}
> >>>  			wake_up_all(&iommu->vaddr_wait);
> >>>  		}
> >>>  		goto out_unlock;
> >>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>>  	get_task_struct(current->group_leader);
> >>>  	dma->task = current->group_leader;
> >>>  	dma->lock_cap = capable(CAP_IPC_LOCK);
> >>> +	dma->mm = dma->task->mm;
> >>> +	mmgrab(dma->mm);
> >>>  
> >>>  	dma->pfn_list = RB_ROOT;
> >>>    
> >>  
> 



* Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow
  2022-12-13 19:29         ` Alex Williamson
@ 2022-12-13 19:40           ` Steven Sistare
  0 siblings, 0 replies; 12+ messages in thread
From: Steven Sistare @ 2022-12-13 19:40 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck

On 12/13/2022 2:29 PM, Alex Williamson wrote:
> On Tue, 13 Dec 2022 13:21:15 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 12/13/2022 1:17 PM, Steven Sistare wrote:
>>> On 12/13/2022 1:02 PM, Alex Williamson wrote:  
>>>> On Tue, 13 Dec 2022 07:46:56 -0800
>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>  
>>>>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
>>>>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
>>>>> dma mapping, locked_vm underflows to a large unsigned value, and a
>>>>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
>>>>>
>>>>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
>>>>> add the mapping's pinned page count to the new mm->locked_vm, subject to
>>>>> the rlimit.  Now that mediated devices are excluded when using
>>>>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
>>>>> mapping.
>>>>>
>>>>> Underflow will not occur when all dma mappings are invalidated before exec.
>>>>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
>>>>> will fail with EINVAL because the mapping is in the vaddr_invalid state.  
>>>>
>>>> Where is this enforced?  
>>>
>>> In vfio_dma_do_unmap:
>>>         if (invalidate_vaddr) {
>>>                 if (dma->vaddr_invalid) {
>>>                         ...
>>>                         ret = -EINVAL;  
>>
>> My bad, this is a different case, and my comment in the commit message is
>> incorrect.  I should test mm != dma->mm during unmap as well, and suppress
>> the locked_vm deduction there.
> 
> I'm getting confused how this patch actually does anything.  We grab
> the mm of the task doing mappings, and we swap that grab when updating
> the vaddr, but vfio_lock_acct() uses the original dma->task mm for
> accounting.  Therefore how can an underflow occur?  It seems we're
> simply failing to adjust locked_vm for the new mm at all.

The old code saves dma->task, but not dma->task->mm.  The task's mm changes 
across exec.

>>>>> Underflow may still occur in a buggy application that fails to invalidate
>>>>> all before exec.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
>>>>>  1 file changed, 11 insertions(+)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index f81e925..e5a02f8 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>  	struct task_struct	*task;
>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>  	unsigned long		*bitmap;
>>>>> +	struct mm_struct	*mm;
>>>>>  };
>>>>>  
>>>>>  struct vfio_batch {
>>>>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>>>  	vfio_unmap_unpin(iommu, dma, true);
>>>>>  	vfio_unlink_dma(iommu, dma);
>>>>>  	put_task_struct(dma->task);
>>>>> +	mmdrop(dma->mm);
>>>>>  	vfio_dma_bitmap_free(dma);
>>>>>  	if (dma->vaddr_invalid) {
>>>>>  		iommu->vaddr_invalid_count--;
>>>>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  			dma->vaddr = vaddr;
>>>>>  			dma->vaddr_invalid = false;
>>>>>  			iommu->vaddr_invalid_count--;
>>>>> +			if (current->mm != dma->mm) {
>>>>> +				mmdrop(dma->mm);
>>>>> +				dma->mm = current->mm;
>>>>> +				mmgrab(dma->mm);
>>>>> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
>>>>> +						     0);  
>>>>
>>>> What does it actually mean if this fails?  The pages are still pinned.
>>>> lock_vm doesn't get updated.  Underflow can still occur.  Thanks,  
>>>
>>> If this fails, the user has locked additional memory after exec and before making
>>> this call -- more than was locked before exec -- and the rlimit is exceeded.
>>> A misbehaving application, which will only hurt itself.
>>>
>>> However, I should reorder these, and check ret before changing the other state.
> 
> The result would then be that the mapping remains with vaddr_invalid on
> error?  Thanks,

Correct.  In theory the app could recover by releasing the extra locked memory that
it grabbed, or by increasing its rlimit, and then trying VFIO_DMA_MAP_FLAG_VADDR again.

- Steve

>>>>> +			}
>>>>>  			wake_up_all(&iommu->vaddr_wait);
>>>>>  		}
>>>>>  		goto out_unlock;
>>>>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  	get_task_struct(current->group_leader);
>>>>>  	dma->task = current->group_leader;
>>>>>  	dma->lock_cap = capable(CAP_IPC_LOCK);
>>>>> +	dma->mm = dma->task->mm;
>>>>> +	mmgrab(dma->mm);
>>>>>  
>>>>>  	dma->pfn_list = RB_ROOT;
>>>>>    
>>>>  
>>
> 


End of thread (newest message: 2022-12-13 19:40 UTC)

Thread overview: 12+ messages:
2022-12-13 15:46 [PATCH V1 0/2] fixes for virtual address update Steve Sistare
2022-12-13 15:46 ` [PATCH V1 1/2] vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Steve Sistare
2022-12-13 16:26   ` Alex Williamson
2022-12-13 16:54     ` Steven Sistare
2022-12-13 17:31       ` Alex Williamson
2022-12-13 17:42         ` Steven Sistare
2022-12-13 15:46 ` [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow Steve Sistare
2022-12-13 18:02   ` Alex Williamson
2022-12-13 18:17     ` Steven Sistare
2022-12-13 18:21       ` Steven Sistare
2022-12-13 19:29         ` Alex Williamson
2022-12-13 19:40           ` Steven Sistare
