public inbox for kvm@vger.kernel.org
* [PATCH 0/3] vfio, dax: disable filesystem-dax and minor fixups
@ 2018-02-04 23:05 Dan Williams
From: Dan Williams @ 2018-02-04 23:05 UTC
  To: alex.williamson
  Cc: Michal Hocko, jack, kvm, linux-nvdimm, linux-kernel, stable,
	Alexander Viro, linux-fsdevel, hch

Alex, here is a change to vaddr_get_pfn() that we discussed in this
thread: https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07117.html

Namely, drop support for passing Filesystem-DAX mappings through to
guests. Perhaps in the future we can create some para-virtualized
passthrough interface to coordinate guest-DMA vs host-filesystem
operations. For now, this needs to be disabled to preserve data
integrity and to guarantee forward progress of filesystem operations.

If you want to take this through your tree please grab the other dax
fixups as well. Otherwise, let me know and I'll take the lot through the
nvdimm tree.

---

Dan Williams (3):
      dax: fix S_DAX definition
      dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
      vfio: disable filesystem-dax page pinning


 drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
 include/linux/fs.h              |    4 +++-
 2 files changed, 18 insertions(+), 4 deletions(-)


* [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Dan Williams @ 2018-02-04 23:05 UTC
  To: alex.williamson
  Cc: Michal Hocko, jack, kvm, linux-nvdimm, linux-kernel, stable,
	linux-fsdevel, hch

Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding off filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.

RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.

Note that xfs and ext4 still report:

   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e30e29ae4819..45657e2b1ff7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct vm_area_struct *vmas[1];
 	int ret;
 
 	if (mm == current->mm) {
-		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
-					  page);
+		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
+					      page, vmas);
 	} else {
 		unsigned int flags = 0;
 
@@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 		down_read(&mm->mmap_sem);
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    NULL, NULL);
+					    vmas, NULL);
+		/*
+		 * The lifetime of a vaddr_get_pfn() page pin is
+		 * userspace-controlled. In the fs-dax case this could
+		 * lead to indefinite stalls in filesystem operations.
+		 * Disallow attempts to pin fs-dax pages via this
+		 * interface.
+		 */
+		if (ret > 0 && vma_is_fsdax(vmas[0])) {
+			ret = -EOPNOTSUPP;
+			put_page(page[0]);
+		}
 		up_read(&mm->mmap_sem);
 	}
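
The check added to the remote branch above can be modelled outside the
kernel. The sketch below is a user-space illustration only: the model_*
names and the two-field struct vm_area_struct are hypothetical stand-ins,
not the kernel definitions, and the pin helper is assumed to always
succeed. It shows the property the patch is after: a rejected fs-dax pin
returns -EOPNOTSUPP and leaves no page reference behind.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel types; only what the model needs. */
struct page { int refcount; };
struct vm_area_struct { bool is_dax; bool is_chardev; };

/* Model of vma_is_fsdax(): a DAX mapping not backed by a device-dax chardev. */
static bool model_vma_is_fsdax(const struct vm_area_struct *vma)
{
	return vma->is_dax && !vma->is_chardev;
}

/*
 * Model of a single-page pin that always succeeds: take a reference and
 * report both the page and the VMA that maps it.
 */
static int model_pin_one(struct page *pg, struct vm_area_struct *vma,
			 struct page **page, struct vm_area_struct **vmas)
{
	pg->refcount++;
	page[0] = pg;
	vmas[0] = vma;
	return 1;
}

/* Drop one reference; dropping more than were taken is a model bug. */
static void model_put_page(struct page *pg)
{
	assert(pg->refcount > 0);
	pg->refcount--;
}

/*
 * Same shape as the check in the patch: pin, inspect the VMA, and undo
 * the pin if the mapping turns out to be fs-dax.
 */
static int model_pin_remote(struct page *pg, struct vm_area_struct *vma,
			    struct page **page)
{
	struct vm_area_struct *vmas[1];
	int ret = model_pin_one(pg, vma, page, vmas);

	if (ret > 0 && model_vma_is_fsdax(vmas[0])) {
		ret = -EOPNOTSUPP;
		model_put_page(page[0]);
	}
	return ret;
}
```

In this model an anonymous or device-dax mapping keeps its reference,
while an fs-dax mapping is rejected and the reference count returns to
its previous value.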


* Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Haozhong Zhang @ 2018-02-05  3:46 UTC
  To: Dan Williams
  Cc: Michal Hocko, jack, kvm, linux-nvdimm, linux-kernel, stable,
	alex.williamson, linux-fsdevel, hch

On 02/04/18 15:05 -0800, Dan Williams wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent the situation of
> userspace holding off filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>  
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);

vmas is not used subsequently if this branch is taken, so can we use
NULL here?

Thanks,
Haozhong

>  	} else {
>  		unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  		down_read(&mm->mmap_sem);
>  		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -					    NULL, NULL);
> +					    vmas, NULL);
> +		/*
> +		 * The lifetime of a vaddr_get_pfn() page pin is
> +		 * userspace-controlled. In the fs-dax case this could
> +		 * lead to indefinite stalls in filesystem operations.
> +		 * Disallow attempts to pin fs-dax pages via this
> +		 * interface.
> +		 */
> +		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> +			ret = -EOPNOTSUPP;
> +			put_page(page[0]);
> +		}
>  		up_read(&mm->mmap_sem);
>  	}
>  
> 


* Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Dan Williams @ 2018-02-05  3:54 UTC
  To: Dan Williams, Alex Williamson, Michal Hocko, Jan Kara, KVM list,
	linux-nvdimm, Linux Kernel Mailing List, stable, linux-fsdevel,
	Christoph Hellwig

On Sun, Feb 4, 2018 at 7:46 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote:
> On 02/04/18 15:05 -0800, Dan Williams wrote:
>> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
>> page cache indirection a DAX mapping maps filesystem blocks directly.
>> This means that the filesystem must not modify a file's block map while
>> any page in a mapping is pinned. In order to prevent the situation of
>> userspace holding off filesystem operations indefinitely, disallow
>> 'longterm' Filesystem-DAX mappings.
>>
>> RDMA has the same conflict and the plan there is to add a 'with lease'
>> mechanism to allow the kernel to notify userspace that the mapping is
>> being torn down for block-map maintenance. Perhaps something similar can
>> be put in place for vfio.
>>
>> Note that xfs and ext4 still report:
>>
>>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>>
>> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
>> of the last hurdles to remove that designation.
>>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: kvm@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e30e29ae4819..45657e2b1ff7 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  {
>>       struct page *page[1];
>>       struct vm_area_struct *vma;
>> +     struct vm_area_struct *vmas[1];
>>       int ret;
>>
>>       if (mm == current->mm) {
>> -             ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
>> -                                       page);
>> +             ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
>> +                                           page, vmas);
>
> vmas is not used subsequently if this branch is taken, so can we use
> NULL here?

I'd rather go the other way and refactor this a bit further to skip
the find_vma_intersection() below since get_user_pages() already does
that work.
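
The refactor itself is not shown in this thread. As a rough user-space
model of the idea (every name below is hypothetical, and the linear
lookup_vma() merely stands in for the kernel's VMA lookup), a pin helper
that already has to find the VMA can hand it back through an optional
out-parameter, so the caller does not need a second lookup:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_VM_SPECIAL 0x1u	/* hypothetical flag the caller cares about */

struct vma { unsigned long start, end; unsigned int flags; };
struct mm { struct vma *vmas; size_t nr; unsigned long lookups; };

/* Linear search for the VMA covering addr; counts how often it runs. */
static struct vma *lookup_vma(struct mm *mm, unsigned long addr)
{
	mm->lookups++;
	for (size_t i = 0; i < mm->nr; i++)
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/* Model pin: it has to find the VMA anyway; vma_out may be NULL. */
static int pin_one(struct mm *mm, unsigned long addr, struct vma **vma_out)
{
	struct vma *vma = lookup_vma(mm, addr);

	if (vma_out)
		*vma_out = vma;
	return vma ? 1 : -EFAULT;
}

/* Caller that discards the VMA and then looks it up a second time. */
static int has_flag_two_lookups(struct mm *mm, unsigned long addr)
{
	struct vma *vma;

	if (pin_one(mm, addr, NULL) < 0)
		return 0;
	vma = lookup_vma(mm, addr);
	return vma != NULL && (vma->flags & MODEL_VM_SPECIAL) != 0;
}

/* Caller that reuses the VMA reported by the pin. */
static int has_flag_one_lookup(struct mm *mm, unsigned long addr)
{
	struct vma *vma = NULL;

	if (pin_one(mm, addr, &vma) < 0)
		return 0;
	assert(vma != NULL);	/* a successful model pin always reports its VMA */
	return (vma->flags & MODEL_VM_SPECIAL) != 0;
}
```

Both callers return the same answer; the second performs one lookup per
address instead of two. Whether the real pinning helpers report a VMA on
every path is not something this model claims.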


* Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Alex Williamson @ 2018-02-05 21:44 UTC
  To: Dan Williams
  Cc: Haozhong Zhang, Michal Hocko, jack, kvm, linux-nvdimm,
	linux-kernel, stable, linux-fsdevel, hch, aik

On Sun, 04 Feb 2018 15:05:30 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent the situation of
> userspace holding off filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)

This isn't without some expense: a vfio mapping and un-mapping unit
test incurs a ~1.5% increase in system time from losing access to
gup_fast(). Also, I think tce_iommu_use_page() is going to have the same
problem, since it provides the same sort of functionality for a
different vfio IOMMU backend.  Please take this through your tree and
I'll add a todo list item to see how we might improve this.

Acked-by: Alex Williamson <alex.williamson@redhat.com>

Thanks,
Alex

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>  
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);
>  	} else {
>  		unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  		down_read(&mm->mmap_sem);
>  		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -					    NULL, NULL);
> +					    vmas, NULL);
> +		/*
> +		 * The lifetime of a vaddr_get_pfn() page pin is
> +		 * userspace-controlled. In the fs-dax case this could
> +		 * lead to indefinite stalls in filesystem operations.
> +		 * Disallow attempts to pin fs-dax pages via this
> +		 * interface.
> +		 */
> +		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> +			ret = -EOPNOTSUPP;
> +			put_page(page[0]);
> +		}
>  		up_read(&mm->mmap_sem);
>  	}
>  
> 


* Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Dan Williams @ 2018-02-05 22:01 UTC
  To: Alex Williamson
  Cc: Michal Hocko, Jan Kara, KVM list, linux-nvdimm, aik,
	Linux Kernel Mailing List, stable, linux-fsdevel,
	Christoph Hellwig

On Mon, Feb 5, 2018 at 1:44 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Sun, 04 Feb 2018 15:05:30 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
>> page cache indirection a DAX mapping maps filesystem blocks directly.
>> This means that the filesystem must not modify a file's block map while
>> any page in a mapping is pinned. In order to prevent the situation of
>> userspace holding off filesystem operations indefinitely, disallow
>> 'longterm' Filesystem-DAX mappings.
>>
>> RDMA has the same conflict and the plan there is to add a 'with lease'
>> mechanism to allow the kernel to notify userspace that the mapping is
>> being torn down for block-map maintenance. Perhaps something similar can
>> be put in place for vfio.
>>
>> Note that xfs and ext4 still report:
>>
>>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>>
>> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
>> of the last hurdles to remove that designation.
>>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: kvm@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> This isn't without some expense: a vfio mapping and un-mapping unit
> test incurs a ~1.5% increase in system time from losing access to
> gup_fast(). Also, I think tce_iommu_use_page() is going to have the same
> problem, since it provides the same sort of functionality for a
> different vfio IOMMU backend.  Please take this through your tree and
> I'll add a todo list item to see how we might improve this.
>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>

Thanks Alex.


* Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Haozhong Zhang @ 2018-02-06  7:53 UTC
  To: Dan Williams
  Cc: alex.williamson, Michal Hocko, jack, kvm, linux-nvdimm,
	linux-kernel, stable, linux-fsdevel, hch

Hi Dan,

On 02/04/18 15:05 -0800, Dan Williams wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent the situation of
> userspace holding off filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>  
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);
>  	} else {
>  		unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  		down_read(&mm->mmap_sem);
>  		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -					    NULL, NULL);
> +					    vmas, NULL);
> +		/*
> +		 * The lifetime of a vaddr_get_pfn() page pin is
> +		 * userspace-controlled. In the fs-dax case this could
> +		 * lead to indefinite stalls in filesystem operations.
> +		 * Disallow attempts to pin fs-dax pages via this
> +		 * interface.
> +		 */
> +		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> +			ret = -EOPNOTSUPP;
> +			put_page(page[0]);
> +		}
>  		up_read(&mm->mmap_sem);
>  	}
>  
> 

Besides this patch series, are there other patches needed to make
vma_is_fsdax() work with device-dax?

I applied this patch series on the libnvdimm-for-next branch of the
nvdimm tree (ee95f4059a83), and found that it also breaks device-dax
mapping with vfio. It can be reproduced with the following steps:

1. Attach PCI device at BDF 0000:03:10.2 to vfio-pci.
   # modprobe vfio-pci
   # lspci -n -s 0000:03:10.2
   03:10.2 0200: 8086:1515 (rev 01)
   # echo 0000:03:10.2 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
   # echo 8086:1515 > /sys/bus/pci/drivers/vfio-pci/new_id

2. Use RAM to emulate NVDIMM and create a device-dax device /dev/dax0.0
   # cat /proc/iomem
   ...
   100000000-2ffffffff : Persistent Memory (legacy)
     100000000-2ffffffff : namespace0.0
   ...

   # ndctl create-namespace -f -e namespace0.0 -m dax
   {
     "dev":"namespace0.0",
     "mode":"dax",
     "size":8453619712,
     "uuid":"e1db00bc-f830-4f1b-ac18-091ae7df4f93",
     "daxdevs":[
       {
         "chardev":"dax0.0",
         "size":8453619712
       }
     ]
   }

3. Create a VM with the PCI device assigned in step 1 and the device-dax
   device created in step 2.
   # qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on -smp host \
                        -m 4G,slots=32,maxmem=128G \
                        -drive file=VM_DISK_IMG.img,format=raw,if=virtio \
                        -object memory-backend-file,id=nv_be1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M \
                        -device nvdimm,id=nv1,memdev=nv_be1 \
                        -device ioh3420,id=root.0,slot=4 \
                        -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6

   It then fails with the following QEMU error messages:
     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: VFIO_MAP_DMA: -95
     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio_dma_map(0x5643804a92c0, 0x140000000, 0xffe00000, 0x7f2ed5200000) = -95 (Operation not supported)
     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio error: 0000:03:10.2: failed to setup container for group 52: memory listener initialization failed for container: Operation not supported

   I added the following debug message after the
   get_user_pages_longterm() call in this patch,
       if (vmas[0] && vma_is_dax(vmas[0]))
               printk(KERN_DEBUG "%s: longterm failed for pfn 0x%lx, ret %d\n",
                      __func__, page_to_pfn(page[0]), ret);
   and it shows that get_user_pages_longterm() returns -EOPNOTSUPP on
   the first device-dax page mapping.



Haozhong


* Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning
From: Dan Williams @ 2018-02-06 15:09 UTC
  To: Dan Williams, Alex Williamson, Michal Hocko, Jan Kara, KVM list,
	linux-nvdimm, Linux Kernel Mailing List, stable, linux-fsdevel,
	Christoph Hellwig

On Mon, Feb 5, 2018 at 11:53 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Hi Dan,
>
> On 02/04/18 15:05 -0800, Dan Williams wrote:
>> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
>> page cache indirection a DAX mapping maps filesystem blocks directly.
>> This means that the filesystem must not modify a file's block map while
>> any page in a mapping is pinned. In order to prevent the situation of
>> userspace holding off filesystem operations indefinitely, disallow
>> 'longterm' Filesystem-DAX mappings.
>>
>> RDMA has the same conflict and the plan there is to add a 'with lease'
>> mechanism to allow the kernel to notify userspace that the mapping is
>> being torn down for block-map maintenance. Perhaps something similar can
>> be put in place for vfio.
>>
>> Note that xfs and ext4 still report:
>>
>>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>>
>> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
>> of the last hurdles to remove that designation.
>>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: kvm@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e30e29ae4819..45657e2b1ff7 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  {
>>       struct page *page[1];
>>       struct vm_area_struct *vma;
>> +     struct vm_area_struct *vmas[1];
>>       int ret;
>>
>>       if (mm == current->mm) {
>> -             ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
>> -                                       page);
>> +             ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
>> +                                           page, vmas);
>>       } else {
>>               unsigned int flags = 0;
>>
>> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>
>>               down_read(&mm->mmap_sem);
>>               ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
>> -                                         NULL, NULL);
>> +                                         vmas, NULL);
>> +             /*
>> +              * The lifetime of a vaddr_get_pfn() page pin is
>> +              * userspace-controlled. In the fs-dax case this could
>> +              * lead to indefinite stalls in filesystem operations.
>> +              * Disallow attempts to pin fs-dax pages via this
>> +              * interface.
>> +              */
>> +             if (ret > 0 && vma_is_fsdax(vmas[0])) {
>> +                     ret = -EOPNOTSUPP;
>> +                     put_page(page[0]);
>> +             }
>>               up_read(&mm->mmap_sem);
>>       }
>>
>>
>
> Besides this patch series, are there other patches needed to make
> vma_is_fsdax() work with device-dax?
>
> I applied this patch series on the libnvdimm-for-next branch of the
> nvdimm tree (ee95f4059a83), and found that it also breaks device-dax
> mapping with vfio. It can be reproduced with the following steps:
>
> 1. Attach PCI device at BDF 0000:03:10.2 to vfio-pci.
>    # modprobe vfio-pci
>    # lspci -n -s 0000:03:10.2
>    03:10.2 0200: 8086:1515 (rev 01)
>    # echo 0000:03:10.2 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
>    # echo 8086:1515 > /sys/bus/pci/drivers/vfio-pci/new_id
>
> 2. Use RAM to emulate NVDIMM and create a device-dax device /dev/dax0.0
>    # cat /proc/iomem
>    ...
>    100000000-2ffffffff : Persistent Memory (legacy)
>      100000000-2ffffffff : namespace0.0
>    ...
>
>    # ndctl create-namespace -f -e namespace0.0 -m dax
>    {
>      "dev":"namespace0.0",
>      "mode":"dax",
>      "size":8453619712,
>      "uuid":"e1db00bc-f830-4f1b-ac18-091ae7df4f93",
>      "daxdevs":[
>        {
>          "chardev":"dax0.0",
>          "size":8453619712
>        }
>      ]
>    }
>
> 3. Create a VM with the PCI device assigned in step 1 and the device-dax
>    device created in step 2.
>    # qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on -smp host \
>                         -m 4G,slots=32,maxmem=128G \
>                         -drive file=VM_DISK_IMG.img,format=raw,if=virtio \
>                         -object memory-backend-file,id=nv_be1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M \
>                         -device nvdimm,id=nv1,memdev=nv_be1 \
>                         -device ioh3420,id=root.0,slot=4 \
>                         -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6
>
>    It then fails with the following QEMU error messages:
>      qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: VFIO_MAP_DMA: -95
>      qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio_dma_map(0x5643804a92c0, 0x140000000, 0xffe00000, 0x7f2ed5200000) = -95 (Operation not supported)
>      qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio error: 0000:03:10.2: failed to setup container for group 52: memory listener initialization failed for container: Operation not supported
>
>    I added the following debug message after the
>    get_user_pages_longterm() call in this patch,
>        if (vmas[0] && vma_is_dax(vmas[0]))
>                printk(KERN_DEBUG "%s: longterm failed for pfn 0x%lx, ret %d\n",
>                       __func__, page_to_pfn(page[0]), ret);
>    and it shows that get_user_pages_longterm() returns -EOPNOTSUPP on
>    the first device-dax page mapping.

Thanks for that thorough debug, I'll take a look today.
