* [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
@ 2017-10-20 2:38 Dan Williams
2017-10-20 2:39 ` [PATCH v3 09/13] IB/core: disable memory registration of filesystem-dax vmas Dan Williams
2017-10-20 7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
0 siblings, 2 replies; 15+ messages in thread
From: Dan Williams @ 2017-10-20 2:38 UTC (permalink / raw)
To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Cc: Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
Dave Chinner, J. Bruce Fields, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Michael Ellerman, Jeff Moyer,
hch-jcswGhMUV9g, Jason Gunthorpe, Doug Ledford, Ross Zwisler,
Hal Rosenstock, Heiko Carstens,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Alex
Changes since v2 [1]:
* Add 'dax: handle truncate of dma-busy pages' which builds on the
removal of page-less dax to fix a latent bug handling dma vs truncate.
* Disable get_user_pages_fast() for dax
* Disable RDMA memory registrations against filesystem-DAX mappings for
non-ODP (On Demand Paging / Shared Virtual Memory) hardware.
* Fix a compile error when building with HMM enabled
---
tl;dr: A brute force approach to ensure that truncate waits for any
in-flight DMA before freeing filesystem-DAX blocks to the filesystem's
block allocator.
While reviewing the MAP_DIRECT proposal Christoph noted:
    get_user_pages on DAX doesn't give the same guarantees as on
    pagecache or anonymous memory, and that is the problem we need to
    fix. In fact I'm pretty sure if we try hard enough (and we might
    have to try very hard) we can see the same problem with plain direct
    I/O and without any RDMA involved, e.g. do a larger direct I/O write
    to memory that is mmap()ed from a DAX file, then truncate the DAX
    file and reallocate the blocks, and we might corrupt that new file.

    We'll probably need a special setup where there is little other
    chance but to reallocate those used blocks.

    So what we need to do first is to fix get_user_pages vs unmapping
    DAX mmap()ed blocks, be that from a hole punch, truncate, COW
    operation, etc.
I was able to trigger the failure with "[PATCH v3 08/13]
tools/testing/nvdimm: add 'bio_delay' mechanism" to keep block i/o pages
busy so a punch-hole operation can truncate the blocks before the DMA
finishes.
The solution presented is not pretty. It creates a stream of leases, one
for each get_user_pages() invocation, and polls page reference counts
until DMA stops. We're missing a reliable way to not only trap the
DMA-idle event, but also block new references being taken on pages while
truncate is allowed to progress. "[PATCH v3 12/13] dax: handle truncate of
dma-busy pages" presents other options considered, and notes that this
solution can only be viewed as a stop-gap.
Given the need to poll page-reference counts this approach builds on the
removal of 'page-less DAX' support. From the last submission Andrew
asked for clarification on the move to now require pages for DAX.
Quoting "[PATCH v3 02/13] dax: require 'struct page' for filesystem
dax":
    Note that when the initial dax support was being merged a few years
    back there was concern that struct page was unsuitable for use with
    next generation persistent memory devices. The theoretical concern
    was that struct page access, being such a hotly used data structure
    in the kernel, would lead to media wear out. While that was a
    reasonable conservative starting position it has not held true in
    practice. We have long since committed to using
    devm_memremap_pages() to support higher order kernel functionality
    that needs get_user_pages() and pfn_to_page().
---
Dan Williams (13):
dax: quiet bdev_dax_supported()
dax: require 'struct page' for filesystem dax
dax: stop using VM_MIXEDMAP for dax
dax: stop using VM_HUGEPAGE for dax
dax: stop requiring a live device for dax_flush()
dax: store pfns in the radix
dax: warn if dma collides with truncate
tools/testing/nvdimm: add 'bio_delay' mechanism
IB/core: disable memory registration of filesystem-dax vmas
mm: disable get_user_pages_fast() for dax
fs: use smp_load_acquire in break_{layout,lease}
dax: handle truncate of dma-busy pages
xfs: wire up FL_ALLOCATED support
arch/powerpc/sysdev/axonram.c | 1
drivers/dax/device.c | 1
drivers/dax/super.c | 18 +-
drivers/infiniband/core/umem.c | 49 ++++-
drivers/s390/block/dcssblk.c | 1
fs/Kconfig | 1
fs/dax.c | 296 ++++++++++++++++++++++++++++-----
fs/ext2/file.c | 1
fs/ext4/file.c | 1
fs/locks.c | 17 ++
fs/xfs/xfs_aops.c | 24 +++
fs/xfs/xfs_file.c | 66 +++++++
fs/xfs/xfs_inode.h | 1
fs/xfs/xfs_ioctl.c | 7 -
include/linux/dax.h | 23 +++
include/linux/fs.h | 32 +++-
include/linux/vma.h | 33 ++++
mm/gup.c | 75 ++++----
mm/huge_memory.c | 8 -
mm/ksm.c | 3
mm/madvise.c | 2
mm/memory.c | 20 ++
mm/migrate.c | 3
mm/mlock.c | 5 -
mm/mmap.c | 8 -
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/iomap.c | 62 +++++++
tools/testing/nvdimm/test/nfit.c | 34 ++++
tools/testing/nvdimm/test/nfit_test.h | 1
29 files changed, 651 insertions(+), 143 deletions(-)
create mode 100644 include/linux/vma.h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [PATCH v3 09/13] IB/core: disable memory registration of filesystem-dax vmas
2017-10-20 2:38 [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Dan Williams
@ 2017-10-20 2:39 ` Dan Williams
2017-10-20 7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
1 sibling, 0 replies; 15+ messages in thread
From: Dan Williams @ 2017-10-20 2:39 UTC (permalink / raw)
To: akpm
Cc: Sean Hefty, linux-xfs, linux-nvdimm, linux-rdma, linux-kernel,
    Jeff Moyer, hch, Jason Gunthorpe, linux-mm, Doug Ledford,
    linux-fsdevel, Ross Zwisler, Hal Rosenstock

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow RDMA to create long-standing memory registrations
against filesystem-dax vmas. Device-dax vmas do not have this problem
and are explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented, and is limited to non-ODP (On Demand
Paging) capable RDMA devices.

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: <linux-rdma@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c | 49 +++++++++++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..c30d286c1f24 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -147,19 +147,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		umem->hugetlb = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to protect against
+	 * registering filesystem-dax memory. Otherwise we can tolerate
+	 * a failure to allocate the vma_list and just assume that all
+	 * vmas are not hugetlb-vmas.
	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_FS_DAX))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +201,34 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			struct vm_area_struct *vma;
+			struct inode *inode;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			/* device-dax is safe for rdma... */
+			inode = file_inode(vma->vm_file);
+			if (inode->i_mode == S_IFCHR)
+				continue;
+
+			/* ...filesystem-dax is not. */
+			ret = -EOPNOTSUPP;
+			goto out;
 		}
 
 		/* preparing for next loop */
@@ -242,6 +263,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
2017-10-20 2:38 [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Dan Williams
2017-10-20 2:39 ` [PATCH v3 09/13] IB/core: disable memory registration of filesystem-dax vmas Dan Williams
@ 2017-10-20 7:47 ` Christoph Hellwig
2017-10-20 9:31 ` Christoph Hellwig
1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2017-10-20 7:47 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
    Dave Chinner, J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty,
    Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman, Jeff Moyer,
    hch, Jason Gunthorpe, Doug Ledford, Ross Zwisler, Hal Rosenstock,
    Heiko Carstens, linux-nvdimm, Alexander Viro, Gerald Schaefer, Darri

> The solution presented is not pretty. It creates a stream of leases, one
> for each get_user_pages() invocation, and polls page reference counts
> until DMA stops. We're missing a reliable way to not only trap the
> DMA-idle event, but also block new references being taken on pages while
> truncate is allowed to progress. "[PATCH v3 12/13] dax: handle truncate of
> dma-busy pages" presents other options considered, and notes that this
> solution can only be viewed as a stop-gap.

I'd like to brainstorm how we can do something better.

How about:

If we hit a page with an elevated refcount in truncate / hole punch
etc for a DAX file system we do not free the blocks in the file system,
but add them to the extent busy list. We mark the page as delayed
free (e.g. a page flag?) so that when it finally hits refcount zero we
call back into the file system to remove it from the busy list.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
2017-10-20 7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
@ 2017-10-20 9:31 ` Christoph Hellwig
2017-10-26 10:58 ` Jan Kara
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2017-10-20 9:31 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
    Dave Chinner, J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty,
    Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman, Jeff Moyer,
    hch, Jason Gunthorpe, Doug Ledford, Ross Zwisler, Hal Rosenstock,
    Heiko Carstens, linux-nvdimm, Alexander Viro, Gerald Schaefer, Darri

On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> I'd like to brainstorm how we can do something better.
>
> How about:
>
> If we hit a page with an elevated refcount in truncate / hole punch
> etc for a DAX file system we do not free the blocks in the file system,
> but add them to the extent busy list. We mark the page as delayed
> free (e.g. a page flag?) so that when it finally hits refcount zero we
> call back into the file system to remove it from the busy list.

Brainstorming some more:

Given that on a DAX file there shouldn't be any long-term page
references after we unmap it from the page table and don't allow
get_user_pages calls, why not wait for the references on all DAX pages
to go away first? E.g. if we find a DAX page in
truncate_inode_pages_range that has an elevated refcount, we set a new
flag to prevent new references from showing up, and then simply wait
for it to go away. Instead of a busy wait we can do this through a few
hashed waitqueues in dev_pagemap. And in fact put_zone_device_page
already gets called when putting the last page, so we can handle the
wakeup from there.

In fact, if we can't find a page flag for the "stop new callers" thing
we could probably come up with a way to do that through dev_pagemap
somehow, but I'm not sure how efficient that would be.
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
2017-10-20 9:31 ` Christoph Hellwig
@ 2017-10-26 10:58 ` Jan Kara
2017-10-26 23:51 ` Williams, Dan J
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Jan Kara @ 2017-10-26 10:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dan Williams, akpm, Michal Hocko, Jan Kara, Benjamin Herrenschmidt,
    Dave Hansen, Dave Chinner, J. Bruce Fields, linux-mm, Paul Mackerras,
    Sean Hefty, Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman,
    Jeff Moyer, Jason Gunthorpe, Doug Ledford, Ross Zwisler,
    Hal Rosenstock, Heiko Carstens, linux-nvdimm, Alexander Viro,
    Gerald Schaefer <gerald.>

On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > I'd like to brainstorm how we can do something better.
> >
> > How about:
> >
> > If we hit a page with an elevated refcount in truncate / hole punch
> > etc for a DAX file system we do not free the blocks in the file system,
> > but add them to the extent busy list. We mark the page as delayed
> > free (e.g. a page flag?) so that when it finally hits refcount zero we
> > call back into the file system to remove it from the busy list.
>
> Brainstorming some more:
>
> Given that on a DAX file there shouldn't be any long-term page
> references after we unmap it from the page table and don't allow
> get_user_pages calls, why not wait for the references on all DAX pages
> to go away first? E.g. if we find a DAX page in
> truncate_inode_pages_range that has an elevated refcount, we set a new
> flag to prevent new references from showing up, and then simply wait
> for it to go away. Instead of a busy wait we can do this through a few
> hashed waitqueues in dev_pagemap. And in fact put_zone_device_page
> already gets called when putting the last page, so we can handle the
> wakeup from there.
>
> In fact, if we can't find a page flag for the "stop new callers" thing
> we could probably come up with a way to do that through dev_pagemap
> somehow, but I'm not sure how efficient that would be.

We were talking about this yesterday with Dan, so some more brainstorming
from us. We can implement the solution with an extent busy list in ext4
relatively easily - we already have such a list currently, similarly to
XFS. There would be some modifications needed but nothing too complex. The
biggest downside of this solution I see is that it requires a
per-filesystem solution for busy extents - ext4 and XFS are reasonably
fine, however btrfs may have problems and ext2 definitely will need some
modifications. Invisible used blocks may be surprising to users at times,
although given that page refs should be relatively short term, that should
not be a big issue. But are we guaranteed page refs are short term? E.g.
if someone creates a v4l2 videobuf in a MAP_SHARED mapping of a file on a
DAX filesystem, page refs can be rather long-term, similarly as in the
RDMA case. Also, freeing of blocks on page reference drop is another async
entry point into the filesystem which could unpleasantly surprise us, but
I guess workqueues would solve that reasonably fine.

WRT waiting for page refs to be dropped before proceeding with truncate
(or punch hole for that matter - that case is even nastier since we don't
have i_size to guard us): what I like about this solution is that it is
very visible there's something unusual going on with the file being
truncated / punched, and so problems are easier to diagnose / fix from the
admin side. So far we have guarded hole punching from concurrent faults
(and get_user_pages() does fault once you do unmap_mapping_range()) with
I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
obvious case Dan came up with is when GUP obtains a ref to page A, then
hole punch comes grabbing I_MMAP_LOCK and waiting for the page ref on A to
be dropped, and then GUP blocks on trying to fault in another page.

I think we cannot easily prevent new page references from being grabbed as
you write above, since nobody expects stuff like get_page() to fail. But I
think that unmapping the relevant pages and then preventing them from
being faulted in again is workable and stops GUP as well. The problem with
that is though what to do with page faults to such pages - you cannot just
fail them for hole punch, and you cannot easily allocate new blocks
either. So we are back at a situation where we need to detach blocks from
the inode and then wait for page refs to be dropped - so some form of busy
extents. Am I missing something?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
2017-10-26 10:58 ` Jan Kara
@ 2017-10-26 23:51 ` Williams, Dan J
2017-10-27 6:48 ` Dave Chinner
1 sibling, 1 reply; 15+ messages in thread
From: Williams, Dan J @ 2017-10-26 23:51 UTC (permalink / raw)
To: hch@lst.de, jack@suse.cz
Cc: schwidefsky@de.ibm.com, darrick.wong@oracle.com, dledford@redhat.com,
    linux-rdma@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    bfields@fieldses.org, linux-mm@kvack.org, heiko.carstens@de.ibm.com,
    dave.hansen@linux.intel.com, linux-xfs@vger.kernel.org,
    linux-kernel@vger.kernel.org, jmoyer@redhat.com,
    viro@zeniv.linux.org.uk, kirill.shutemov@linux.intel.com, akpm

On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > I'd like to brainstorm how we can do something better.
> > >
> > > How about:
> > >
> > > If we hit a page with an elevated refcount in truncate / hole punch
> > > etc for a DAX file system we do not free the blocks in the file system,
> > > but add them to the extent busy list. We mark the page as delayed
> > > free (e.g. a page flag?) so that when it finally hits refcount zero we
> > > call back into the file system to remove it from the busy list.
> >
> > Brainstorming some more:
> >
> > Given that on a DAX file there shouldn't be any long-term page
> > references after we unmap it from the page table and don't allow
> > get_user_pages calls, why not wait for the references on all DAX pages
> > to go away first? E.g. if we find a DAX page in
> > truncate_inode_pages_range that has an elevated refcount, we set a new
> > flag to prevent new references from showing up, and then simply wait
> > for it to go away. Instead of a busy wait we can do this through a few
> > hashed waitqueues in dev_pagemap. And in fact put_zone_device_page
> > already gets called when putting the last page, so we can handle the
> > wakeup from there.
> >
> > In fact, if we can't find a page flag for the "stop new callers" thing
> > we could probably come up with a way to do that through dev_pagemap
> > somehow, but I'm not sure how efficient that would be.
>
> We were talking about this yesterday with Dan, so some more brainstorming
> from us. We can implement the solution with an extent busy list in ext4
> relatively easily - we already have such a list currently, similarly to
> XFS. There would be some modifications needed but nothing too complex.
> The biggest downside of this solution I see is that it requires a
> per-filesystem solution for busy extents - ext4 and XFS are reasonably
> fine, however btrfs may have problems and ext2 definitely will need some
> modifications. Invisible used blocks may be surprising to users at times,
> although given that page refs should be relatively short term, that
> should not be a big issue. But are we guaranteed page refs are short
> term? E.g. if someone creates a v4l2 videobuf in a MAP_SHARED mapping of
> a file on a DAX filesystem, page refs can be rather long-term, similarly
> as in the RDMA case. Also, freeing of blocks on page reference drop is
> another async entry point into the filesystem which could unpleasantly
> surprise us, but I guess workqueues would solve that reasonably fine.
>
> WRT waiting for page refs to be dropped before proceeding with truncate
> (or punch hole for that matter - that case is even nastier since we
> don't have i_size to guard us): what I like about this solution is that
> it is very visible there's something unusual going on with the file
> being truncated / punched, and so problems are easier to diagnose / fix
> from the admin side. So far we have guarded hole punching from
> concurrent faults (and get_user_pages() does fault once you do
> unmap_mapping_range()) with I_MMAP_LOCK (or its equivalent in ext4). We
> cannot easily wait for page refs to be dropped under I_MMAP_LOCK as that
> could deadlock - the most obvious case Dan came up with is when GUP
> obtains a ref to page A, then hole punch comes grabbing I_MMAP_LOCK and
> waiting for the page ref on A to be dropped, and then GUP blocks on
> trying to fault in another page.
>
> I think we cannot easily prevent new page references from being grabbed
> as you write above, since nobody expects stuff like get_page() to fail.
> But I think that unmapping the relevant pages and then preventing them
> from being faulted in again is workable and stops GUP as well. The
> problem with that is though what to do with page faults to such pages -
> you cannot just fail them for hole punch, and you cannot easily allocate
> new blocks either. So we are back at a situation where we need to detach
> blocks from the inode and then wait for page refs to be dropped - so
> some form of busy extents. Am I missing something?

No, that's a good summary of what we talked about. However, I did go
back and give the new lock approach a try and was able to get my test
to pass. The new locking is not pretty, especially since you need to
drop and reacquire the lock so that get_user_pages() can finish
grabbing all the pages it needs. Here are the two primary patches in
the series, do you think the extent-busy approach would be cleaner?

---

commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Wed Oct 25 05:14:54 2017 -0700

    mm, dax: handle truncate of dma-busy pages

    get_user_pages() pins file backed memory pages for access by dma
    devices. However, it only pins the memory pages not the page-to-file
    offset association. If a file is truncated the pages are mapped out
    of the file and dma may continue indefinitely into a page that is
    owned by a device driver. This breaks coherency of the file vs dma,
    but the assumption is that if userspace wants the file-space
    truncated it does not matter what data is inbound from the device,
    it is not relevant anymore.

    The assumptions of the truncate-page-cache model are broken by DAX
    where the target DMA page *is* the filesystem block. Leaving the
    page pinned for DMA, but truncating the file block out of the file,
    means that the filesystem is free to reallocate a block under active
    DMA to another file!

    Here are some possible options for fixing this situation ('truncate'
    and 'fallocate(punch hole)' are synonymous below):

    1/ Fail truncate while any file blocks might be under dma

    2/ Block (sleep-wait) truncate while any file blocks might be under
       dma

    3/ Remap file blocks to a "lost+found"-like file-inode where dma can
       continue and we might see what inbound data from DMA was mapped
       out of the original file. Blocks in this file could be freed back
       to the filesystem when dma eventually ends.

    4/ List the blocks under DMA in the extent busy list and either hold
       off commit of the truncate transaction until commit, or otherwise
       keep the blocks marked busy so the allocator does not reuse them
       until DMA completes.

    5/ Disable dax until option 3 or another long term solution has been
       implemented. However, filesystem-dax is still marked experimental
       for concerns like this.

    Option 1 will throw failures where userspace has never expected them
    before, option 2 might hang the truncating process indefinitely, and
    option 3 requires per filesystem enabling to remap blocks from one
    inode to another. Option 2 is implemented in this patch for the DAX
    path with the expectation that non-transient users of
    get_user_pages() (RDMA) are disallowed from setting up dax mappings
    and that the potential delay introduced to the truncate path is
    acceptable compared to the response time of the page cache case.

    This can only be seen as a stop-gap until we can solve the problem
    of safely sequestering unallocated filesystem blocks under active
    dma.

    The solution introduces a new inode semaphore that is held
    exclusively for get_user_pages() and held for read at truncate while
    sleep-waiting on a hashed waitqueue.

    Credit for option 3 goes to Dave Hansen, who proposed something
    similar as an alternative way to solve the problem that MAP_DIRECT
    was trying to solve. Credit for option 4 goes to Christoph Hellwig.

    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4ac359e14777..a5a4b95ffdaf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
+	wake_up_devmap_idle(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index fd5d385988d1..f2c98f9cb833 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -346,6 +346,19 @@ static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
 	}
 }
 
+static struct page *dma_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -487,6 +500,97 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static int wait_page(atomic_t *_refcount)
+{
+	struct page *page = container_of(_refcount, struct page, _refcount);
+	struct inode *inode = page->inode;
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	i_daxdma_unlock_shared(inode);
+	schedule();
+	i_daxdma_lock_shared(inode);
+
+	/*
+	 * if we bounced the daxdma_lock then we need to rescan the
+	 * truncate area.
+	 */
+	return 1;
+}
+
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start, end, index;
+	struct pagevec pvec;
+	unsigned i;
+
+	lockdep_assert_held(&inode->i_dax_dmasem);
+
+	if (lstart < 0 || len < -1)
+		return;
+
+	/* in the limited case get_user_pages for dax is disabled */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	if (!dax_mapping(mapping))
+		return;
+
+	if (mapping->nrexceptional == 0)
+		return;
+
+	if (len == -1)
+		end = -1;
+	else
+		end = (lstart + len) >> PAGE_SHIFT;
+	start = lstart >> PAGE_SHIFT;
+
+retry:
+	pagevec_init(&pvec, 0);
+	index = start;
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			struct page *page = NULL;
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dma_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			if (page && wait_on_devmap_idle(&page->_refcount,
+					wait_page,
+					TASK_UNINTERRUPTIBLE) != 0) {
+				/*
+				 * We dropped the dma lock, so we need
+				 * to revalidate that previously seen
+				 * idle pages are still idle.
+				 */
+				goto retry;
+			}
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+	}
+}
+EXPORT_SYMBOL_GPL(dax_wait_dma);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -509,8 +613,10 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 out:
 	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/fs/inode.c b/fs/inode.c
index d1e35b53bb23..95408e87a96c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -192,6 +192,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 	inode->i_flctx = NULL;
+	i_daxdma_init(inode);
 	this_cpu_inc(nr_inodes);
 
 	return 0;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ea21ebfd1889..6ce1c50519e7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len);
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
+static inline void dax_wait_dma(struct address_space *mapping, loff_t lstart,
+		loff_t len)
+{
+}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..cd5b4a092d1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+#ifdef CONFIG_FS_DAX
+	struct rw_semaphore	i_dax_dmasem;
+#endif
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -747,6 +750,59 @@ static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
 	down_write_nested(&inode->i_rwsem, subclass);
 }
 
+#ifdef CONFIG_FS_DAX
+static inline void i_daxdma_init(struct inode *inode)
+{
+	init_rwsem(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+	down_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+	up_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+	/*
+	 * The write lock is taken under mmap_sem in the
+	 * get_user_pages() path; the read lock nests in the truncate
+	 * path.
+	 */
+#define DAXDMA_TRUNCATE_CLASS 1
+	down_read_nested(&inode->i_dax_dmasem, DAXDMA_TRUNCATE_CLASS);
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+	up_read(&inode->i_dax_dmasem);
+}
+#else /* CONFIG_FS_DAX */
+static inline void i_daxdma_init(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+}
+#endif /* CONFIG_FS_DAX */
+
 void lock_two_nondirectories(struct inode *, struct inode*);
 void unlock_two_nondirectories(struct inode *, struct inode*);
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 12b26660d7e9..6186ecdb9df7 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -30,10 +30,12 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+void wake_up_devmap_idle(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
+int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
@@ -258,4 +260,12 @@ int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_devmap_idle(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_devmap_idle(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index f8159698aa4d..6ea93149614a 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,11 +162,17 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-		void *arg)
+static inline struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
@@ -176,14 +182,29 @@ static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mo
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_devmap_idle_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 1)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes. Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		int (*action)(atomic_t *), unsigned mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		int (*action)(atomic_t *), unsigned mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +212,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,16 +231,37 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	},								\
 }
 
+#define DEFINE_WAIT_DEVMAP_IDLE(name, p)				\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_devmap_idle_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *), unsigned mode)
 {
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
+__sched int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*action)(atomic_t *),
+		unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_DEVMAP_IDLE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_devmap_idle);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
@@ -235,6 +277,12 @@ void wake_up_atomic_t(atomic_t *p)
 }
 EXPORT_SYMBOL(wake_up_atomic_t);
 
+void wake_up_devmap_idle(atomic_t *p)
+{
+	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+}
+EXPORT_SYMBOL(wake_up_devmap_idle);
+
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
 	schedule();
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..fd7b2a2e2d19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -579,6 +579,41 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+static struct inode *do_dax_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct file *file;
+	struct inode *inode;
+
+	if (!(foll_flags & FOLL_GET))
+		return NULL;
+	if (!vma_is_dax(vma))
+		return NULL;
+	file = vma->vm_file;
+	inode = file_inode(file);
+	if (inode->i_mode == S_IFCHR)
+		return NULL;
+	return inode;
+}
+
+static struct inode *dax_truncate_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct inode *inode = do_dax_lock(vma, foll_flags);
+
+	if (!inode)
+		return NULL;
+	i_daxdma_lock(inode);
+	return inode;
+}
+
+static void dax_truncate_unlock(struct inode *inode)
+{
+	if (!inode)
+		return;
+	i_daxdma_unlock(inode);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -659,6 +694,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	do {
 		struct page *page;
+		struct inode *inode;
 		unsigned int foll_flags = gup_flags;
 		unsigned int page_increm;
@@ -693,7 +729,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (unlikely(fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 			cond_resched();
+			inode = dax_truncate_lock(vma, foll_flags);
 			page = follow_page_mask(vma, start, foll_flags, &page_mask);
+			dax_truncate_unlock(inode);
 			if (!page) {
 				int ret;
 				ret = faultin_page(tsk, vma, start, &foll_flags,

commit 67d952314e9989b3b1945c50488f4a0f760264c3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Tue Oct 24 13:41:22 2017 -0700

    xfs: wire up dax dma waiting

    The dax-dma vs truncate collision avoidance involves acquiring the new
    i_dax_dmasem and validating that none of the ranges that are to be
    mapped out of the file are active for DMA. If any are found, we wait
    for the page to go idle and retry the scan.

    The locations where we implement this wait line up with where we
    currently wait for pnfs layout leases to expire. Since we need both
    DMA to be idle and leases to be broken, and since xfs_break_layouts()
    drops locks, we need to retry the dma-busy scan until we can complete
    one that finds no busy pages.

    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..e3ec46c28c60 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,7 +347,7 @@ xfs_file_aio_write_checks(
 		return error;
 
 	error = xfs_break_layouts(inode, iolock);
-	if (error)
+	if (error < 0)
 		return error;
 
 	/*
@@ -762,7 +762,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -771,10 +771,20 @@ xfs_file_fallocate(
 	if (mode & ~XFS_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
 
+retry:
 	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, offset, len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
 	error = xfs_break_layouts(inode, &iolock);
-	if (error)
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
+	}
 
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	iolock |= XFS_MMAPLOCK_EXCL;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4ec5b7f45401..783f15894b7b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -171,7 +171,14 @@ xfs_ilock_attr_map_shared(
  * taken in places where we need to invalidate the page cache in a race
  * free manner (e.g. truncate, hole punch and other extent manipulation
  * functions).
- */
+ *
+ * The XFS_DAXDMA_LOCK_SHARED lock is a CONFIG_FS_DAX special case lock
+ * for synchronizing truncate vs ongoing DMA. The get_user_pages() path
+ * will hold this lock exclusively when incrementing page reference
+ * counts for DMA. Before an extent can be truncated we need to complete
+ * a validate-idle sweep of all pages in the range while holding this
+ * lock in shared mode.
+ */
 void
 xfs_ilock(
 	xfs_inode_t		*ip,
@@ -192,6 +199,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_lock_shared(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -328,6 +338,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_unlock_shared(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..0662edf00529 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -283,10 +283,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAXDMA_LOCK_SHARED	(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAXDMA_LOCK_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -294,7 +296,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAXDMA_LOCK_SHARED, "XFS_DAXDMA_LOCK_SHARED" }
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..fd384ea00ede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	int			error;
 
 	/*
@@ -637,18 +637,6 @@ xfs_ioc_space(
 	if (filp->f_mode & FMODE_NOCMTIME)
 		flags |= XFS_PREALLOC_INVISIBLE;
 
-	error = mnt_want_write_file(filp);
-	if (error)
-		return error;
-
-	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
-	if (error)
-		goto out_unlock;
-
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -659,10 +647,31 @@ xfs_ioc_space(
 		bf->l_start += XFS_ISIZE(ip);
 		break;
 	default:
-		error = -EINVAL;
+		return -EINVAL;
+	}
+
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+
+retry:
+	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, bf->l_start, bf->l_len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
+	error = xfs_break_layouts(inode, &iolock);
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
 	}
 
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	iolock |= XFS_MMAPLOCK_EXCL;
+
 	/*
 	 * length of <= 0 for resv/unresv/zero is invalid.  length for
 	 * alloc/free is ignored completely and we have no idea what userspace
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..5f4d46b3cd7f 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -35,18 +35,19 @@ xfs_break_layouts(
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
+	int			error, did_unlock = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		did_unlock = 1;
 		error = break_layout(inode, true);
 		*iolock = XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 
-	return error;
+	return error < 0 ? error : did_unlock;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 15+ messages in thread
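[Editorial note] The retry protocol in the patch above (scan the range for DMA-busy pages, drop the lock and wait when one is found, then rescan from the start because the lock was bounced) can be illustrated with a small userspace model. This is a sketch, not kernel code: `fake_page`, `complete_some_dma()`, and `wait_for_idle()` are hypothetical stand-ins for `struct page`, the driver dropping its references, and `dax_wait_dma()` respectively, and "refcount > 1" models "page busy for DMA".

```c
#include <stddef.h>

/* Hypothetical userspace model: a page is "DMA-busy" while its refcount
 * is above 1, matching the patch's idle condition of refcount == 1. */
struct fake_page {
	int refcount;
};

/* Stand-in for the driver side: each call retires one DMA reference. */
static void complete_some_dma(struct fake_page *p)
{
	if (p->refcount > 1)
		p->refcount--;
}

/* Model of dax_wait_dma(): whenever we "wait" we have dropped the lock,
 * so previously-seen idle pages must be revalidated by rescanning the
 * whole range. Returns the number of wait/rescan iterations taken. */
static int wait_for_idle(struct fake_page *pages, size_t n)
{
	int rescans = 0;
	size_t i;

retry:
	for (i = 0; i < n; i++) {
		if (pages[i].refcount > 1) {
			/* kernel: unlock i_dax_dmasem + schedule() */
			complete_some_dma(&pages[i]);
			rescans++;
			goto retry;	/* lock was dropped: rescan */
		}
	}
	return rescans;
}
```

With pages at refcounts {1, 3, 2}, the scan bounces once per outstanding reference (three rescans) and terminates only when every page is idle, mirroring why the kernel version must loop rather than make a single pass.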
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 23:51     ` Williams, Dan J
@ 2017-10-27  6:48       ` Dave Chinner
  2017-10-27 11:42         ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-10-27 6:48 UTC (permalink / raw)
To: Williams, Dan J
Cc: hch@lst.de, jack@suse.cz, schwidefsky@de.ibm.com,
	darrick.wong@oracle.com, dledford@redhat.com,
	linux-rdma@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	bfields@fieldses.org, linux-mm@kvack.org, heiko.carstens@de.ibm.com,
	dave.hansen@linux.intel.com, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, jmoyer@redhat.com,
	viro@zeniv.linux.org.uk, kirill.shutemov@linux.intel.com,
	akpm@linux-foundation.org

On Thu, Oct 26, 2017 at 11:51:04PM +0000, Williams, Dan J wrote:
> On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > > I'd like to brainstorm how we can do something better.
> > > >
> > > > How about:
> > > >
> > > > If we hit a page with an elevated refcount in truncate / hole punch
> > > > etc. for a DAX file system we do not free the blocks in the file
> > > > system, but add it to the extent busy list. We mark the page as
> > > > delayed free (e.g. page flag?) so that when it finally hits refcount
> > > > zero we call back into the file system to remove it from the busy
> > > > list.
> > >
> > > Brainstorming some more:
> > >
> > > Given that on a DAX file there shouldn't be any long-term page
> > > references after we unmap it from the page table and don't allow
> > > get_user_pages calls why not wait for the references for all
> > > DAX pages to go away first? E.g. if we find a DAX page in
> > > truncate_inode_pages_range that has an elevated refcount we set
> > > a new flag to prevent new references from showing up, and then
> > > simply wait for it to go away. Instead of a busy wait we can
> > > do this through a few hashed waitqueues in dev_pagemap. And in
> > > fact put_zone_device_page already gets called when putting the
> > > last page so we can handle the wakeup from there.
> > >
> > > In fact if we can't find a page flag for the "stop new callers"
> > > thing we could probably come up with a way to do that through
> > > dev_pagemap somehow, but I'm not sure how efficient that would
> > > be.
> >
> > We were talking about this yesterday with Dan so some more brainstorming
> > from us. We can implement the solution with extent busy list in ext4
> > relatively easily - we already have such list currently similarly to XFS.
> > There would be some modifications needed but nothing too complex. The
> > biggest downside of this solution I see is that it requires per-filesystem
> > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> > may have problems and ext2 definitely will need some modifications.
> > Invisible used blocks may be surprising to users at times although given
> > page refs should be relatively short term, that should not be a big issue.
> > But are we guaranteed page refs are short term? E.g. if someone creates
> > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> > can be rather long-term similarly as in RDMA case. Also freeing of blocks
> > on page reference drop is another async entry point into the filesystem
> > which could unpleasantly surprise us but I guess workqueues would solve
> > that reasonably fine.
> >
> > WRT waiting for page refs to be dropped before proceeding with truncate (or
> > punch hole for that matter - that case is even nastier since we don't have
> > i_size to guard us). What I like about this solution is that it is very
> > visible there's something unusual going on with the file being truncated /
> > punched and so problems are easier to diagnose / fix from the admin side.
> > So far we have guarded hole punching from concurrent faults (and
> > get_user_pages() does fault once you do unmap_mapping_range()) with
> > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> > obvious case Dan came up with is when GUP obtains ref to page A, then hole
> > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> > dropped, and then GUP blocks on trying to fault in another page.
> >
> > I think we cannot easily prevent new page references to be grabbed as you
> > write above since nobody expects stuff like get_page() to fail. But I
> > think that unmapping relevant pages and then preventing them to be faulted
> > in again is workable and stops GUP as well. The problem with that is though
> > what to do with page faults to such pages - you cannot just fail them for
> > hole punch, and you cannot easily allocate new blocks either. So we are
> > back at a situation where we need to detach blocks from the inode and then
> > wait for page refs to be dropped - so some form of busy extents. Am I
> > missing something?
>
> No, that's a good summary of what we talked about. However, I did go
> back and give the new lock approach a try and was able to get my test
> to pass. The new locking is not pretty especially since you need to
> drop and reacquire the lock so that get_user_pages() can finish
> grabbing all the pages it needs. Here are the two primary patches in
> the series, do you think the extent-busy approach would be cleaner?

The XFS_DAXDMA....

$DEITY that patch is so ugly I can't even bring myself to type it.

-Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-27  6:48       ` Dave Chinner
@ 2017-10-27 11:42         ` Dan Williams
  2017-10-29 21:52           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2017-10-27 11:42 UTC (permalink / raw)
To: Dave Chinner
Cc: mhocko@suse.com, jack@suse.cz, benh@kernel.crashing.org,
	dave.hansen@linux.intel.com, heiko.carstens@de.ibm.com,
	bfields@fieldses.org, linux-mm@kvack.org, paulus@samba.org,
	Hefty, Sean, jlayton@poochiereds.net, mawilcox@microsoft.com,
	linux-rdma@vger.kernel.org, mpe@ellerman.id.au,
	dledford@redhat.com, hch@lst.de, jgunthorpe@obsidianresearch.com,
	hal.rosenstock@gmail.com, schwidefsky@de.ibm.com, viro

[replying from my phone, please forgive formatting]

On Friday, October 27, 2017, Dave Chinner <david@fromorbit.com> wrote:
> > Here are the two primary patches in
> > the series, do you think the extent-busy approach would be cleaner?
>
> The XFS_DAXDMA....
>
> $DEITY that patch is so ugly I can't even bring myself to type it.

Right, and so is the problem it's trying to solve. So where do you want
to go from here?

I could go back to the FL_ALLOCATED approach, but use page idle
callbacks instead of polling for the lease-end notification. Or do we
want to try busy extents? My concern with busy extents is that it
requires more per-fs code.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-27 11:42         ` Dan Williams
@ 2017-10-29 21:52           ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2017-10-29 21:52 UTC (permalink / raw)
To: Dan Williams
Cc: mhocko@suse.com, jack@suse.cz, benh@kernel.crashing.org,
	dave.hansen@linux.intel.com, heiko.carstens@de.ibm.com,
	bfields@fieldses.org, linux-mm@kvack.org, paulus@samba.org,
	Hefty, Sean, jlayton@poochiereds.net, mawilcox@microsoft.com,
	linux-rdma@vger.kernel.org, mpe@ellerman.id.au,
	dledford@redhat.com, hch@lst.de, jgunthorpe@obsidianresearch.com,
	hal.rosenstock@gmail.com, schwidefsky@de.ibm.com, viro

On Fri, Oct 27, 2017 at 01:42:16PM +0200, Dan Williams wrote:
> [replying from my phone, please forgive formatting]
>
> On Friday, October 27, 2017, Dave Chinner <david@fromorbit.com> wrote:
>
> > > Here are the two primary patches in
> > > the series, do you think the extent-busy approach would be cleaner?
> >
> > The XFS_DAXDMA....
> >
> > $DEITY that patch is so ugly I can't even bring myself to type it.
>
> Right, and so is the problem it's trying to solve. So where do you want to
> go from here?
>
> I could go back to the FL_ALLOCATED approach, but use page idle callbacks
> instead of polling for the lease end notification. Or do we want to try
> busy extents? My concern with busy extents is that it requires more per-fs
> code.

I don't care if it takes more per-fs code to solve the problem -
dumping butt-ugly, nasty locking crap into filesystems that filesystem
developers are completely unable to test is about the worst possible
solution you can come up with.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 10:58   ` Jan Kara
  2017-10-26 23:51     ` Williams, Dan J
@ 2017-10-27  6:45     ` Christoph Hellwig
  2017-10-29 23:46     ` Dan Williams
  2 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-10-27 6:45 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Dan Williams, akpm, Michal Hocko,
	Benjamin Herrenschmidt, Dave Hansen, Dave Chinner,
	J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty,
	Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman,
	Jeff Moyer, Jason Gunthorpe, Doug Ledford, Ross Zwisler,
	Hal Rosenstock, Heiko

On Thu, Oct 26, 2017 at 12:58:50PM +0200, Jan Kara wrote:
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.

The point is that we need to prohibit long-term elevated page counts
with DAX anyway - we can't just let people grab allocated blocks forever
while ignoring file system operations. For stage 1 we'll just need to
fail those, and in the long run they will have to use a mechanism
similar to FL_LAYOUT locks to deal with file system allocation changes.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 10:58   ` Jan Kara
  2017-10-26 23:51     ` Williams, Dan J
  2017-10-27  6:45     ` Christoph Hellwig
@ 2017-10-29 23:46     ` Dan Williams
  2017-10-30  2:00       ` Dave Chinner
  2 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2017-10-29 23:46 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Michal Hocko, Benjamin Herrenschmidt,
	Dave Hansen, Heiko Carstens, J. Bruce Fields, linux-mm,
	Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox,
	linux-rdma, Michael Ellerman, Jason Gunthorpe, Doug Ledford,
	Hal Rosenstock, Dave Chinner, linux-fsdevel, Alexander Viro,
	Gerald Schaefer, linux-nvdimm@lists.01.org, Linux

On Thu, Oct 26, 2017 at 3:58 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
>> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
>> > I'd like to brainstorm how we can do something better.
>> >
>> > How about:
>> >
>> > If we hit a page with an elevated refcount in truncate / hole punch
>> > etc. for a DAX file system we do not free the blocks in the file
>> > system, but add it to the extent busy list. We mark the page as
>> > delayed free (e.g. page flag?) so that when it finally hits refcount
>> > zero we call back into the file system to remove it from the busy
>> > list.
>>
>> Brainstorming some more:
>>
>> Given that on a DAX file there shouldn't be any long-term page
>> references after we unmap it from the page table and don't allow
>> get_user_pages calls why not wait for the references for all
>> DAX pages to go away first? E.g. if we find a DAX page in
>> truncate_inode_pages_range that has an elevated refcount we set
>> a new flag to prevent new references from showing up, and then
>> simply wait for it to go away. Instead of a busy wait we can
>> do this through a few hashed waitqueues in dev_pagemap. And in
>> fact put_zone_device_page already gets called when putting the
>> last page so we can handle the wakeup from there.
>>
>> In fact if we can't find a page flag for the "stop new callers"
>> thing we could probably come up with a way to do that through
>> dev_pagemap somehow, but I'm not sure how efficient that would
>> be.
>
> We were talking about this yesterday with Dan so some more brainstorming
> from us. We can implement the solution with extent busy list in ext4
> relatively easily - we already have such list currently similarly to XFS.
> There would be some modifications needed but nothing too complex. The
> biggest downside of this solution I see is that it requires per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> may have problems and ext2 definitely will need some modifications.
> Invisible used blocks may be surprising to users at times although given
> page refs should be relatively short term, that should not be a big issue.
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.
>
> WRT waiting for page refs to be dropped before proceeding with truncate (or
> punch hole for that matter - that case is even nastier since we don't have
> i_size to guard us). What I like about this solution is that it is very
> visible there's something unusual going on with the file being truncated /
> punched and so problems are easier to diagnose / fix from the admin side.
> So far we have guarded hole punching from concurrent faults (and
> get_user_pages() does fault once you do unmap_mapping_range()) with
> I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> obvious case Dan came up with is when GUP obtains ref to page A, then hole
> punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> dropped, and then GUP blocks on trying to fault in another page.
>
> I think we cannot easily prevent new page references to be grabbed as you
> write above since nobody expects stuff like get_page() to fail. But I
> think that unmapping relevant pages and then preventing them to be faulted
> in again is workable and stops GUP as well. The problem with that is though
> what to do with page faults to such pages - you cannot just fail them for
> hole punch, and you cannot easily allocate new blocks either. So we are
> back at a situation where we need to detach blocks from the inode and then
> wait for page refs to be dropped - so some form of busy extents. Am I
> missing something?

Coming back to this since Dave has made clear that new locking to
coordinate get_user_pages() is a no-go.

We can unmap to force new get_user_pages() attempts to block on the
per-fs mmap lock, but if punch-hole finds any elevated pages it needs to
drop the mmap lock and wait. We need this lock dropped to get around the
problem that the driver will not start to drop page references until it
has elevated the page references on all the pages in the I/O. If we need
to drop the mmap lock that makes it impossible to coordinate this
unlock/retry loop within truncate_inode_pages_range, which would
otherwise be the natural place to land this code.

Would it be palatable to unmap and drain dma in any path that needs to
detach blocks from an inode? Something like the following, which builds
on what dax_wait_dma() tried to achieve, but does not introduce a new
lock for the fs to manage:

	retry:
		per_fs_mmap_lock(inode);
		unmap_mapping_range(mapping, start, end);
		/* new page references cannot be established */
		if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
			per_fs_mmap_unlock(inode);
			/* new page references can happen, so we need to start over */
			wait_for_page_idle(dax_page);
			goto retry;
		}
		truncate_inode_pages_range(mapping, start, end);
		per_fs_mmap_unlock(inode);

Given how far away taking the mmap lock occurs from where we know we are
actually performing a punch-hole operation, this may lead to unnecessary
unmapping and dma flushing.

As far as I can see, the extent-busy mechanism does not simplify the
solution. If we have code to wait for the pages to go idle, we might as
well have truncate / punch-hole wait on that event in the first
instance.

^ permalink raw reply	[flat|nested] 15+ messages in thread
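[Editorial note] The key constraint in the retry loop sketched above is lock ordering: the per-fs mmap lock must be dropped before waiting for a busy page, because the holder of the page references may itself need to fault (and take that lock) before it can release them. A minimal userspace model of just that discipline, with hypothetical stand-ins (`per_fs_mmap_lock()`, `wait_for_page_idle()`, an `int` refcount for the page), might look like:

```c
#include <stdbool.h>

/* Hypothetical model of the unlock-before-wait rule: waiting while the
 * mmap lock is held would deadlock in the kernel, so we record it. */
static bool lock_held;
static bool waited_with_lock_held;

static void per_fs_mmap_lock(void)   { lock_held = true; }
static void per_fs_mmap_unlock(void) { lock_held = false; }

/* Stand-in for wait_for_page_idle(): the "driver" drops its references.
 * Flags the bug if the caller is still holding the mmap lock. */
static void wait_for_page_idle(int *refcount)
{
	if (lock_held)
		waited_with_lock_held = true;	/* would deadlock */
	*refcount = 1;				/* page goes idle */
}

/* Model of the punch-hole path: lock, "unmap", check for a busy page,
 * and if one is found drop the lock, wait, and start over. Returns the
 * number of retries taken. */
static int truncate_range(int *refcount)
{
	int retries = 0;

retry:
	per_fs_mmap_lock();
	/* unmap_mapping_range(): no new references can be established */
	if (*refcount > 1) {
		per_fs_mmap_unlock();		/* avoid the GUP deadlock */
		wait_for_page_idle(refcount);
		retries++;
		goto retry;
	}
	/* truncate_inode_pages_range() would run here */
	per_fs_mmap_unlock();
	return retries;
}
```

Starting from a refcount of 3, one drop-and-wait cycle suffices in this model, and the wait never happens with the lock held, which is exactly the property Jan's deadlock scenario (GUP holding a ref to page A while faulting in page B) rules out for the naive version.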
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-29 23:46       ` Dan Williams
@ 2017-10-30  2:00         ` Dave Chinner
  2017-10-30  8:38           ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-10-30  2:00 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, Christoph Hellwig, Michal Hocko, Benjamin Herrenschmidt,
	Dave Hansen, Heiko Carstens, J. Bruce Fields, linux-mm,
	Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox,
	linux-rdma, Michael Ellerman, Jason Gunthorpe, Doug Ledford,
	Hal Rosenstock, linux-fsdevel, Alexander Viro, Gerald Schaefer,
	linux-nvdimm@lists.01.org, Linux Kernel Mailing List <linux-kernel>

On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
> On Thu, Oct 26, 2017 at 3:58 AM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> >> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> >> > I'd like to brainstorm how we can do something better.
> >> >
> >> > How about:
> >> >
> >> > If we hit a page with an elevated refcount in truncate / hole punch
> >> > etc for a DAX file system we do not free the blocks in the file system,
> >> > but add it to the extent busy list. We mark the page as delayed
> >> > free (e.g. page flag?) so that when it finally hits refcount zero we
> >> > call back into the file system to remove it from the busy list.
> >>
> >> Brainstorming some more:
> >>
> >> Given that on a DAX file there shouldn't be any long-term page
> >> references after we unmap it from the page table and don't allow
> >> get_user_pages calls why not wait for the references for all
> >> DAX pages to go away first? E.g. if we find a DAX page in
> >> truncate_inode_pages_range that has an elevated refcount we set
> >> a new flag to prevent new references from showing up, and then
> >> simply wait for it to go away. Instead of a busy wait we can
> >> do this through a few hashed waitqueues in dev_pagemap.
And in > >> fact put_zone_device_page already gets called when putting the > >> last page so we can handle the wakeup from there. > >> > >> In fact if we can't find a page flag for the stop new callers > >> things we could probably come up with a way to do that through > >> dev_pagemap somehow, but I'm not sure how efficient that would > >> be. > > > > We were talking about this yesterday with Dan so some more brainstorming > > from us. We can implement the solution with extent busy list in ext4 > > relatively easily - we already have such list currently similarly to XFS. > > There would be some modifications needed but nothing too complex. The > > biggest downside of this solution I see is that it requires per-filesystem > > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs > > may have problems and ext2 definitely will need some modifications. > > Invisible used blocks may be surprising to users at times although given > > page refs should be relatively short term, that should not be a big issue. > > But are we guaranteed page refs are short term? E.g. if someone creates > > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs > > can be rather long-term similarly as in RDMA case. Also freeing of blocks > > on page reference drop is another async entry point into the filesystem > > which could unpleasantly surprise us but I guess workqueues would solve > > that reasonably fine. > > > > WRT waiting for page refs to be dropped before proceeding with truncate (or > > punch hole for that matter - that case is even nastier since we don't have > > i_size to guard us). What I like about this solution is that it is very > > visible there's something unusual going on with the file being truncated / > > punched and so problems are easier to diagnose / fix from the admin side. 
> > So far we have guarded hole punching from concurrent faults (and > > get_user_pages() does fault once you do unmap_mapping_range()) with > > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page > > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most > > obvious case Dan came up with is when GUP obtains ref to page A, then hole > > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be > > dropped, and then GUP blocks on trying to fault in another page. > > > > I think we cannot easily prevent new page references to be grabbed as you > > write above since nobody expects stuff like get_page() to fail. But I > > think that unmapping relevant pages and then preventing them to be faulted > > in again is workable and stops GUP as well. The problem with that is though > > what to do with page faults to such pages - you cannot just fail them for > > hole punch, and you cannot easily allocate new blocks either. So we are > > back at a situation where we need to detach blocks from the inode and then > > wait for page refs to be dropped - so some form of busy extents. Am I > > missing something? > > Coming back to this since Dave has made clear that new locking to > coordinate get_user_pages() is a no-go. > > We can unmap to force new get_user_pages() attempts to block on the > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > to drop the mmap lock and wait. We need this lock dropped to get > around the problem that the driver will not start to drop page > references until it has elevated the page references on all the pages > in the I/O. If we need to drop the mmap lock that makes it impossible > to coordinate this unlock/retry loop within truncate_inode_pages_range > which would otherwise be the natural place to land this code. > > Would it be palatable to unmap and drain dma in any path that needs to > detach blocks from an inode? 
Something like the following that builds > on dax_wait_dma() tried to achieve, but does not introduce a new lock > for the fs to manage: > > retry: > per_fs_mmap_lock(inode); > unmap_mapping_range(mapping, start, end); /* new page references > cannot be established */ > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > per_fs_mmap_unlock(inode); /* new page references can happen, > so we need to start over */ > wait_for_page_idle(dax_page); > goto retry; > } > truncate_inode_pages_range(mapping, start, end); > per_fs_mmap_unlock(inode); These retry loops you keep proposing are just bloody horrible. They are basically just a method for blocking an operation until whatever condition is preventing the invalidation goes away. IMO, that's an ugly solution no matter how much lipstick you dress it up with. i.e. the blocking loops mean the user process is going to be blocked for arbitrary lengths of time. That's not a solution, it's just passing the buck - now the userspace developers need to work around truncate/hole punch being randomly blocked for arbitrary lengths of time. The whole point of pushing this into the busy extent list is that it doesn't require blocking operations. i.e the re-use of the underlying storage is simply delayed until notification that it is safe to re-use comes along, but the extent removal operation doesn't get blocked. That's how we treat extents that require discard operations after they have been freed - they remain in the busy list until the discard IO completion signals "all done" and clears the busy extent. Here we need to hold off clearing the extent until we get the "all done" from the dax code. e.g. 
what needs to happen when trying to do the invalidation is something like this (assuming invalidate_inode_pages2_range() will actually fail on pages under DMA):

    flags = 0;
    if (IS_DAX()) {
        error = invalidate_inode_pages2_range();
        if (error == -EBUSY && dax_dma_busy_page())
            flags = EXTENT_BUSY_DAX;
        else
            truncate_pagecache();    /* blocking */
    } else {
        truncate_pagecache();
    }

That EXTENT_BUSY_DAX flag needs to be carried all the way through to xfs_free_extent -> xfs_extent_busy_insert(). That's probably the most complex part of the patch. This flag then prevents xfs_extent_busy_reuse() from allowing reuse of the extent. And in xfs_extent_busy_clear(), they need to be treated sort of like discarded extents: on the transaction commit callback, we need to check if there are still busy daxdma pages over the extent range; if there are, we leave it in the busy list, otherwise it can be cleared. For everything that is left in the busy list, the dax dma code will need to call back into the filesystem when the relevant page is released; when the extent no longer has any dax dma busy pages left over it, it can be cleared from the list.

Once we have the dax code calling back into the filesystem when the problematic daxdma pages are released, everything else should be relatively straightforward...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support 2017-10-30 2:00 ` Dave Chinner @ 2017-10-30 8:38 ` Jan Kara 2017-10-30 11:20 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Jan Kara @ 2017-10-30 8:38 UTC (permalink / raw) To: Dave Chinner Cc: Dan Williams, Jan Kara, Christoph Hellwig, Michal Hocko, Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman, Jason Gunthorpe, Doug Ledford, Hal Rosenstock, linux-fsdevel, Alexander Viro, Gerald Schaefer, linux-nvdimm@lists.01.org Hi, On Mon 30-10-17 13:00:23, Dave Chinner wrote: > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: > > Coming back to this since Dave has made clear that new locking to > > coordinate get_user_pages() is a no-go. > > > > We can unmap to force new get_user_pages() attempts to block on the > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > > to drop the mmap lock and wait. We need this lock dropped to get > > around the problem that the driver will not start to drop page > > references until it has elevated the page references on all the pages > > in the I/O. If we need to drop the mmap lock that makes it impossible > > to coordinate this unlock/retry loop within truncate_inode_pages_range > > which would otherwise be the natural place to land this code. > > > > Would it be palatable to unmap and drain dma in any path that needs to > > detach blocks from an inode? 
Something like the following that builds > > on dax_wait_dma() tried to achieve, but does not introduce a new lock > > for the fs to manage: > > > > retry: > > per_fs_mmap_lock(inode); > > unmap_mapping_range(mapping, start, end); /* new page references > > cannot be established */ > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > > per_fs_mmap_unlock(inode); /* new page references can happen, > > so we need to start over */ > > wait_for_page_idle(dax_page); > > goto retry; > > } > > truncate_inode_pages_range(mapping, start, end); > > per_fs_mmap_unlock(inode); > > These retry loops you keep proposing are just bloody horrible. They > are basically just a method for blocking an operation until whatever > condition is preventing the invalidation goes away. IMO, that's an > ugly solution no matter how much lipstick you dress it up with. > > i.e. the blocking loops mean the user process is going to be blocked > for arbitrary lengths of time. That's not a solution, it's just > passing the buck - now the userspace developers need to work around > truncate/hole punch being randomly blocked for arbitrary lengths of > time. So I see substantial difference between how you and Christoph think this should be handled. Christoph writes in [1]: The point is that we need to prohibit long term elevated page counts with DAX anyway - we can't just let people grab allocated blocks forever while ignoring file system operations. For stage 1 we'll just need to fail those, and in the long run they will have to use a mechanism similar to FL_LAYOUT locks to deal with file system allocation changes. So Christoph wants to block truncate until references are released, forbid long term references until userspace acquiring them supports some kind of lease-breaking. OTOH you suggest truncate should just proceed leaving blocks allocated until references are released. We cannot have both... 
I'm leaning more towards the approach Christoph suggests as it puts the burden on the place which is causing it - the application having long term references - and applications needing this should be sufficiently rare that we don't have to devise a general mechanism in the kernel for this.

If the solution Christoph suggests is acceptable to you, I think we should first write a patch to forbid acquiring long term references to DAX blocks. On top of that we can implement a mechanism to block truncate while there are short term references pending (and for that retry loops would be IMHO acceptable). And then we can work on a mechanism to notify userspace that it needs to drop references to blocks that are going to be truncated so that we can re-enable taking of long term references.

								Honza

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1522887.html
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support 2017-10-30 8:38 ` Jan Kara @ 2017-10-30 11:20 ` Dave Chinner 2017-10-30 17:51 ` Dan Williams 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2017-10-30 11:20 UTC (permalink / raw) To: Jan Kara Cc: Dan Williams, Christoph Hellwig, Michal Hocko, Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman, Jason Gunthorpe, Doug Ledford, Hal Rosenstock, linux-fsdevel, Alexander Viro, Gerald Schaefer, linux-nvdimm@lists.01.org, Linux On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: > Hi, > > On Mon 30-10-17 13:00:23, Dave Chinner wrote: > > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: > > > Coming back to this since Dave has made clear that new locking to > > > coordinate get_user_pages() is a no-go. > > > > > > We can unmap to force new get_user_pages() attempts to block on the > > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > > > to drop the mmap lock and wait. We need this lock dropped to get > > > around the problem that the driver will not start to drop page > > > references until it has elevated the page references on all the pages > > > in the I/O. If we need to drop the mmap lock that makes it impossible > > > to coordinate this unlock/retry loop within truncate_inode_pages_range > > > which would otherwise be the natural place to land this code. > > > > > > Would it be palatable to unmap and drain dma in any path that needs to > > > detach blocks from an inode? 
Something like the following that builds > > > on dax_wait_dma() tried to achieve, but does not introduce a new lock > > > for the fs to manage: > > > > > > retry: > > > per_fs_mmap_lock(inode); > > > unmap_mapping_range(mapping, start, end); /* new page references > > > cannot be established */ > > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > > > per_fs_mmap_unlock(inode); /* new page references can happen, > > > so we need to start over */ > > > wait_for_page_idle(dax_page); > > > goto retry; > > > } > > > truncate_inode_pages_range(mapping, start, end); > > > per_fs_mmap_unlock(inode); > > > > These retry loops you keep proposing are just bloody horrible. They > > are basically just a method for blocking an operation until whatever > > condition is preventing the invalidation goes away. IMO, that's an > > ugly solution no matter how much lipstick you dress it up with. > > > > i.e. the blocking loops mean the user process is going to be blocked > > for arbitrary lengths of time. That's not a solution, it's just > > passing the buck - now the userspace developers need to work around > > truncate/hole punch being randomly blocked for arbitrary lengths of > > time. > > So I see substantial difference between how you and Christoph think this > should be handled. Christoph writes in [1]: > > The point is that we need to prohibit long term elevated page counts > with DAX anyway - we can't just let people grab allocated blocks forever > while ignoring file system operations. For stage 1 we'll just need to > fail those, and in the long run they will have to use a mechanism > similar to FL_LAYOUT locks to deal with file system allocation changes. > > So Christoph wants to block truncate until references are released, forbid > long term references until userspace acquiring them supports some kind of > lease-breaking. OTOH you suggest truncate should just proceed leaving > blocks allocated until references are released. 
I don't see that what I'm suggesting is a solution to long term elevated page counts. Just something that can park extents until layout leases are broken and references released. That's a few tens of seconds at most.

> We cannot have both... I'm leaning more towards the approach
> Christoph suggests as it puts the burden on the place which is
> causing it - the application having long term references - and
> applications needing this should be sufficiently rare that we
> don't have to devise a general mechanism in the kernel for this.

I have no problems with blocking truncate forever if that's the desired solution for an elevated page count due to a DMA reference to a page. But that has absolutely nothing to do with the filesystem - it's a page reference vs mapping invalidation problem, not a filesystem/inode problem. Perhaps pages with active DAX DMA mapping references need a page flag to indicate that invalidation must block on the page, similar to the writeback flag...

> If the solution Christoph suggests is acceptable to you, I think
> we should first write a patch to forbid acquiring long term
> references to DAX blocks. On top of that we can implement a
> mechanism to block truncate while there are short term references
> pending (and for that retry loops would be IMHO acceptable).

The problem with retry loops is that they are making a mess of an already complex set of locking constraints on the inode IO path. It's rapidly descending into an unmaintainable mess - falling off the locking cliff only makes the code harder to maintain - please look for solutions that don't require new locks or lock retry loops.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support 2017-10-30 11:20 ` Dave Chinner @ 2017-10-30 17:51 ` Dan Williams 0 siblings, 0 replies; 15+ messages in thread From: Dan Williams @ 2017-10-30 17:51 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Michal Hocko, Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras, Jeff Layton, Sean Hefty, Matthew Wilcox, linux-rdma, Michael Ellerman, Christoph Hellwig, Jason Gunthorpe, Doug Ledford, Hal Rosenstock, Martin Schwidefsky, Alexander Viro, Gerald Schaefer, linux-nvdimm@lists.01.org, Linux On Mon, Oct 30, 2017 at 4:20 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: >> Hi, >> >> On Mon 30-10-17 13:00:23, Dave Chinner wrote: >> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: >> > > Coming back to this since Dave has made clear that new locking to >> > > coordinate get_user_pages() is a no-go. >> > > >> > > We can unmap to force new get_user_pages() attempts to block on the >> > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs >> > > to drop the mmap lock and wait. We need this lock dropped to get >> > > around the problem that the driver will not start to drop page >> > > references until it has elevated the page references on all the pages >> > > in the I/O. If we need to drop the mmap lock that makes it impossible >> > > to coordinate this unlock/retry loop within truncate_inode_pages_range >> > > which would otherwise be the natural place to land this code. >> > > >> > > Would it be palatable to unmap and drain dma in any path that needs to >> > > detach blocks from an inode? 
Something like the following that builds >> > > on dax_wait_dma() tried to achieve, but does not introduce a new lock >> > > for the fs to manage: >> > > >> > > retry: >> > > per_fs_mmap_lock(inode); >> > > unmap_mapping_range(mapping, start, end); /* new page references >> > > cannot be established */ >> > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { >> > > per_fs_mmap_unlock(inode); /* new page references can happen, >> > > so we need to start over */ >> > > wait_for_page_idle(dax_page); >> > > goto retry; >> > > } >> > > truncate_inode_pages_range(mapping, start, end); >> > > per_fs_mmap_unlock(inode); >> > >> > These retry loops you keep proposing are just bloody horrible. They >> > are basically just a method for blocking an operation until whatever >> > condition is preventing the invalidation goes away. IMO, that's an >> > ugly solution no matter how much lipstick you dress it up with. >> > >> > i.e. the blocking loops mean the user process is going to be blocked >> > for arbitrary lengths of time. That's not a solution, it's just >> > passing the buck - now the userspace developers need to work around >> > truncate/hole punch being randomly blocked for arbitrary lengths of >> > time. >> >> So I see substantial difference between how you and Christoph think this >> should be handled. Christoph writes in [1]: >> >> The point is that we need to prohibit long term elevated page counts >> with DAX anyway - we can't just let people grab allocated blocks forever >> while ignoring file system operations. For stage 1 we'll just need to >> fail those, and in the long run they will have to use a mechanism >> similar to FL_LAYOUT locks to deal with file system allocation changes. >> >> So Christoph wants to block truncate until references are released, forbid >> long term references until userspace acquiring them supports some kind of >> lease-breaking. 
OTOH you suggest truncate should just proceed leaving >> blocks allocated until references are released. > > I don't see what I'm suggesting is a solution to long term elevated > page counts. Just something that can park extents until layout > leases are broken and references released. That's a few tens of > seconds at most. > >> We cannot have both... I'm leaning more towards the approach >> Christoph suggests as it puts the burned to the place which is >> causing it - the application having long term references - and >> applications needing this should be sufficiently rare that we >> don't have to devise a general mechanism in the kernel for this. > > I have no problems with blocking truncate forever if that's the > desired solution for an elevated page count due to a DMA reference > to a page. But that has absolutely nothing to do with the filesystem > though - it's a page reference vs mapping invalidation problem, not > a filesystem/inode problem. > > Perhaps pages with active DAX DMA mapping references need a page > flag to indicate that invalidation must block on the page similar to > the writeback flag... We effectively already have this flag since pages where is_zone_device_page() == true can only have their reference count elevated by get_user_pages(). More importantly we can not block invalidation on an elevated page count because that page count may never drop until all references have been acquired. I.e. iov_iter_get_pages() grabs a range of pages potentially across multiple vmas and does not drop any references in the range until all pages have had their count elevated. >> If the solution Christoph suggests is acceptable to you, I think >> we should first write a patch to forbid acquiring long term >> references to DAX blocks. On top of that we can implement >> mechanism to block truncate while there are short term references >> pending (and for that retry loops would be IMHO acceptable). 
>
> The problem with retry loops is that they are making a mess of an
> already complex set of locking constraints on the inode IO path. It's
> rapidly descending into an unmaintainable mess - falling off the
> locking cliff only makes the code harder to maintain - please look
> for solutions that don't require new locks or lock retry loops.

I was hoping to make the retry loop no worse than the one we already perform for xfs_break_layouts(), so that the approach can be easily shared between ext4 and xfs. However, before we get there we need quite a bit of rework (require struct page for dax, use pfns in the dax radix, disable long-held page reference counts for DAX, i.e. RDMA / V4L2...). I'll submit those preparation steps first and then we can circle back to the "how to wait for DAX-DMA to end" problem.
end of thread, other threads: [~2017-10-30 17:51 UTC | newest]

Thread overview: 15+ messages
2017-10-20  2:38 [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Dan Williams
2017-10-20  2:39 ` [PATCH v3 09/13] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
2017-10-20  7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
2017-10-20  9:31   ` Christoph Hellwig
2017-10-26 10:58     ` Jan Kara
2017-10-26 23:51       ` Williams, Dan J
2017-10-27  6:48         ` Dave Chinner
2017-10-27 11:42           ` Dan Williams
2017-10-29 21:52             ` Dave Chinner
2017-10-27  6:45         ` Christoph Hellwig
2017-10-29 23:46           ` Dan Williams
2017-10-30  2:00             ` Dave Chinner
2017-10-30  8:38               ` Jan Kara
2017-10-30 11:20                 ` Dave Chinner
2017-10-30 17:51                   ` Dan Williams