From: Dave Chinner <david@fromorbit.com>
To: "Williams, Dan J" <dan.j.williams@intel.com>
Cc: "hch@lst.de" <hch@lst.de>, "jack@suse.cz" <jack@suse.cz>,
"schwidefsky@de.ibm.com" <schwidefsky@de.ibm.com>,
"darrick.wong@oracle.com" <darrick.wong@oracle.com>,
"dledford@redhat.com" <dledford@redhat.com>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"bfields@fieldses.org" <bfields@fieldses.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"heiko.carstens@de.ibm.com" <heiko.carstens@de.ibm.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"jmoyer@redhat.com" <jmoyer@redhat.com>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"kirill.shutemov@linux.intel.com"
<kirill.shutemov@linux.intel.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"Hefty, Sean" <sean.hefty@intel.com>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
"jlayton@poochiereds.net" <jlayton@poochiereds.net>,
"mawilcox@microsoft.com" <mawilcox@microsoft.com>,
"mhocko@suse.com" <mhocko@suse.com>,
"ross.zwisler@linux.intel.com" <ross.zwisler@linux.intel.com>,
"gerald.schaefer@de.ibm.com" <gerald.schaefer@de.ibm.com>,
"jgunthorpe@obsidianresearch.com"
<jgunthorpe@obsidianresearch.com>,
"hal.rosenstock@gmail.com" <hal.rosenstock@gmail.com>,
"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
"mpe@ellerman.id.au" <mpe@ellerman.id.au>,
"paulus@samba.org" <paulus@samba.org>
Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
Date: Fri, 27 Oct 2017 17:48:54 +1100
Message-ID: <20171027064854.GE3666@dastard>
In-Reply-To: <1509061831.25213.2.camel@intel.com>
On Thu, Oct 26, 2017 at 11:51:04PM +0000, Williams, Dan J wrote:
> On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > > I'd like to brainstorm how we can do something better.
> > > >
> > > > How about:
> > > >
> > > > If we hit a page with an elevated refcount in truncate / hole punch
> > > > etc for a DAX file system we do not free the blocks in the file system,
> > > > but add it to the extent busy list. We mark the page as delayed
> > > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > > call back into the file system to remove it from the busy list.
> > >
> > > Brainstorming some more:
> > >
> > > Given that on a DAX file there shouldn't be any long-term page
> > > references after we unmap it from the page table and don't allow
> > > get_user_pages calls why not wait for the references for all
> > > DAX pages to go away first? E.g. if we find a DAX page in
> > > truncate_inode_pages_range that has an elevated refcount we set
> > > a new flag to prevent new references from showing up, and then
> > > simply wait for it to go away. Instead of a busy wait we can
> > > do this through a few hashed waitqueues in dev_pagemap. And in
> > > fact put_zone_device_page already gets called when putting the
> > > last page so we can handle the wakeup from there.
> > >
> > > In fact if we can't find a page flag for the stop-new-callers
> > > thing we could probably come up with a way to do that through
> > > dev_pagemap somehow, but I'm not sure how efficient that would
> > > be.
> >
> > We were talking about this yesterday with Dan so some more brainstorming
> > from us. We can implement the solution with extent busy list in ext4
> > relatively easily - we already have such a list currently, similarly to XFS.
> > There would be some modifications needed but nothing too complex. The
> > biggest downside of this solution I see is that it requires per-filesystem
> > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> > may have problems and ext2 definitely will need some modifications.
> > Invisible used blocks may be surprising to users at times although given
> > page refs should be relatively short term, that should not be a big issue.
> > But are we guaranteed page refs are short term? E.g. if someone creates
> > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> > can be rather long-term similarly as in RDMA case. Also freeing of blocks
> > on page reference drop is another async entry point into the filesystem
> > which could unpleasantly surprise us but I guess workqueues would solve
> > that reasonably fine.
> >
> > WRT waiting for page refs to be dropped before proceeding with truncate (or
> > punch hole for that matter - that case is even nastier since we don't have
> > i_size to guard us). What I like about this solution is that it is very
> > visible there's something unusual going on with the file being truncated /
> > punched and so problems are easier to diagnose / fix from the admin side.
> > So far we have guarded hole punching from concurrent faults (and
> > get_user_pages() does fault once you do unmap_mapping_range()) with
> > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> > obvious case Dan came up with is when GUP obtains ref to page A, then hole
> > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> > dropped, and then GUP blocks on trying to fault in another page.
> >
> > I think we cannot easily prevent new page references to be grabbed as you
> > write above since nobody expects stuff like get_page() to fail. But I
> > think that unmapping relevant pages and then preventing them to be faulted
> > in again is workable and stops GUP as well. The problem with that is though
> > what to do with page faults to such pages - you cannot just fail them for
> > hole punch, and you cannot easily allocate new blocks either. So we are
> > back at a situation where we need to detach blocks from the inode and then
> > wait for page refs to be dropped - so some form of busy extents. Am I
> > missing something?
> >
>
> No, that's a good summary of what we talked about. However, I did go
> back and give the new lock approach a try and was able to get my test
> to pass. The new locking is not pretty especially since you need to
> drop and reacquire the lock so that get_user_pages() can finish
> grabbing all the pages it needs. Here are the two primary patches in
> the series, do you think the extent-busy approach would be cleaner?
The XFS_DAXDMA....
$DEITY that patch is so ugly I can't even bring myself to type it.
-Dave.
--
Dave Chinner
david@fromorbit.com