public inbox for linux-kernel@vger.kernel.org
From: Jan Kara <jack@suse.cz>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-xfs@vger.kernel.org, Jeff Moyer <jmoyer@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andy Lutomirski <luto@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Linux API <linux-api@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
Date: Wed, 16 Aug 2017 15:57:58 +0200
Message-ID: <20170816135758.GD4898@quack2.suse.cz>
In-Reply-To: <CAPcyv4hFTn4Fz5o+Gm857mS-RA6WAVsf4CmwiLiK2O8w2_SamQ@mail.gmail.com>

On Tue 15-08-17 16:50:55, Dan Williams wrote:
> On Tue, Aug 15, 2017 at 1:37 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 14-08-17 09:14:42, Dan Williams wrote:
> >> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> > That being said I think we absolutely should support RDMA memory
> >> >> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> >> >> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> >> >> > all the blocks are populated and all ptes are set up.  Second we need
> >> >> > to make sure get_user_pages works, which for now means we'll need a
> >> >> > struct page mapping for the region (which will be really annoying
> >> >> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> >> >> > and we need to guarantee that the extent mapping won't change while
> >> >> > the get_user_pages holds the pages inside it.  I think that is true
> >> >> > due to side effects even with the current DAX code, but we'll need to
> >> >> > make it explicit.  And maybe that's where we need to converge -
> >> >> > "sealing" the extent map makes sense as such a temporary measure
> >> >> > that is not persisted on disk, which automatically gets released
> >> >> > when the holding process exits, because we sort of already do this
> >> >> > implicitly.  It might also make sense to have explicitly breakable
> >> >> > seals similar to what I do for the pNFS blocks kernel server, as
> >> >> > any userspace RDMA file server would also need those semantics.
> >> >>
> >> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >> >>
> >> >>     1/ only succeed if the fault can be satisfied without page cache
> >> >>
> >> >>     2/ only install a pte for the fault if it can do so without
> >> >> triggering block map updates
> >> >>
> >> >> So, I think it would still end up setting an inode flag to make
> >> >> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> >> >> active. However, it would not record that state in the on-disk
> >> >> metadata and it would automatically clear at munmap time. That should
> >> >> be enough to support the host-persistent-memory, and
> >> >> NVMe-persistent-memory use cases (provided we have struct page for
> >> >> NVMe). Although, we need more safety infrastructure in the NVMe case
> >> >> where we would need to software manage I/O coherence.
> >> >
> >> > Hum, this proposal (and the problems you are trying to deal with) seem very
> >> > similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> >> > the DAX area (and so additionally complicated by the fact that filesystems
> >> > now have to care). The patch set was not merged due to lack of interest I
> >> > think but it looked sensible and the proposed API would make sense for more
> >> > stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
> >>
> >> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
> >> "no-fault" guarantee and fixes the accounting of locked System RAM.
> >> MAP_DIRECT still allows faults, and DAX mappings don't consume System
> >> RAM so the accounting problem is not there for DAX. mm_mpin() also does
> >> not appear to have a relationship to file-backed memory like mmap
> >> allows.
> >
> > So the accounting part is probably non-interesting for DAX purposes and I
> > agree there are other differences as well. But mm_mpin() prevented page
> > migrations which is parallel to your requirement of "offset->block mapping
> > is permanent".  Furthermore mm_mpin() work was there for RDMA so that it
> > has a saner interface to pin pages than get_user_pages() and you mention RDMA
> > and similar technologies as a usecase for your work for similar reasons.
> > So my thought was that possibly we should have the same API for pinning
> > "storage" for RDMA transfers regardless of whether the backing is page
> > cache or pmem and the API should be usable for in-kernel users as well?
> > An mmap flag seems a bit clumsy in this regard so maybe a form of a separate
> > syscall - be it mpin(start, len) or some other name - might be more
> > suitable?
> 
> Can you say more about why an mmap flag for this feels awkward
> to you? I think there's symmetry between O_SYNC / O_DIRECT setting up
> synchronous / page-cache-bypass file descriptors and MAP_SYNC /
> MAP_DIRECT setting up synchronous and page-cache bypass mappings.

My thinking was that in-kernel users might find an mmap flag harder to use
directly, as they generally don't need to set up a mapping at all. But that
can certainly be dealt with by proper helpers for in-kernel users.

> "Pinning" also feels like the wrong mechanism when you consider
> hardware is moving toward eliminating the pinning requirement over
> time. SVM "Shared Virtual Memory" hardware will just operate on cpu
> virtual addresses directly and generate typical faults. On such
> hardware MAP_DIRECT would be a nop relative to MAP_SYNC, so you
> wouldn't want your application to be stuck with the legacy concept
> that pages need to be explicitly "pinned".

OK, makes sense.
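
For reference, the lifetime rules discussed above (block map changes are
refused while any MAP_DIRECT mapping is live, the state is kept only in
memory, and it clears at munmap time) can be modelled in a few lines of
user-space C. This is only a sketch under loud assumptions: every toy_*
name is made up for illustration, the error codes are arbitrary stand-ins,
and none of this is kernel code or the interface from the patch set.

```c
/* Toy user-space model. All toy_* names are hypothetical, not kernel API. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_inode {
    int  direct_maps;  /* live MAP_DIRECT-style mappings, in-memory only */
    long nr_blocks;    /* stand-in for the extent map: blocks 0..nr_blocks-1 */
};

/* mmap(MAP_DIRECT) analogue: each mapping takes one reference. */
static void toy_map_direct(struct toy_inode *ino)
{
    ino->direct_maps++;
}

/* munmap() analogue: dropping the last reference unseals the block map. */
static void toy_unmap_direct(struct toy_inode *ino)
{
    assert(ino->direct_maps > 0);
    ino->direct_maps--;
}

/* The block map is sealed while at least one direct mapping exists. */
static bool toy_block_map_sealed(const struct toy_inode *ino)
{
    return ino->direct_maps > 0;
}

/*
 * xfs_bmapi_write() analogue: refuse to change the block map while sealed.
 * -ETXTBSY is an arbitrary choice for this sketch.
 */
static int toy_alloc_blocks(struct toy_inode *ino, long count)
{
    if (toy_block_map_sealed(ino))
        return -ETXTBSY;
    ino->nr_blocks += count;
    return 0;
}

/*
 * Fault analogue for a direct mapping: succeed only when the block is
 * already allocated, so no block map update is needed. -EFAULT stands in
 * for whatever failure the real fault path would report.
 */
static int toy_fault_direct(const struct toy_inode *ino, long block)
{
    if (block >= 0 && block < ino->nr_blocks)
        return 0;
    return -EFAULT;
}
```

With an inode that has 4 blocks, toy_alloc_blocks() succeeds before
toy_map_direct(), fails while the mapping is held, and succeeds again after
toy_unmap_direct(); toy_fault_direct() only ever succeeds for blocks that
were allocated beforehand.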

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
