From: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
To: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
"linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org"
<linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org>,
Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
"linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
linux-xfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Alexander Viro
<viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>,
Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
linux-fsdevel
<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Subject: Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
Date: Mon, 14 Aug 2017 14:46:15 -0700
Message-ID: <20170814214615.GB4796@magnolia>
In-Reply-To: <CAPcyv4ixTgSWG9K2Eg3XJmOvqJht81qL+Z3njoOjcXCD7XMpZw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Sun, Aug 13, 2017 at 01:31:45PM -0700, Dan Williams wrote:
> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> > On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
> >> The application does not need to know the storage address, it needs to
> >> know that the storage address to file offset is fixed. With this
> >> information it can make assumptions about the permanence of results it
> >> gets from the kernel.
> >
> > Only if we clearly document that fact - and documenting the permanence
> > is different from saying the block map won't change.
>
> I can get on board with that.
>
> >
> >> For example get_user_pages() today makes no guarantees outside of
> >> "page will not be freed",
> >
> > It also makes the extremely important guarantee that the page won't
> > _move_ - e.g. that we won't do a memory migration for compaction or
> > other reasons. That's why, for example, RDMA can use it to register
> > memory, and then we can later set up memory windows that point to this
> > registration from userspace and implement userspace RDMA.
> >
> >> but with immutable files and dax you now
> >> have a mechanism for userspace to coordinate direct access to storage
> >> addresses. Those raw storage addresses need not be exposed to the
> >> application, as you say it doesn't need to know that detail. MAP_SYNC
> >> does not fully satisfy this case because it requires agents that can
> >> generate MMU faults to coordinate with the filesystem.
> >
> > The file system is always in the fault path; can you explain what
> > other agents you are talking about?
>
> Exactly the ones you mention below. SVM hardware can just use a
> MAP_SYNC mapping and be sure that its metadata dirtying writes are
> synchronized with the filesystem through the fault path. Hardware that
> does not have SVM, or hypervisors like Xen that want to attach their
> own static metadata about the file offset to physical block mapping,
> need a mechanism to make sure the block map is sealed while they have
> it mapped.
>
> >> All I know is that SMB Direct for persistent memory seems like a
> >> potential consumer. I know they're not going to use a userspace
> >> filesystem or put an SMB server in the kernel.
> >
> > Last I talked to the Samba folks they didn't expect a userspace
> > SMB direct implementation to work anyway due to the fact that
> > libibverbs memory registrations interact badly with their fork()ing
> > daemon model. That being said during the recent submission of the
> > RDMA client code some comments were made about userspace versions of
> > it, so I'm not sure if that opinion has changed in one way or another.
>
> Ok.
>
> >
> > That being said I think we absolutely should support RDMA memory
> > registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE
> > helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure
> > all the blocks are populated and all ptes are set up. Second we need
> > to make sure get_user_pages works, which for now means we'll need a
> > struct page mapping for the region (which will be really annoying
> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> > and we need to guarantee that the extent mapping won't change while
> > get_user_pages holds the pages inside it. I think that is true
> > due to side effects even with the current DAX code, but we'll need to
> > make it explicit. And maybe that's where we need to converge -
> > "sealing" the extent map makes sense as such a temporary measure
> > that is not persisted on disk, which automatically gets released
> > when the holding process exits, because we sort of already do this
> > implicitly. It might also make sense to have explicitly breakable
> > seals similar to what I do for the pNFS block kernel server, as
> > any userspace RDMA file server would also need those semantics.
>
> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>
> 1/ only succeed if the fault can be satisfied without page cache
>
> 2/ only install a pte for the fault if it can do so without
> triggering block map updates
>
> So, I think it would still end up setting an inode flag to make
> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> active. However, it would not record that state in the on-disk
> metadata and it would automatically clear at munmap time. That should
TBH even after the last round of 'do we need this on-disk flag?' I still
wasn't 100% convinced that we really needed a permanent flag vs.
requiring apps to ask for a sealed iomap mmap like what you just
described, so I'm glad this conversation has continued. :)
--D
> be enough to support the host-persistent-memory, and
> NVMe-persistent-memory use cases (provided we have struct page for
> NVMe). That said, we need more safety infrastructure in the NVMe case,
> where we would need to software-manage I/O coherence.
>
> > Last but not least we have any interesting additional case for modern
> > Mellanox hardware - On Demand Paging where we don't actually do a
> > get_user_pages but the hardware implements SVM and thus gets fed
> > virtual addresses directly. My head spins when talking about the
> > implications for DAX mappings on that, so I'm just throwing that in
> > for now instead of trying to come up with a solution.
>
> Yeah, DAX + SVM needs more thought.