From: Matthew Wilcox <willy@linux.intel.com>
To: Zach Brown <zab@zabbo.net>
Cc: Matthew Wilcox <willy@linux.intel.com>, linux-fsdevel@vger.kernel.org
Subject: Re: direct_access, pinning and truncation
Date: Thu, 9 Oct 2014 12:44:44 -0400
Message-ID: <20141009164444.GO5098@wil.cx>
In-Reply-To: <20141008232132.GA10656@lenny.home.zabbo.net>

On Wed, Oct 08, 2014 at 04:21:32PM -0700, Zach Brown wrote:
> [... figuring out how g_u_p() references can prevent freeing and
> re-using the underlying mapped pmem addresses given the lack of struct
> pages for the mapping]
>
> > I see three solutions here:
> >
> > 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> > the caller with the struct pages of the DRAM. Modify DAX to handle some
> > file pages being in the page cache, and make sure that we know whether
> > the PMEM or DRAM is up to date. This has the obvious downside that
> > get_user_pages() becomes slow.
>
> And serialize transitions and fs stores to pmem regions. And now
> storing to dram-fronted pmem goes through all the dirtying and writeback
> machinery. This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really. It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.

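The bookkeeping option 1 asks for (knowing whether the PMEM or the DRAM copy
is up to date) might look roughly like the userspace sketch below. It is an
illustration under stated assumptions, not DAX code: struct dax_page,
pin_page(), store_byte() and writeback_page() are hypothetical names, and the
per-page lock that would serialise these transitions is left out to keep the
example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SZ 4096

/* Which copy of the page currently holds the authoritative data. */
enum copy_state { PMEM_CURRENT, DRAM_CURRENT };

/* Hypothetical per-page tracking structure (not a kernel type). */
struct dax_page {
    unsigned char *pmem;          /* persistent copy, always present */
    unsigned char dram[PAGE_SZ];  /* bounce copy handed out to pinners */
    enum copy_state state;
};

/* Pin path: copy PMEM into DRAM once, then hand out the DRAM copy. */
static unsigned char *pin_page(struct dax_page *p)
{
    if (p->state == PMEM_CURRENT) {
        memcpy(p->dram, p->pmem, PAGE_SZ);
        p->state = DRAM_CURRENT;
    }
    return p->dram;
}

/* Store path: write to whichever copy is authoritative right now. */
static void store_byte(struct dax_page *p, size_t off, unsigned char v)
{
    assert(off < PAGE_SZ);
    if (p->state == DRAM_CURRENT)
        p->dram[off] = v;
    else
        p->pmem[off] = v;
}

/*
 * Writeback once the pin is gone. The pinner may have written through the
 * pointer it was given, so the DRAM copy is always treated as dirty.
 */
static void writeback_page(struct dax_page *p)
{
    if (p->state == DRAM_CURRENT) {
        memcpy(p->pmem, p->dram, PAGE_SZ);
        p->state = PMEM_CURRENT;
    }
}
```

Every store has to check the state first, which is where the cost Zach
describes comes from: a page that has been pinned goes through a
copy-and-writeback cycle that a plain DAX store never needs.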
> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks. We may be able to do better by
> > using a tree of pinned extents or something. This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
>
> This seems.. doable? Recording the referenced pmem in free lists in the
> fs is fine as long as the pmem isn't modified until the references are
> released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

> Maybe in the allocator you skip otherwise free blocks if they intersect
> with the run time structure (rbtree of extents, presumably) that is
> taking the place of reference counts in struct page. There aren't
> *that* many allocator entry points. I guess you'd need to avoid other
> modifications of free space like trimming :/. It still seems reasonably
> doable?

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory. Nice.

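A rough userspace sketch of that scheme, to make the moving parts concrete.
It is not kernel code and makes several assumptions: a small fixed table
stands in for the rbtree of extents, a bool array stands in for the
filesystem's free-space map, and every name here (pin_table, pin_extent(),
unpin_extent(), alloc_block()) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PINS 16            /* hypothetical fixed capacity, illustration only */
#define NO_BLOCK UINT64_MAX    /* returned when nothing can be allocated */

/* One pinned extent: blocks [start, end) still referenced by a pinner. */
struct pinned_extent {
    uint64_t start;
    uint64_t end;
    bool in_use;
};

/* Runtime-only state; nothing in here is ever written to the medium. */
struct pin_table {
    struct pinned_extent slots[MAX_PINS];
};

/* Record a pin. Returns the slot index, or -1 if the table is full. */
static int pin_extent(struct pin_table *t, uint64_t start, uint64_t end)
{
    assert(start < end);
    for (int i = 0; i < MAX_PINS; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i] = (struct pinned_extent){ start, end, true };
            return i;
        }
    }
    return -1;
}

/* Drop a pin previously returned by pin_extent(). */
static void unpin_extent(struct pin_table *t, int slot)
{
    assert(slot >= 0 && slot < MAX_PINS);
    t->slots[slot].in_use = false;
}

static bool block_is_pinned(const struct pin_table *t, uint64_t blk)
{
    for (int i = 0; i < MAX_PINS; i++) {
        const struct pinned_extent *e = &t->slots[i];
        if (e->in_use && blk >= e->start && blk < e->end)
            return true;
    }
    return false;
}

/*
 * Allocator entry point: take the first block that the free map marks free
 * and that no pin covers. Free-but-pinned blocks are skipped, not consumed.
 */
static uint64_t alloc_block(bool *is_free, uint64_t nblocks,
                            const struct pin_table *t)
{
    for (uint64_t b = 0; b < nblocks; b++) {
        if (is_free[b] && !block_is_pinned(t, b)) {
            is_free[b] = false;
            return b;
        }
    }
    return NO_BLOCK;
}
```

A truncated block goes back into the free map immediately, so the persistent
structures stay correct, but alloc_block() will not hand it out until the
matching unpin_extent(). Since the pin table lives only in memory, it simply
vanishes on reboot along with the pins it described.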
> And hey, lord knows we love to implement rbtrees of extents in file
> systems! (btrfs: struct extent_state, ext4: struct extent_status)
>
> The tricky part would be maintaining that structure behind g_u_p() and
> put_page() calls. Probably a richer interface that gives callers
> something more than just raw page pointers.
>
> > 3. Make truncate() block if it hits a pinned page. There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone. We actually already have code
> > for this; inode_dio_wait() / inode_dio_done(). But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
>
> I have no concrete examples, but I agree that it sounds like the sort of
> thing that would bite us in the ass if we miss some use case :/.
>
> I guess my initial vote is for trying a less-than-perfect prototype of
> #2 to see just how hairy the rough outline gets.

Thinking about it now, it seems less hairy than I initially thought. I'll
give it a quick try and see how it goes.