From: Matthew Wilcox <willy@linux.intel.com>
To: Zach Brown <zab@zabbo.net>
Cc: Matthew Wilcox <willy@linux.intel.com>, linux-fsdevel@vger.kernel.org
Subject: Re: direct_access, pinning and truncation
Date: Thu, 9 Oct 2014 12:44:44 -0400
Message-ID: <20141009164444.GO5098@wil.cx>
In-Reply-To: <20141008232132.GA10656@lenny.home.zabbo.net>

On Wed, Oct 08, 2014 at 04:21:32PM -0700, Zach Brown wrote:
> [... figuring out how g_u_p() references can prevent freeing and
> re-using the underlying mapped pmem addresses given the lack of struct
> pages for the mapping]
>
> > I see three solutions here:
> >
> > 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> > the caller with the struct pages of the DRAM. Modify DAX to handle some
> > file pages being in the page cache, and make sure that we know whether
> > the PMEM or DRAM is up to date. This has the obvious downside that
> > get_user_pages() becomes slow.
>
> And serialize transitions and fs stores to pmem regions. And now
> storing to dram-fronted pmem goes through all the dirtying and writeback
> machinery. This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really. It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.

> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks. We may be able to do better by
> > using a tree of pinned extents or something. This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
>
> This seems.. doable? Recording the referenced pmem in free lists in the
> fs is fine as long as the pmem isn't modified until the references are
> released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

> Maybe in the allocator you skip otherwise free blocks if they intersect
> with the run time structure (rbtree of extents, presumably) that is
> taking the place of reference counts in struct page. There aren't
> *that* many allocator entry points. I guess you'd need to avoid other
> modifications of free space like trimming :/. It still seems reasonably
> doable?

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory. Nice.

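A rough userspace sketch of the allocator check described above, to make
the idea concrete. Everything here is hypothetical: the names
(pin_table, pin_extent, alloc_block) are invented for illustration and
are not kernel APIs, a small fixed array stands in for the rbtree of
extents, and a bool array stands in for the on-disk free-space map. The
point it tries to show is that the pin record lives only in memory, so
the persistent free map never has to know about pins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical runtime-only record of a pinned range [start, start + len). */
struct pinned_extent {
	uint64_t start;
	uint64_t len;
};

#define MAX_PINNED 16

/* Stand-in for the rbtree of extents; never written to stable storage. */
struct pin_table {
	struct pinned_extent ext[MAX_PINNED];
	size_t count;
};

/* Record a pin (the g_u_p() side). Returns false if the table is full. */
static bool pin_extent(struct pin_table *t, uint64_t start, uint64_t len)
{
	if (t->count == MAX_PINNED || len == 0)
		return false;
	t->ext[t->count].start = start;
	t->ext[t->count].len = len;
	t->count++;
	return true;
}

/* Drop a pin (the put_page() side). Returns false if no such pin exists. */
static bool unpin_extent(struct pin_table *t, uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->ext[i].start == start && t->ext[i].len == len) {
			t->ext[i] = t->ext[t->count - 1];
			t->count--;
			return true;
		}
	}
	return false;
}

/* True if blk falls inside any pinned extent. */
static bool block_is_pinned(const struct pin_table *t, uint64_t blk)
{
	for (size_t i = 0; i < t->count; i++)
		if (blk >= t->ext[i].start &&
		    blk - t->ext[i].start < t->ext[i].len)
			return true;
	return false;
}

/*
 * Allocate the first block that is free in the (assumed) free map and
 * not pinned. Pinned blocks stay marked free; they are only skipped.
 * Returns the block number, or -1 if nothing is allocatable.
 */
static int64_t alloc_block(bool *free_map, uint64_t nblocks,
			   const struct pin_table *t)
{
	for (uint64_t blk = 0; blk < nblocks; blk++) {
		if (free_map[blk] && !block_is_pinned(t, blk)) {
			free_map[blk] = false;
			return (int64_t)blk;
		}
	}
	return -1;
}
```

In this sketch a truncated-but-pinned block is simply marked free in
free_map and skipped by alloc_block() until unpin_extent() is called.
Trimming and any other consumer of free space would need the same
block_is_pinned() check.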
> And hey, lord knows we love to implement rbtrees of extents in file
> systems! (btrfs: struct extent_state, ext4: struct extent_status)
>
> The tricky part would be maintaining that structure behind g_u_p() and
> put_page() calls. Probably a richer interface that gives callers
> something more than just raw page pointers.
>
> > 3. Make truncate() block if it hits a pinned page. There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone. We actually already have code
> > for this; inode_dio_wait() / inode_dio_done(). But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
>
> I have no concrete examples, but I agree that it sounds like the sort of
> thing that would bite us in the ass if we miss some use case :/.
>
> I guess my initial vote is for trying a less-than-perfect prototype of
> #2 to see just how hairy the rough outline gets.

Thinking about it now, it seems less hairy than I initially thought. I'll
give it a quick try and see how it goes.
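
For comparison, option 3 from the quoted list (truncate blocks while
pages are pinned) can be sketched in userspace along these lines. This
is only an illustration of the waiting pattern, loosely modelled on the
inode_dio_wait() / inode_dio_done() pairing mentioned above; struct
toy_inode and the toy_* functions are hypothetical names, and pthread
primitives stand in for whatever the kernel would actually use.

```c
#include <pthread.h>

/* Hypothetical stand-in for an inode with a count of outstanding pins. */
struct toy_inode {
	pthread_mutex_t lock;
	pthread_cond_t unpinned;	/* signalled when pin_count hits 0 */
	unsigned int pin_count;
	unsigned long size;
};

static void toy_inode_init(struct toy_inode *ino, unsigned long size)
{
	pthread_mutex_init(&ino->lock, NULL);
	pthread_cond_init(&ino->unpinned, NULL);
	ino->pin_count = 0;
	ino->size = size;
}

/* Take a pin, as a g_u_p() caller would. */
static void toy_pin(struct toy_inode *ino)
{
	pthread_mutex_lock(&ino->lock);
	ino->pin_count++;
	pthread_mutex_unlock(&ino->lock);
}

/* Release a pin and wake any truncate waiting for the last one. */
static void toy_unpin(struct toy_inode *ino)
{
	pthread_mutex_lock(&ino->lock);
	if (--ino->pin_count == 0)
		pthread_cond_broadcast(&ino->unpinned);
	pthread_mutex_unlock(&ino->lock);
}

/* Truncate waits until no pins remain, then changes the size. */
static void toy_truncate(struct toy_inode *ino, unsigned long new_size)
{
	pthread_mutex_lock(&ino->lock);
	while (ino->pin_count > 0)
		pthread_cond_wait(&ino->unpinned, &ino->lock);
	ino->size = new_size;
	pthread_mutex_unlock(&ino->lock);
}
```

The weakness discussed above is visible directly: if the holder of a pin
is long-lived (RDMA rather than O_DIRECT), toy_truncate() sits in
pthread_cond_wait() for as long as the pin is held, with no upper bound.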