linux-fsdevel.vger.kernel.org archive mirror
From: Matthew Wilcox <willy@linux.intel.com>
To: Zach Brown <zab@zabbo.net>
Cc: Matthew Wilcox <willy@linux.intel.com>, linux-fsdevel@vger.kernel.org
Subject: Re: direct_access, pinning and truncation
Date: Thu, 9 Oct 2014 12:44:44 -0400
Message-ID: <20141009164444.GO5098@wil.cx>
In-Reply-To: <20141008232132.GA10656@lenny.home.zabbo.net>

On Wed, Oct 08, 2014 at 04:21:32PM -0700, Zach Brown wrote:
> [... figuring out how g_u_p() references can prevent freeing and
> re-using the underlying mapped pmem addresses given the lack of struct
> pages for the mapping]
> 
> > I see three solutions here:
> > 
> > 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> > the caller with the struct pages of the DRAM.  Modify DAX to handle some
> > file pages being in the page cache, and make sure that we know whether
> > the PMEM or DRAM is up to date.  This has the obvious downside that
> > get_user_pages() becomes slow.
> 
> And serialize transitions and fs stores to pmem regions.  And now
> storing to dram-fronted pmem goes through all the dirtying and writeback
> machinery.  This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really.  It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.
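
To make the bookkeeping behind option 1 concrete, here is a minimal user-space sketch. It is not kernel code: struct file_page, the three state names, and the helpers pin_page(), store_byte() and writeback_page() are all hypothetical names invented for illustration. It shows only the per-page question raised above, namely whether PMEM or DRAM holds the current data, and leaves out the locking entirely.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SZ 4096

/* Hypothetical per-page state: which copy of the data is authoritative. */
enum page_state {
    PMEM_ONLY,   /* no DRAM copy; stores go straight to PMEM */
    DRAM_CLEAN,  /* a DRAM copy exists and matches PMEM */
    DRAM_DIRTY,  /* the DRAM copy is newer than PMEM and needs writeback */
};

struct file_page {
    enum page_state state;
    unsigned char *pmem;          /* stands in for the direct-access mapping */
    unsigned char dram[PAGE_SZ];  /* stands in for a page-cache page */
};

/* Model of get_user_pages(): copy PMEM into DRAM, hand out the DRAM copy. */
unsigned char *pin_page(struct file_page *p)
{
    if (p->state == PMEM_ONLY) {
        memcpy(p->dram, p->pmem, PAGE_SZ);
        p->state = DRAM_CLEAN;
    }
    return p->dram;
}

/* A store to a DRAM-fronted page has to dirty it instead of touching PMEM. */
void store_byte(struct file_page *p, size_t off, unsigned char val)
{
    assert(off < PAGE_SZ);
    if (p->state == PMEM_ONLY) {
        p->pmem[off] = val;
    } else {
        p->dram[off] = val;
        p->state = DRAM_DIRTY;
    }
}

/* Writeback: push a dirty DRAM copy back to PMEM. */
void writeback_page(struct file_page *p)
{
    if (p->state == DRAM_DIRTY) {
        memcpy(p->pmem, p->dram, PAGE_SZ);
        p->state = DRAM_CLEAN;
    }
}
```

In a real implementation each state transition would have to be serialised per page, which is the cost being argued about here.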

> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks.  We may be able to do better by
> > using a tree of pinned extents or something.  This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
> 
> This seems.. doable?  Recording the referenced pmem in free lists in the
> fs is fine as long as the pmem isn't modified until the references are
> released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

> Maybe in the allocator you skip otherwise free blocks if they intersect
> with the run time structure (rbtree of extents, presumably) that is
> taking the place of reference counts in struct page.  There aren't
> *that* many allocator entry points.  I guess you'd need to avoid other
> modifications of free space like trimming :/.  It still seems reasonably
> doable?

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory.  Nice.
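
A rough user-space sketch of that idea, under loud assumptions: a plain linked list stands in for the rbtree of extents, a bool array stands in for the on-disk free-space bitmap, and every name here (struct pinned_extent, struct pin_set, pin_blocks(), unpin_blocks(), range_is_pinned(), alloc_block()) is hypothetical. The pin records live only in memory, so they vanish on reboot, and the allocator skips free blocks that still intersect a pin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory record of one pinned range [start, start + len). */
struct pinned_extent {
    uint64_t start;
    uint64_t len;
    struct pinned_extent *next;
};

/* Runtime-only state; nothing in here is ever written to disk. */
struct pin_set {
    struct pinned_extent *head;
};

/* Record a pin. The returned handle is what a caller would hold
 * in place of a raw struct page pointer. Returns NULL on allocation failure. */
struct pinned_extent *pin_blocks(struct pin_set *ps, uint64_t start, uint64_t len)
{
    assert(len > 0);
    struct pinned_extent *pe = malloc(sizeof(*pe));
    if (!pe)
        return NULL;
    pe->start = start;
    pe->len = len;
    pe->next = ps->head;
    ps->head = pe;
    return pe;
}

/* Drop a pin previously returned by pin_blocks(). */
void unpin_blocks(struct pin_set *ps, struct pinned_extent *pe)
{
    for (struct pinned_extent **pp = &ps->head; *pp; pp = &(*pp)->next) {
        if (*pp == pe) {
            *pp = pe->next;
            free(pe);
            return;
        }
    }
}

/* True if [start, start + len) overlaps any pinned extent. */
bool range_is_pinned(const struct pin_set *ps, uint64_t start, uint64_t len)
{
    for (const struct pinned_extent *pe = ps->head; pe; pe = pe->next)
        if (start < pe->start + pe->len && pe->start < start + len)
            return true;
    return false;
}

/* Allocator entry point: take the first block that is free in the bitmap
 * and not pinned. Returns the block number, or -1 if none qualifies. */
int64_t alloc_block(const struct pin_set *ps, bool *in_use, uint64_t nblocks)
{
    for (uint64_t b = 0; b < nblocks; b++) {
        if (!in_use[b] && !range_is_pinned(ps, b, 1)) {
            in_use[b] = true;
            return (int64_t)b;
        }
    }
    return -1;
}
```

A block that truncate has already marked free but that is still pinned stays unavailable until unpin_blocks() is called; a real filesystem would need the same check on every path that modifies free space, including trim.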

> And hey, lord knows we love to implement rbtrees of extents in file
> systems!  (btrfs: struct extent_state, ext4: struct extent_status)
> 
> The tricky part would be maintaining that structure behind g_u_p() and
> put_page() calls.  Probably a richer interface that gives callers
> something more than just raw page pointers.
> 
> > 3. Make truncate() block if it hits a pinned page.  There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone.  We actually already have code
> > for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
> 
> I have no concrete examples, but I agree that it sounds like the sort of
> thing that would bite us in the ass if we miss some use case :/.
> 
> I guess my initial vote is for trying a less-than-perfect prototype of
> #2 to see just how hairy the rough outline gets.

Thinking about it now, it seems less hairy than I initially thought.  I'll
give it a quick try and see how it goes.


Thread overview: 14+ messages
2014-10-08 19:05 direct_access, pinning and truncation Matthew Wilcox
2014-10-08 23:21 ` Zach Brown
2014-10-09 16:44   ` Matthew Wilcox [this message]
2014-10-09 19:14     ` Zach Brown
2014-10-10 10:01       ` Jan Kara
2014-10-09  1:10 ` Dave Chinner
2014-10-09 15:25   ` Matthew Wilcox
2014-10-13  1:19     ` Dave Chinner
2014-10-19  9:51     ` Boaz Harrosh
2014-10-10 13:08 ` Jan Kara
2014-10-10 14:24   ` Matthew Wilcox
2014-10-19 11:08     ` Boaz Harrosh
2014-10-19 23:01       ` Dave Chinner
2014-10-21  9:17         ` Boaz Harrosh
