linux-fsdevel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <matthew@wil.cx>
Cc: Theodore Ts'o <tytso@mit.edu>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4
Date: Mon, 23 Dec 2013 17:56:41 +1100
Message-ID: <20131223065641.GI3220@dastard>
In-Reply-To: <20131223034554.GA11091@parisc-linux.org>

On Sun, Dec 22, 2013 at 08:45:54PM -0700, Matthew Wilcox wrote:
> On Mon, Dec 23, 2013 at 02:36:41PM +1100, Dave Chinner wrote:
> > What I'm trying to say is that I think the whole idea that XIP is
> > separate from the page cache is completely the wrong way to go about
> > fixing it. XIP should simply be a method of mapping backing device
> > pages into the existing per-inode mapping tree.  If we need to
> > encode, remap, etc because of constraints of the configuration (be
> > it filesystem implementation or block device encodings) then we just
> > use the normal buffered IO path, with the ->writepages path hitting
> > the block layer to do the memcpy or encoding into persistent
> > memory. Otherwise we just hit the direct IO path we've been talking
> > about up to this point...
> 
> That's a very filesystem person way of thinking about the problem :-)
> The problem is that you've now pushed it off on the MM people.

I didn't comment on this before, but now I've had a bit of time to
think about it, it's become obvious to me that there is a
fundamental disconnect here.  At the risk of stating the obvious,
persistent memory is just memory and someone has to manage it.

I'll state up front that I do spend a fair bit of time in memory
management code - all the shrinker scaling for NUMA systems that
landed recently was stuff I originally wrote. I'm spending time
reviewing patches to get memcg awareness into the shrinkers and
filesystem caches.  Persistent memory has a lot of overlap between
the MM and FS subsystems, just like shrinkers overlap lots of
different subsystems...

So from a filesystem perspective, we move data in and out of pages
of memory that are managed by the memory management subsystem, and
we move that data to and from filesystem blocks via an IO path.

The management of the memory that filesystems use is actually
the responsibility of the memory management subsystem - allocation,
reclaim, tracking, etc are all handled by the mm subsystem. That has
tendrils down into filesystem code - writeback for cleaning pages,
shrinkers for freeing inodes, dentries and other filesystem caches,
etc.

Persistent memory may be physically different to volatile memory,
but it is still exposed as byte addressable, mappable pages of
memory to the OS. Hence it could be treated in exactly the same way
that volatile memory pages are treated.

That is, a persistent memory device could be considered to be a
block device with a page sized sector. i.e. a 1:1 mapping between
the block device address space and the persistent memory page. A
filesystem tracks sectors in the block device address space with
filesystem metadata to expose the storage in a namespace, but that's
not the same thing as managing how persistent memory is
exposed to virtual addresses in userspace. The former is data
indexing, the latter is data access.
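
To make the 1:1 mapping concrete, here is a minimal userspace sketch.
It is not kernel code: struct pmem_dev, PMEM_PAGE_SIZE and both helper
functions are hypothetical names invented for illustration, and the
4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMEM_PAGE_SIZE 4096u  /* assumed page size, also the "sector" size */

/* Hypothetical device: one flat range of byte-addressable memory. */
struct pmem_dev {
    uint8_t *base;      /* start of the mapped persistent memory */
    uint64_t nr_pages;  /* device capacity in pages */
};

/* 1:1 mapping: block N of the device is page N of the memory. */
static void *pmem_block_to_addr(const struct pmem_dev *dev, uint64_t block)
{
    if (block >= dev->nr_pages)
        return NULL;  /* block is past the end of the device */
    return dev->base + block * PMEM_PAGE_SIZE;
}

/* Inverse: any address inside the device maps back to exactly one block. */
static uint64_t pmem_addr_to_block(const struct pmem_dev *dev, const void *addr)
{
    size_t off = (size_t)((const uint8_t *)addr - dev->base);

    assert(off < dev->nr_pages * PMEM_PAGE_SIZE);
    return off / PMEM_PAGE_SIZE;
}
```

Nothing here knows about files or names. That is the point: which
block belongs to which file is the filesystem's indexing job, and it
sits on top of this trivial translation.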

In terms of data indexing, the inode mapping tree is used to track
the relationship between the file offset of the user data, the
memory backing the data and the block index in the filesystem. That
relationship is read from filesystem metadata.

For data access, the memory backing the data is tracked via
a struct page allocated out of volatile system memory. To get that
data to/from the backing storage, we need to perform an IO
operation on the memory backing the data, and we determine where to
get that from via the data index...
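
As a rough illustration of that split, the sketch below models the
conventional buffered case in userspace C. Everything in it is a
stand-in: struct toy_inode, struct map_entry and the toy_* helpers are
hypothetical, a fixed array plays the role of the mapping tree, a
single contiguous extent plays the role of filesystem metadata, and
memcpy() plays the role of the IO operation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ   4096u
#define MAX_PAGES 8u    /* toy limit on file size, in pages */

/* One slot of the per-inode mapping: the index entry for one file page. */
struct map_entry {
    uint8_t *page;   /* volatile memory backing the data, NULL if not cached */
    uint64_t block;  /* block index in the filesystem */
};

struct toy_inode {
    uint64_t extent_start;            /* stand-in for fs metadata: one extent */
    struct map_entry map[MAX_PAGES];  /* stand-in for the mapping tree */
};

/* Data indexing: file offset -> filesystem block, read from "metadata". */
static uint64_t toy_offset_to_block(const struct toy_inode *ino, uint64_t off)
{
    return ino->extent_start + off / PAGE_SZ;
}

/* Data access: find the cached page for an offset, filling it on a miss. */
static uint8_t *toy_get_page(struct toy_inode *ino, const uint8_t *disk,
                             uint64_t off)
{
    uint64_t idx = off / PAGE_SZ;
    struct map_entry *e;

    assert(ino && disk);
    if (idx >= MAX_PAGES)
        return NULL;
    e = &ino->map[idx];
    if (!e->page) {
        e->block = toy_offset_to_block(ino, off);
        e->page = malloc(PAGE_SZ);  /* page comes from volatile memory */
        if (!e->page)
            return NULL;
        /* The "IO operation": copy from backing storage into the page. */
        memcpy(e->page, disk + e->block * PAGE_SZ, PAGE_SZ);
    }
    return e->page;
}
```

The index entry records both the memory and the block, but the memory
is always a separate copy of what is on the backing storage.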

In the case of XIP, we still have the same data index relationship.
The difference is in the data access - XIP gets the backing memory
from the block device rather than from the VM's free memory.
However, we don't get a struct page - we get an opaque handle we
cannot use for data indexing purposes, and hence we need unique IO
paths to deal with this difference.

If the persistent memory device can hand us struct pages rather than
mapped memory handles, we don't need to change our data indexing
methods, nor do we need to change the way data in the page cache is
accessed. mmap() gets direct access, just like the current XIP, but
we can use all of the smarts filesystems have for optimal
block allocation.

Further, if the persistent memory device implements an IO path
(->make_request) like brd does (brd_make_request), then we get double
buffered persistent memory that we can use for things like stacked
IO devices that encode the data that is being stored. It all ends up
completely transparent to the filesystem, the mm subsystem, the
users, etc. XIP just works automatically when it can, otherwise it
just behaves like a really fast block device....
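
A small userspace sketch of that "direct when possible, buffered
otherwise" behaviour follows. It is only an illustration under stated
assumptions: struct toy_pmem, struct toy_page, the needs_encode flag
and toy_map_block() are all hypothetical, and a plain memcpy() stands
in for whatever encoding a stacked device would really perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096u

/* Hypothetical persistent memory device. */
struct toy_pmem {
    uint8_t *base;         /* byte-addressable device memory */
    bool     needs_encode; /* true if stored data must be transformed */
};

/* What the mapping tree would hold for one file page. */
struct toy_page {
    uint8_t *data;    /* memory the caller reads and writes */
    bool     direct;  /* true: data aliases device memory (XIP) */
};

/* Map one device block: directly if possible, else via a volatile copy. */
static int toy_map_block(struct toy_pmem *dev, uint64_t block,
                         struct toy_page *pg)
{
    uint8_t *src;

    assert(dev && pg);
    src = dev->base + block * PAGE_SZ;

    if (!dev->needs_encode) {
        /* XIP case: hand out the device's own page, no copy made. */
        pg->data = src;
        pg->direct = true;
        return 0;
    }

    /* Buffered case: a second, volatile buffer holds the decoded data. */
    pg->data = malloc(PAGE_SZ);
    if (!pg->data)
        return -1;
    memcpy(pg->data, src, PAGE_SZ);  /* stand-in for decode on read */
    pg->direct = false;
    return 0;
}
```

The caller sees a struct toy_page either way. Only the direct flag
says whether a store lands in persistent memory immediately or has to
be written back later through the IO path.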

IOWs, I don't see XIP as something that should be tacked on to the
side of the filesystems and bypass the normal IO paths. It's
something that should be integrated directly and used automatically
if it can be used. And that requires persistent memory to be treated
as pages just like volatile memory.

That's how I see persistent memory fitting into the FS/MM world. It
needs help from both the FS and MM subsystems, and to try to
shoe-horn it completely into one or the other just won't work in the
long run.

The reality is that you're on a steep learning curve here, Willy.
What filesystems do and the way they interact with the MM subsystem
is a whole lot more complex than you realised.  I know that
XIP is not a new concept (I was writing XIP stuff 20 years ago on 68000s
with a whole 6MB of battery backed SRAM), but filesystems and the
page cache have got a whole lot more complex since ext2....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 36+ messages
2013-12-17 19:18 [PATCH v3 0/3] Add XIP support to ext4 Matthew Wilcox
2013-12-17 19:18 ` [PATCH v3 1/3] Fix XIP fault vs truncate race Matthew Wilcox
2013-12-17 19:18 ` [PATCH v3 2/3] xip: Add xip_zero_page_range Matthew Wilcox
2013-12-17 19:18 ` [PATCH v3 3/3] ext4: Add XIP functionality Matthew Wilcox
2013-12-17 22:30 ` [PATCH v3 0/3] Add XIP support to ext4 Dave Chinner
2013-12-18  2:31   ` Matthew Wilcox
2013-12-18  5:01     ` Theodore Ts'o
2013-12-18 14:27       ` Matthew Wilcox
2013-12-19  2:07         ` Theodore Ts'o
2013-12-19  4:12           ` Matthew Wilcox
2013-12-19  4:37             ` Dave Chinner
2013-12-19  5:43             ` Theodore Ts'o
2013-12-19 15:20               ` Matthew Wilcox
2013-12-19 16:17                 ` Theodore Ts'o
2013-12-19 17:12                   ` Matthew Wilcox
2013-12-19 17:18                     ` Theodore Ts'o
2013-12-20 18:17                       ` Matthew Wilcox
2013-12-20 19:34                         ` Theodore Ts'o
2013-12-20 20:11                           ` Matthew Wilcox
2013-12-23  3:36                             ` Dave Chinner
2013-12-23  3:45                               ` Matthew Wilcox
2013-12-23  4:32                                 ` Dave Chinner
2013-12-23  6:56                                 ` Dave Chinner [this message]
2013-12-23 14:51                                   ` Theodore Ts'o
2013-12-23  3:16                         ` Dave Chinner
2013-12-24 16:27                           ` Matthew Wilcox
2013-12-18 12:33     ` Dave Chinner
2013-12-18 15:22       ` Matthew Wilcox
2013-12-19  0:48         ` Dave Chinner
2013-12-19  1:05           ` Matthew Wilcox
2013-12-19  1:58             ` Dave Chinner
2013-12-19 15:32               ` Matthew Wilcox
2013-12-19 23:46                 ` Dave Chinner
2013-12-20 16:45                   ` Matthew Wilcox
2013-12-23  4:14                     ` Dave Chinner
2013-12-18 18:13   ` Eric Sandeen
