From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
linux-fsdevel@vger.kernel.org,
"Darrick J. Wong" <darrick.wong@oracle.com>,
Christoph Hellwig <hch@lst.de>, Jan Kara <jack@suse.cz>
Subject: Re: VFS caching of file extents
Date: Thu, 29 Aug 2024 16:05:36 +1000
Message-ID: <ZtAPsMcc3IC1VaAF@dread.disaster.area>
In-Reply-To: <Zs9_l1w0SuJO4ZbO@casper.infradead.org>

On Wed, Aug 28, 2024 at 08:50:47PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 28, 2024 at 03:46:34PM -0400, Chuck Lever wrote:
> > On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> > > There are a few problems I think this can solve. One is efficient
> > > implementation of NFS READPLUS.
> >
> > To expand on this, we're talking about the Linux NFS server's
> > implementation of the NFSv4.2 READ_PLUS operation, which is
> > specified here:
> >
> > https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10
> >
> > The READ_PLUS operation can return an array of content segments that
> > include regular data, holes in the file, or data patterns. Knowing
> > how the filesystem stores a file would help NFSD identify where it
> > can return a representation of a hole rather than a string of actual
> > zeroes, for instance.
>
> Thanks for the reference; I went looking for it and found only the
> draft.
>
> Another thing this could help with is reducing page cache usage for
> very sparse files. Today if we attempt to read() or page fault on a
> file hole, we allocate a fresh page of memory and ask the filesystem to
> fill it. The filesystem notices that it's a hole and calls memset().
> If the VFS knew that the extent was a hole, it could use the shared zero
> page instead. Don't know how much of a performance win this would be,
> but it might be useful.

Ah. OK. Maybe I see the reason you are asking this question now.

Buffered reads are still based on the old page-cache-first IO
mechanisms, and so doing smart stuff with "filesystem things"
is difficult.

i.e. readahead allocates folios for the readahead range before it
asks the filesystem to do the readahead IO, so it is unaware of how
the file is laid out. Hence it can't do smart things with holes.

And it paints the filesystems into a corner, too, because they can't
modify the set of folios that they need to fill with data. Hence
the filesystem can't do smart things with holes or unwritten
extents, either.

To solve this, the proposal being made is to lift the filesystem
mapping information up into "the VFS" so that the existing buffered
read code has awareness of the file mapping. That allows this page
cache code to do smarter things, e.g. special-case folio
instantiation w.r.t. sparse files (amongst other things).

Have I got that right?

If so, then we've been here before, and we've solved these problems
by inverting the IO path operations. i.e. we do filesystem mapping
operations first, then populate the page cache based on the mapping
being returned.

This is how the iomap buffered write path works, and that's the
reason it supports all the modern filesystem goodies relatively
easily.

The exception to this model in iomap is buffered reads (i.e.
readahead). We still just do what the page cache asks us to do, and
clearly that is now starting to hurt us in the same way the page
cache centric IO model was hurting us for buffered writes a decade
ago.

So, can we invert readahead like we did with buffered writes? That
is, we hand the readahead window that needs to be filled (i.e. a
{mapping, pos, len} tuple) to the filesystem (iomap) which can then
iterate mappings over the readahead range. iomap_iter_readahead()
can then populate the page cache with appropriately sized folios and
do the IO, or use the zero page when over a hole or unwritten
extent...
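
As a rough illustration of the mapping-first readahead described
above, here is a userspace toy model in C. It is a sketch under
stated assumptions, not kernel code: struct extent, find_extent() and
model_readahead() are hypothetical names invented for this example,
and it does not implement the proposed iomap_iter_readahead(). It only
shows the iteration order: look up the mapping at the current position
first, then decide per extent whether that part of the window would
need real folios and IO or could be backed by the zero page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these types or functions are kernel APIs. */
enum ext_type { EXT_HOLE, EXT_UNWRITTEN, EXT_MAPPED };

struct extent {              /* one contiguous mapping of the file */
	uint64_t pos;        /* file offset where the extent starts */
	uint64_t len;        /* length in bytes */
	enum ext_type type;
};

struct ra_stats {
	uint64_t zero_bytes; /* bytes that could use the shared zero page */
	uint64_t io_bytes;   /* bytes needing folio allocation and read IO */
};

/* Return the extent containing pos, or NULL if nothing maps it. */
const struct extent *find_extent(const struct extent *exts, size_t n,
				 uint64_t pos)
{
	for (size_t i = 0; i < n; i++)
		if (pos >= exts[i].pos && pos - exts[i].pos < exts[i].len)
			return &exts[i];
	return NULL;
}

/*
 * Walk the readahead window [pos, pos + len) one mapping at a time.
 * The mapping lookup happens before any "folio" decision is made,
 * which is the inversion being discussed.
 */
struct ra_stats model_readahead(const struct extent *exts, size_t n,
				uint64_t pos, uint64_t len)
{
	struct ra_stats st = { 0, 0 };
	uint64_t end;

	assert(len <= UINT64_MAX - pos); /* window must not wrap */
	end = pos + len;

	while (pos < end) {
		const struct extent *e = find_extent(exts, n, pos);
		uint64_t chunk;

		if (!e)
			break; /* no mapping here: stop, as at EOF */

		/* Clamp this step to the extent and to the window. */
		chunk = e->pos + e->len - pos;
		if (chunk > end - pos)
			chunk = end - pos;

		if (e->type == EXT_MAPPED)
			st.io_bytes += chunk;   /* allocate folios, do IO */
		else
			st.zero_bytes += chunk; /* hole or unwritten */

		pos += chunk;
	}
	return st;
}
```

In a page-cache-first model the whole window would be counted as
io_bytes, because the folios are allocated before the layout is known.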

Note that optimisations like zero-page-over-holes also need write
path changes. We'd need to change iomap_get_folio() to tell
__filemap_get_folio() to replace zero pages with newly allocated
writeable folios during write operations...
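
The write-side requirement can be sketched the same way. Again this
is a userspace toy under stated assumptions, not the kernel
implementation: struct toy_cache, toy_read_hole() and
toy_get_for_write() are hypothetical names, and nothing here reflects
the real signatures of iomap_get_folio() or __filemap_get_folio(). It
only shows the rule that a write must never be handed the shared zero
page, so the cache slot is swapped for a private, zero-filled page
first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SZ  4096
#define NR_SLOTS 8

/* Shared, read-only page of zeroes (toy stand-in for the zero page). */
static const unsigned char zero_page[PAGE_SZ] = { 0 };

/* Toy page cache: one pointer per page index, NULL = nothing cached. */
struct toy_cache {
	const unsigned char *slot[NR_SLOTS];
};

/* Read over a hole: point the slot at the shared zero page. */
const unsigned char *toy_read_hole(struct toy_cache *c, size_t idx)
{
	assert(idx < NR_SLOTS);
	if (!c->slot[idx])
		c->slot[idx] = zero_page;
	return c->slot[idx];
}

/*
 * Write path: if the slot is empty or holds the shared zero page,
 * replace it with a newly allocated, zero-filled private page.
 * Returns NULL if the allocation fails.
 */
unsigned char *toy_get_for_write(struct toy_cache *c, size_t idx)
{
	assert(idx < NR_SLOTS);
	if (!c->slot[idx] || c->slot[idx] == zero_page) {
		unsigned char *p = calloc(1, PAGE_SZ);

		if (!p)
			return NULL;
		c->slot[idx] = p;
	}
	/* Private pages come from calloc(), so dropping const is safe. */
	return (unsigned char *)c->slot[idx];
}
```

After the swap, later reads of that index see the private page, and
the shared zero page is never modified.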

-Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 10+ messages
2024-08-28 19:34 VFS caching of file extents Matthew Wilcox
2024-08-28 19:46 ` Chuck Lever
2024-08-28 19:50 ` Matthew Wilcox
2024-08-29 6:05 ` Dave Chinner [this message]
2024-08-28 20:30 ` Josef Bacik
2024-08-28 23:46 ` Dave Chinner
2024-08-29 1:57 ` Darrick J. Wong
2024-08-29 4:00 ` Christoph Hellwig
2024-08-29 13:52 ` Chuck Lever III
2024-08-29 22:36 ` Dave Chinner