* VFS caching of file extents
From: Matthew Wilcox @ 2024-08-28 19:34 UTC (permalink / raw)
To: linux-fsdevel
Cc: Dave Chinner, Darrick J. Wong, Christoph Hellwig, Chuck Lever,
Jan Kara
Today it is the responsibility of each filesystem to maintain the mapping
from file logical addresses to disk blocks (*). There are various ways
to query that information, eg calling get_block() or using iomap.
What if we pull that information up into the VFS? Filesystems obviously
_control_ that information, so need to be able to invalidate entries.
And we wouldn't want to store all extents in the VFS all the time, so
would need to have a way to call into the filesystem to populate ranges
of files. We'd need to decide how to lock/protect that information
-- a per-file lock? A per-extent lock? No locking, just a seqcount?
We need a COW bit in the extent which tells the user that this extent
is fine for reading through, but if there's a write to be done then the
filesystem needs to be asked to create a new extent.
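To make that a bit more concrete, here's a strawman of what a cached
entry and the filesystem hooks might look like.  Every name below is
invented; nothing like this exists in the VFS today:

#include <linux/fs.h>
#include <linux/types.h>

/* Strawman only -- nothing like this exists in the VFS today. */
struct vfs_extent {
	loff_t		ve_lstart;	/* file offset this extent starts at */
	u64		ve_len;		/* length in bytes */
	sector_t	ve_pstart;	/* disk address, if mapped */
	unsigned int	ve_flags;
};

#define VFS_EXTENT_HOLE		(1U << 0)	/* reads see zeroes */
#define VFS_EXTENT_UNWRITTEN	(1U << 1)	/* allocated, reads see zeroes */
#define VFS_EXTENT_COW		(1U << 2)	/* fine to read through; ask
						 * the fs before writing */

/* How the filesystem would populate and invalidate ranges of the cache. */
struct vfs_extent_ops {
	int	(*populate)(struct inode *inode, loff_t pos, u64 len);
	void	(*invalidate)(struct inode *inode, loff_t pos, u64 len);
};

The locking question above is really about what protects a reader of
ve_flags against a concurrent invalidate.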
There are a few problems I think this can solve. One is efficient
implementation of NFS READPLUS. Another is the callback from iomap
to the filesystem when doing buffered writeback. A third is having a
common implementation of FIEMAP. I've heard rumours that FUSE would like
something like this, and maybe there are other users that would crop up.
Anyway, this is as far as my thinking has got on this topic for now.
Maybe there's a good idea here, maybe it's all a huge overengineered mess
waiting to happen. I'm sure other people know this area of filesystems
better than I do.
(*) For block device filesystems. Obviously network filesystems and
synthetic filesystems don't care and can stop reading now. Umm, unless
maybe they _want_ to use it, eg maybe there's a sharded thing going on and
the fs wants to store information about each shard in the extent cache?
* Re: VFS caching of file extents
From: Chuck Lever @ 2024-08-28 19:46 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, Dave Chinner, Darrick J. Wong, Christoph Hellwig,
Jan Kara
On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*). There are various ways
> to query that information, eg calling get_block() or using iomap.
>
> What if we pull that information up into the VFS? Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files. We'd need to decide how to lock/protect that information
> -- a per-file lock? A per-extent lock? No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
>
> There are a few problems I think this can solve. One is efficient
> implementation of NFS READPLUS.
To expand on this, we're talking about the Linux NFS server's
implementation of the NFSv4.2 READ_PLUS operation, which is
specified here:
https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10
The READ_PLUS operation can return an array of content segments that
include regular data, holes in the file, or data patterns. Knowing
how the filesystem stores a file would help NFSD identify where it
can return a representation of a hole rather than a string of actual
zeroes, for instance.
> Another is the callback from iomap
> to the filesystem when doing buffered writeback. A third is having a
> common implementation of FIEMAP. I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
>
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen. I'm sure other people know this area of filesystems
> better than I do.
>
> (*) For block device filesystems. Obviously network filesystems and
> synthetic filesystems don't care and can stop reading now. Umm, unless
> maybe they _want_ to use it, eg maybe there's a sharded thing going on and
> the fs wants to store information about each shard in the extent cache?
>
--
Chuck Lever
* Re: VFS caching of file extents
From: Matthew Wilcox @ 2024-08-28 19:50 UTC (permalink / raw)
To: Chuck Lever
Cc: linux-fsdevel, Dave Chinner, Darrick J. Wong, Christoph Hellwig,
Jan Kara
On Wed, Aug 28, 2024 at 03:46:34PM -0400, Chuck Lever wrote:
> On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> > There are a few problems I think this can solve. One is efficient
> > implementation of NFS READPLUS.
>
> To expand on this, we're talking about the Linux NFS server's
> implementation of the NFSv4.2 READ_PLUS operation, which is
> specified here:
>
> https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10
>
> The READ_PLUS operation can return an array of content segments that
> include regular data, holes in the file, or data patterns. Knowing
> how the filesystem stores a file would help NFSD identify where it
> can return a representation of a hole rather than a string of actual
> zeroes, for instance.
Thanks for the reference; I went looking for it and found only the
draft.
Another thing this could help with is reducing page cache usage for
very sparse files. Today if we attempt to read() or page fault on a
file hole, we allocate a fresh page of memory and ask the filesystem to
fill it. The filesystem notices that it's a hole and calls memset().
If the VFS knew that the extent was a hole, it could use the shared zero
page instead. Don't know how much of a performance win this would be,
but it might be useful.
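For the plain read() side the short-circuit might look roughly like
this, reusing the strawman vfs_extent from the first mail;
vfs_extent_lookup() is equally imaginary, and the page fault side
would still want the shared zero page:

/*
 * Illustrative only: if the cached extent says "hole", hand zeroes
 * straight back to the caller without touching the page cache.
 * vfs_extent_lookup() is an imaginary cache query.
 */
static ssize_t read_hole_aware(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	struct vfs_extent ext;

	if (!vfs_extent_lookup(inode, iocb->ki_pos, &ext) &&
	    (ext.ve_flags & VFS_EXTENT_HOLE)) {
		size_t len = min_t(u64, iov_iter_count(to),
				   ext.ve_lstart + ext.ve_len - iocb->ki_pos);

		/* No folio allocation, no memset() in the filesystem. */
		len = iov_iter_zero(len, to);
		iocb->ki_pos += len;
		return len;
	}

	return filemap_read(iocb, to, 0);	/* normal page cache path */
}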
* Re: VFS caching of file extents
From: Josef Bacik @ 2024-08-28 20:30 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, Dave Chinner, Darrick J. Wong, Christoph Hellwig,
Chuck Lever, Jan Kara
On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*). There are various ways
> to query that information, eg calling get_block() or using iomap.
>
> What if we pull that information up into the VFS? Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files. We'd need to decide how to lock/protect that information
> -- a per-file lock? A per-extent lock? No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
>
At least for btrfs we store a lot of things in our extent map, so I'm not sure
if everybody wants to share the overhead of the amount of information we keep
cached in these entries.
We also protect all of that with an extent lock, and again I'm not entirely
sure that everybody wants to adopt our extent locking scheme. If we pushed
the locking responsibility down into the filesystem then hooray, but that
makes the generic implementation more complex.
> There are a few problems I think this can solve. One is efficient
> implementation of NFS READPLUS. Another is the callback from iomap
> to the filesystem when doing buffered writeback. A third is having a
> common implementation of FIEMAP. I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
>
For us, we actually stopped using our in-memory cache for FIEMAP because it
ended up being way slower and kind of a pain to work with, given all the
different ways we update the cache based on I/O happening. Our FIEMAP
implementation just reads the extents on disk because it's easier/cleaner to
just walk through the btree than the cache.
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen. I'm sure other people know this area of filesystems
> better than I do.
Maybe it's fine for simpler file systems, and it could probably be argued that
btrfs is a bit over-engineered in this case, but I worry it'll turn into one of
those "this seemed like a good idea at the time, but after we added all the
features everybody needed we ended up with something way more complex"
scenarios. Thanks,
Josef
* Re: VFS caching of file extents
From: Dave Chinner @ 2024-08-28 23:46 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, Darrick J. Wong, Christoph Hellwig, Chuck Lever,
Jan Kara
On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*). There are various ways
> to query that information, eg calling get_block() or using iomap.
>
> What if we pull that information up into the VFS?
We explicitly pulled that information out of the VFS by moving away
from per-page bufferheads that stored the disk mapping for the
cached data to the on-demand query based iomap infrastructure.
> Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
Which is one of the reasons for keeping it out of the VFS...
Besides, which set of mapping information that the filesystem holds are we
talking about here?
FYI: XFS has *three* sets of mapping information per inode - the
data fork, the xattr fork and the COW fork. The data fork and the
COW fork both reference file data mappings, and they can overlap
whilst there are COW operations ongoing. Regular files can also have
xattr fork mappings.
Further, directories and symlinks have both data and xattr fork
based mappings, and they do not use the VFS for caching metadata -
that is all internal to the filesystem. Hence if we move to caching
mapping information in the VFS, we have to expose the VFS inode all
the way down into the lowest layers of the XFS metadata subsystem
when there is absolutely nothing that is Linux/VFS specific.
IOWs, if we don't cache mapping information natively in the
filesystem, we are forcing filesystems to drill VFS structures deep
into their internal metadata implementations. Hence if you're
thinking purely about caching file data mappings at the VFS, then
what you're asking filesystems to support is multiple extent map
caching schemes instead of just one.
And I'm largely ignoring the transactional change requirements
for extent maps, and how a VFS cache would place the cache locking
both above and below transaction boundaries. And then there's the
inevitable shrinker interface for reclaim of cached VFS extent maps
and the placement of the locking both above and below memory
allocation. That's a recipe for lockdep false positives all over the
place...
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files.
This would require substantial modification to filesystems like XFS
that assume the mapping cache is always fully populated before a
lookup or modification is done. It's not just a case of "make sure
this range is populated", it's also a change of the entire locking
model for extent map access because cache population under a shared
lock is inherently racy.
> We'd need to decide how to lock/protect that information
> -- a per-file lock? A per-extent lock? No locking, just a seqcount?
Right now XFS uses a private per-inode metadata rwsem for exclusion,
and we generally don't have terrible contention problems with that
strategy. Other filesystems use private rwsems, too, but often they
only protect mapping operations, not all the metadata in the inode.
Others use per-extent locking.
As such, I'm not sure there is a "one size fits all" model here...
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
It's more than that - we need somewhere to hold the COW extent
mappings that we've allocated and overlap existing data mappings.
We do delayed allocation and/or preallocation with allocate-around
for COW to minimise fragmentation. Hence we have concurrent mappings
for the same file range for the existing data and where the dirty
cached data is going to end up being placed when it is finally
written. And then on IO completion we do the transactional update
to punch out the old data extent and swap in the new data extent
from the COW fork where we just wrote the new data to.
IOWs, managing COW mappings is much more complex than a simple flag
that says "this range needs allocation on writeback". Yes, we can do
unwritten extents like that (i.e. a simple flag in the extent to
say "do unwritten extent conversion on IO completion"), but COW is
much, much more complex...
> There are a few problems I think this can solve. One is efficient
> implementation of NFS READPLUS.
How "inefficient" is an iomap implementation? It iterates one extent
at a time, and a readplus iterator can simply encode data and holes
as it queries the range one extent at a time, right?
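Something like this, say -- encode_hole()/encode_data() are invented
stand-ins for the NFSD XDR encoding, and this ignores the unwritten
extent caveat that comes up later in the thread:

#include <linux/iomap.h>

/* Process one mapping: emit a HOLE or DATA segment covering it. */
static s64 read_plus_iter(struct iomap_iter *iter)
{
	u64 len = iomap_length(iter);
	int error;

	if (iter->iomap.type == IOMAP_HOLE ||
	    iter->iomap.type == IOMAP_UNWRITTEN)
		error = encode_hole(iter->pos, len);		  /* invented */
	else
		error = encode_data(iter->inode, iter->pos, len); /* invented */

	return error ? error : len;
}

static int read_plus_sketch(struct inode *inode, loff_t pos, u64 count,
			    const struct iomap_ops *ops)
{
	struct iomap_iter iter = {
		.inode	= inode,
		.pos	= pos,
		.len	= count,
		.flags	= IOMAP_REPORT,
	};
	int ret;

	while ((ret = iomap_iter(&iter, ops)) > 0)
		iter.processed = read_plus_iter(&iter);
	return ret;
}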
> Another is the callback from iomap
> to the filesystem when doing buffered writeback.
Filesystems need to do COW setup work or delayed allocation here, so
we have to call into the filesystem regardless of whether there is a
VFS mapping cache or not.
In that case the callout requires exclusive locking, but if it's an
overwrite, the callout only needs shared locking. But until we call
into the filesystem we don't know what operation we have to perform
or which type of locks we have to take because the extent map can
change until we hold the internal extent map lock...
Fundamentally, I don't want operations like truncate, hole punch,
etc to have to grow *another* lock. We currently have to take
the inode lock, the invalidate lock and internal metadata locks
to lock everything out. With an independent mapping cache, we're
also going to have to take that lock as well, especially if
things like writeback only use the mapping cache lock.
> A third is having a
> common implementation of FIEMAP.
We've already got that with iomap.
> I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
>
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen. I'm sure other people know this area of filesystems
> better than I do.
Caching mapping state in the VFS has proven to be less than ideal in
the past for reasons of coherency and resource usage. We've
explicitly moved away from that model to an extent-query model with
iomap, and right now I'm not seeing any advantages or additional
functionality that caching extent maps in the VFS would bring over
the existing iomap model...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: VFS caching of file extents
From: Darrick J. Wong @ 2024-08-29 1:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, Dave Chinner, Darrick J. Wong, Christoph Hellwig,
Chuck Lever, Jan Kara
On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*). There are various ways
> to query that information, eg calling get_block() or using iomap.
>
> What if we pull that information up into the VFS? Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files. We'd need to decide how to lock/protect that information
> -- a per-file lock? A per-extent lock? No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
>
> There are a few problems I think this can solve. One is efficient
> implementation of NFS READPLUS. Another is the callback from iomap
Wouldn't readplus (and maybe a sparse copy program) rather have
something that is "SEEK_DATA, fill the buffer with data from that file
position, and tell me what pos the data came from"?
> to the filesystem when doing buffered writeback. A third is having a
> common implementation of FIEMAP. I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
My 2-second hot take on this is that FUSE might benefit from an incore
mapping cache, but only because (rcu)locking the cache to query it is
likely faster than jumping out to userspace to ask the server process.
If the fuse server could invalidate parts of that cache, that might not
be too terrible.
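As a sketch of that split -- none of this is actual FUSE code, and the
per-block granularity is a deliberate simplification -- the cheap
query / explicit invalidate shape might be:

#include <linux/xarray.h>

#define MAP_STATE_HOLE	1
#define MAP_STATE_DATA	2

/*
 * Lockless query: xa_load() is safe under RCU, so the read side takes
 * no lock and makes no trip to userspace.  0 means "not cached, ask
 * the server".
 */
static int map_state_lookup(struct xarray *cache, pgoff_t block)
{
	void *entry = xa_load(cache, block);

	return xa_is_value(entry) ? xa_to_value(entry) : 0;
}

/* Remember what the server told us about one block. */
static int map_state_remember(struct xarray *cache, pgoff_t block, int state)
{
	return xa_err(xa_store(cache, block, xa_mk_value(state), GFP_KERNEL));
}

/* The server invalidates any range it changes behind our back. */
static void map_state_invalidate(struct xarray *cache, pgoff_t first,
				 pgoff_t last)
{
	pgoff_t block;

	for (block = first; block <= last; block++)
		xa_erase(cache, block);
}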
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen. I'm sure other people know this area of filesystems
> better than I do.
I also suspect that devising a "simple" mapping tree for simple
filesystems will quickly devolve into a mess of figuring out their adhoc
locking and making that work. Even enabling iomap one long-tail fs at a
time sounds like a 10-year project, and they usually already have some
weird notion of mapping coordination.
"But then there's ext4" etc.
--D
> (*) For block device filesystems. Obviously network filesystems and
> synthetic filesystems don't care and can stop reading now. Umm, unless
> maybe they _want_ to use it, eg maybe there's a sharded thing going on and
> the fs wants to store information about each shard in the extent cache?
>
* Re: VFS caching of file extents
From: Christoph Hellwig @ 2024-08-29 4:00 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox, linux-fsdevel, Dave Chinner, Darrick J. Wong,
Christoph Hellwig, Chuck Lever, Jan Kara
> Wouldn't readplus (and maybe a sparse copy program) rather have
> something that is "SEEK_DATA, fill the buffer with data from that file
> position, and tell me what pos the data came from"?
Or rather a read operation that returns a length but no data if there
is a hole. Either way a potentially incoherent VFS cache is the wrong
way to implement it.
> I also suspect that devising a "simple" mapping tree for simple
> filesystems will quickly devolve into a mess of figuring out their adhoc
> locking and making that work.
Heh. If "simple" really means the file systems just using the buffer_head
helpers without any magic, it might actually not be that bad, but once
it gets a little more complicated I tend to agree.
But the first thing would be to establish if we actually need it at all,
or if the buffer_head caching of their metadata is actually enough for
the file systems.
* Re: VFS caching of file extents
From: Dave Chinner @ 2024-08-29 6:05 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chuck Lever, linux-fsdevel, Darrick J. Wong, Christoph Hellwig,
Jan Kara
On Wed, Aug 28, 2024 at 08:50:47PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 28, 2024 at 03:46:34PM -0400, Chuck Lever wrote:
> > On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> > > There are a few problems I think this can solve. One is efficient
> > > implementation of NFS READPLUS.
> >
> > To expand on this, we're talking about the Linux NFS server's
> > implementation of the NFSv4.2 READ_PLUS operation, which is
> > specified here:
> >
> > https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10
> >
> > The READ_PLUS operation can return an array of content segments that
> > include regular data, holes in the file, or data patterns. Knowing
> > how the filesystem stores a file would help NFSD identify where it
> > can return a representation of a hole rather than a string of actual
> > zeroes, for instance.
>
> Thanks for the reference; I went looking for it and found only the
> draft.
>
> Another thing this could help with is reducing page cache usage for
> very sparse files. Today if we attempt to read() or page fault on a
> file hole, we allocate a fresh page of memory and ask the filesystem to
> fill it. The filesystem notices that it's a hole and calls memset().
> If the VFS knew that the extent was a hole, it could use the shared zero
> page instead. Don't know how much of a performance win this would be,
> but it might be useful.
Ah. OK. Maybe I see the reason you are asking this question now.
Buffered reads are still based on the old page-cache-first IO
mechanisms, and so doing smart stuff with "filesystem things"
is difficult to do.
i.e. readahead allocates folios for the readahead range before it
asks the filesystem to do the readahead IO; it is unaware of how the
file is laid out. Hence it can't do smart things with holes.
And it paints the filesystems into a corner, too, because they can't
modify the set of folios they need to fill with data. Hence
the filesystem can't do smart things with holes or unwritten
extents, either.
To solve this, the proposal being made is to lift the filesystem
mapping information up into "the VFS" so that the existing buffered
read code has awareness of the file mapping. That allows this page
cache code to do smarter things. e.g. special case folio
instantiation w.r.t. sparse files (amongst other things).
Have I got that right?
If so, then we've been here before, and we've solved these problems
by inverting the IO path operations. i.e. we do filesystem mapping
operations first, then populate the page cache based on the mapping
being returned.
This is how the iomap buffered write path works, and that's the
reason it supports all the modern filesystem goodies relatively
easily.
The exception to this model in iomap is buffered reads (i.e.
readahead). We still just do what the page cache asks us to do, and
clearly that is now starting to hurt us in the same way the page
cache centric IO model was hurting us for buffered writes a decade
ago.
So, can we invert readahead like we did with buffered writes? That
is, we hand the readahead window that needs to be filled (i.e. a
{mapping, pos, len} tuple) to the filesystem (iomap) which can then
iterate mappings over the readahead range. iomap_iter_readahead()
can then populate the page cache with appropriately sized folios and
do the IO, or use the zero page when over a hole or unwritten
extent...
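Roughly -- none of this exists, and the page cache work in each branch
is elided because that's exactly the part which would need new
infrastructure:

#include <linux/iomap.h>
#include <linux/pagemap.h>

/* One mapping's worth of readahead: we now know what's under the range. */
static s64 iomap_readahead_iter_sketch(struct iomap_iter *iter,
				       struct readahead_control *rac)
{
	u64 len = iomap_length(iter);

	if (iter->iomap.type == IOMAP_HOLE ||
	    iter->iomap.type == IOMAP_UNWRITTEN) {
		/* No IO: instantiate with the shared zero page (or leave
		 * the range unpopulated) instead of memset()ing folios. */
	} else {
		/* Mapped data: pick a folio order to match the extent,
		 * add the folios to the page cache and issue the reads. */
	}
	return len;
}

static void iomap_iter_readahead_sketch(struct readahead_control *rac,
					const struct iomap_ops *ops)
{
	struct iomap_iter iter = {
		.inode	= rac->mapping->host,
		.pos	= readahead_pos(rac),
		.len	= readahead_length(rac),
	};

	while (iomap_iter(&iter, ops) > 0)
		iter.processed = iomap_readahead_iter_sketch(&iter, rac);
}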
Note that optimisations like zero-page-over-holes also need write
path changes. We'd need to change iomap_get_folio() to tell
__filemap_get_folio() to replace zero pages with newly allocated
writeable folios during write operations...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: VFS caching of file extents
From: Chuck Lever III @ 2024-08-29 13:52 UTC (permalink / raw)
To: Christoph Hellwig, Darrick J. Wong
Cc: Matthew Wilcox, Linux FS Devel, Dave Chinner, Darrick Wong,
Jan Kara
> On Aug 29, 2024, at 12:00 AM, Christoph Hellwig <hch@lst.de> wrote:
>
>> Wouldn't readplus (and maybe a sparse copy program) rather have
>> something that is "SEEK_DATA, fill the buffer with data from that file
>> position, and tell me what pos the data came from"?
>
> Or rather a read operation that returns a length but no data if there
> is a hole.
That is essentially what READ_PLUS does. The "HOLE" array
members are a length. The receiving client is then free
to represent that hole in whatever way is most convenient.
NFSD can certainly implement READ_PLUS so that it returns
only a single array element -- and that element would be
either CONTENT or HOLE -- in a possibly short read result.
(That might be what it is doing already, come to think of
it).
The problem with SEEK_DATA AIUI is that READ_PLUS wants
a snapshot of the file's state. SEEK_DATA is a separate
operation, so some kind of serialization would be
necessary to prevent file changes between reads and seeks.
--
Chuck Lever
* Re: VFS caching of file extents
From: Dave Chinner @ 2024-08-29 22:36 UTC (permalink / raw)
To: Chuck Lever III
Cc: Christoph Hellwig, Darrick J. Wong, Matthew Wilcox,
Linux FS Devel, Darrick Wong, Jan Kara
On Thu, Aug 29, 2024 at 01:52:40PM +0000, Chuck Lever III wrote:
>
>
> > On Aug 29, 2024, at 12:00 AM, Christoph Hellwig <hch@lst.de> wrote:
> >
> >> Wouldn't readplus (and maybe a sparse copy program) rather have
> >> something that is "SEEK_DATA, fill the buffer with data from that file
> >> position, and tell me what pos the data came from"?
> >
> > Or rather a read operation that returns a length but no data if there
> > is a hole.
Right, it needs to be an iomap level operation - if the map returned
is a hole it records a hole, otherwise it reads through the page
cache.
But to do this, we need the buffered reads to hit the filesystem
first and get the mapping to determine how to process the incoming
read operation, not go through the page cache first and bypass the
fs entirely because there are cached pages full of zeroes in the
page cache over the hole...
> That is essentially what READ_PLUS does. The "HOLE" array
> members are a length. The receiving client is then free
> to represent that hole in whatever way is most convenient.
>
> NFSD can certainly implement READ_PLUS so that it returns
> only a single array element -- and that element would be
> either CONTENT or HOLE -- in a possibly short read result.
> (That might be what it is doing already, come to think of
> it).
>
> The problem with SEEK_DATA AIUI is that READ_PLUS wants
> a snapshot of the file's state. SEEK_DATA is a separate
> operation, so some kind of serialization would be
> necessary to prevent file changes between reads and seeks.
Note that SEEK_DATA also handles unwritten extents correctly. It will
report them as a hole if there is no cached data over the range,
otherwise it will be considered data. The only downside of this is
that if the unwritten extent has been read and the page cache
contains clean zeroed folios, SEEK_DATA will always return that
unwritten range as data.
IOWs, unwritten extent mappings for READPLUS would need special
handling to only return data if the page cache folios over the
unwritten range are dirty...
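The check itself is cheap -- something like the sketch below, feeding
into a READ_PLUS walk like the one earlier in the thread
(filemap_range_needs_writeback() already exists; the caller is
assumed):

#include <linux/fs.h>

/*
 * Clean page cache folios over an unwritten extent only hold zeroes
 * read back from it, so it can still be reported as a hole; dirty or
 * in-writeback folios are data that hasn't reached the disk yet.
 */
static bool unwritten_range_is_data(struct inode *inode, loff_t pos, u64 len)
{
	return filemap_range_needs_writeback(inode->i_mapping, pos,
					     pos + len - 1);
}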
-Dave.
--
Dave Chinner
david@fromorbit.com