From: Jeff Layton <jlayton@kernel.org>
To: Dave Chinner <david@fromorbit.com>,
	Chuck Lever III <chuck.lever@oracle.com>
Cc: Anna Schumaker <anna@kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation
Date: Wed, 20 Jul 2022 08:55:23 -0400	[thread overview]
Message-ID: <8122876aa3afe2b57d2c3153045d3e1936210b98.camel@kernel.org> (raw)
In-Reply-To: <20220720023610.GN3600936@dread.disaster.area>

On Wed, 2022-07-20 at 12:36 +1000, Dave Chinner wrote:
> On Wed, Jul 20, 2022 at 01:26:13AM +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Jul 19, 2022, at 6:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote:
> > > > On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > 
> > > > > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
> > > > > > > On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@kernel.org> wrote:
> > > > > > > 
> > > > > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > > > 
> > > > > > > Rather than relying on the underlying filesystem to tell us where hole
> > > > > > > and data segments are through vfs_llseek(), let's instead do the hole
> > > > > > > compression ourselves. This has a few advantages over the old
> > > > > > > implementation:
> > > > > > > 
> > > > > > > 1) A single call to the underlying filesystem through nfsd_readv() means
> > > > > > >  the file can't change from underneath us in the middle of encoding.
> > > > > 
> > > > > Hi Anna,
> > > > > 
> > > > > I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start
> > > > > of nfsd4_encode_read_plus_data() that is used to trim the data that
> > > > > has already been read out of the file?
> > > > 
> > > > There is also the vfs_llseek(SEEK_DATA) call at the start of
> > > > nfsd4_encode_read_plus_hole(). They are used to determine the length
> > > > of the current hole or data segment.
> > > > 
> > > > > 
> > > > > What's the problem with racing with a hole punch here? All it does
> > > > > is shorten the read data returned to match the new hole, so all it's
> > > > > doing is making the returned data "more correct".
> > > > 
> > > > The problem is we call vfs_llseek() potentially many times when
> > > > encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a
> > > > loop where we alternate between hole and data segments until we've
> > > > encoded the requested number of bytes. My attempts at locking the file
> > > > have resulted in a deadlock since vfs_llseek() also locks the file, so
> > > > the file could change from underneath us during each iteration of the
> > > > loop.
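
FWIW, for anyone trying to picture the window here: each segment
boundary comes from a separate llseek-style query, issued after the
data has already been read into the reply buffer. A userspace analogue
of the per-segment walk (untested sketch, just to make the race
concrete):

#define _GNU_SOURCE
#include <unistd.h>

/*
 * Return the offset where the segment containing 'pos' ends, and
 * whether that segment is data or a hole.  Nothing prevents the file
 * from changing between the read that filled our buffer and these
 * lseek() calls -- that is the window being discussed.
 */
static off_t segment_end(int fd, off_t pos, int *is_data)
{
        off_t data = lseek(fd, pos, SEEK_DATA);
        off_t hole = lseek(fd, pos, SEEK_HOLE);

        *is_data = (data == pos);
        /* data < 0 (ENXIO) means no data past pos: hole runs to EOF */
        return *is_data ? hole : data;
}
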
> > > 
> > > So the problem being solved is that the current encoding is not
> > > atomic, rather than trying to avoid any computational overhead of
> > > multiple vfs_llseek calls (which are largely just the same extent
> > > lookups as we do during the read call)?
> > 
> > Reviewing [1] and [2] I don't find any remarks about atomicity
> > guarantees. If a client needs an uncontended view of a file's
> > data, it's expected to fence other accessors via an OPEN(deny)
> > or LOCK operation, or serialize the requests itself.
> 
> You've got the wrong "atomicity" scope :)
> 
> What I was talking about is the internal server side data operation
> atomicity. That is, what is returned from the read to the READ_PLUS
> code is not atomic w.r.t. the vfs_llseek() calls that are then used
> to determine where the holes in the data returned by the read are.
> 
> Hence after the read has returned data to READ_PLUS, something else
> can modify the data in the file (e.g. filling a hole or punching a
> new one out) and then the ranges vfs_llseek() returns to READ_PLUS
> do not match the data that it has in its local buffer.
> 
> i.e. to do what the READ_PLUS operation is doing now, it would
> somehow need to ensure no modifications can be made between the read
> starting and the last vfs_llseek() call completing. IOWs, they need
> to be performed as an atomic operation, not as a set of
> independently locked (or unlocked!) operations as they are now.
> 
> > > The implementation just seems backwards to me - rather than reading
> > > data and then trying to work out where the holes are, I suspect it
> > > should be working out where the holes are and then reading the data.
> > > This is how the IO path in filesystems works, so it would seem like a
> > > no-brainer to try to leverage the infrastructure we already have to
> > > do that.
> > > 
> > > The information is there and we have infrastructure that exposes it
> > > to the IO path, it's just *underneath* the page cache and the page
> > > cache destroys the information that it used to build the data it
> > > returns to the NFSD.
> > > 
> > > IOWs, it seems to me that what READ_PLUS really wants is a "sparse
> > > read operation" from the filesystem rather than the current "read
> > > that fills holes with zeroes". i.e. a read operation that sets an
> > > iocb flag like RWF_SPARSE_READ to tell the filesystem to trim the
> > > read to just the ranges that contain data.
> > > 
> > > That way the read populates the page cache over a single contiguous
> > > range of data and returns with the {offset, len} that spans the
> > > range that is read and mapped. The caller can then read that region
> > > out of the page cache and mark all the non-data regions as holes in
> > > whatever manner they need to.
> > > 
> > > The iomap infrastructure that XFS and other filesystems use provides
> > > this exact "map only what contains data" capability - an iomap tells
> > > the page cache exactly what underlies the data range (hole, data,
> > > unwritten extents, etc) in an efficient manner, so it wouldn't be a
> > > huge stretch just to limit read IO ranges to those that contain only
> > > DATA extents.
> > > 
> > > At this point READ_PLUS then just needs to iterate doing sparse
> > > reads and recording the ranges that return data as a vector of some
> > > kind that is then passed to the encoding function to encode it as
> > > a sparse READ_PLUS data range....
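
Just to make sure I'm reading the proposal right, the consumer side
would look something like the below? sparse_read() here is a stand-in
for whatever kiocb-flag-driven interface eventually gets settled on --
completely hypothetical, and hand-waving the actual nfsd read plumbing:

struct read_seg {
        loff_t  offset;         /* where the returned data starts */
        size_t  len;            /* how much data was read there */
};

/*
 * Iterate sparse reads across [pos, pos + count), recording the data
 * ranges that come back.  Everything between recorded segments gets
 * encoded as a hole by the caller.  Assumes the read never returns
 * anything past pos + count.
 */
static int read_plus_collect(struct file *file, loff_t pos, size_t count,
                             struct read_seg *segs, int max_segs)
{
        int nsegs = 0;

        while (count > 0 && nsegs < max_segs) {
                loff_t data_pos;
                ssize_t got;

                /* hypothetical: read trimmed to the next data range */
                got = sparse_read(file, pos, count, &data_pos);
                if (got <= 0)
                        break;  /* error, EOF, or nothing but hole left */

                segs[nsegs].offset = data_pos;
                segs[nsegs].len = got;
                nsegs++;

                count -= (data_pos - pos) + got;
                pos = data_pos + got;
        }
        return nsegs;
}
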
> > 
> > The iomap approach
> 
> Not actually what I proposed - I'm suggesting a new kiocb flag that
> changes what the read passed to the filesystem does. My comments
> about iomap are that this infrastructure already provides the extent
> range query mechanisms that allow us to efficiently perform such
> "restricted range" IO operations.
> 
> > seems sensible to me and covers the two basic
> > usage scenarios:
> > 
> > - Large sparse files, where we want to conserve both network
> >   bandwidth and client (and intermediate) cache occupancy.
> >   These are best served by exposing data and holes.
> 
> *nod*
> 
> > - Everyday files that are relatively small and generally will
> >   contain few, if any, holes. These are best served by using
> >   a splice read (where possible) -- that is, READ_PLUS in this
> >   case should work exactly like READ.
> 
> *nod*
> 
> > My impression of client implementations is that, a priori,
> > a client does not know whether a file contains holes or not,
> > but will probably always use READ_PLUS and let the server
> > make the choice for it.
> 
> *nod*
> 
> > Now how does the server make that choice? Is there an attribute
> > bit that indicates when a file should be treated as sparse? Can
> > we assume that immutable files (or compressed files) should
> > always be treated as sparse? Alternately, the server might use
> > the file's data : hole ratio.
> 
> None of the above. The NFS server has no business knowing intimate
> details about how the filesystem has laid out the file. All it cares
> about is ranges containing data and ranges that have no data (holes).
> 
> If the filesystem can support a sparse read, it returns sparse
> ranges containing data to the NFS server. If the filesystem can't
> support it, or its internal file layout doesn't allow for efficient
> hole/data discrimination, then it can just return the entire read
> range.
> 
> Alternatively, in this latter case, the filesystem could call a
> generic "sparse read" implementation that runs memchr_inv() to find
> the first data range to return. Then the NFS server doesn't have to
> code things differently, filesystems don't need to advertise
> support for sparse reads, etc. because every filesystem could
> support sparse reads.
> 
> The only difference is that some filesystems will be much more
> efficient and faster at it than others. We already see that sort
> of thing with btrfs and seek hole/data on large cached files so I
> don't see "filesystems perform differently" as a problem here...
> 

^^^
This seems like more trouble than it's worth, and would probably result
in worse performance. The generic implementation should just return a
single non-sparse extent in the sparse read reply if it doesn't know how
to fill out a sparse read properly. IOW, we shouldn't try to find holes,
unless the underlying filesystem can do that itself via iomap sparse
read or some similar mechanism.
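
Concretely -- a rough, untested sketch, with plain_read() standing in
for whatever the normal read path would be -- the generic fallback I
have in mind does no scanning at all:

/*
 * Generic "sparse" read for filesystems that can't cheaply find holes:
 * read the range as usual and report the whole thing as one data
 * segment.  No memchr_inv() pass over the buffer.
 */
static ssize_t sparse_read_fallback(struct file *file, loff_t pos,
                                    size_t count, loff_t *data_pos)
{
        ssize_t got = plain_read(file, pos, count); /* hypothetical helper */

        if (got > 0)
                *data_pos = pos; /* one segment covering everything read */
        return got;
}

Filesystems that can do better (e.g. via an iomap-based sparse read)
would return properly trimmed segments instead.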
-- 
Jeff Layton <jlayton@kernel.org>

Thread overview: 22+ messages
     [not found] <20220715184433.838521-1-anna@kernel.org>
     [not found] ` <20220715184433.838521-7-anna@kernel.org>
2022-07-15 19:08   ` [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation Chuck Lever III
2022-07-18  1:15     ` Dave Chinner
2022-07-19 17:21       ` Chuck Lever III
2022-07-19 20:24         ` Anna Schumaker
2022-07-19 20:47           ` Chuck Lever III
2022-07-19 21:10           ` Matthew Wilcox
2022-07-19 23:18         ` Dave Chinner
2022-07-19 20:46       ` Anna Schumaker
2022-07-19 22:44         ` Dave Chinner
2022-07-20  1:26           ` Chuck Lever III
2022-07-20  2:36             ` Dave Chinner
2022-07-20  4:18               ` Chuck Lever III
2022-07-22  0:44                 ` Dave Chinner
2022-07-22 15:09                   ` Chuck Lever III
2022-08-18 18:31                     ` Anna Schumaker
2022-08-19 15:18                       ` Chuck Lever III
2022-07-20 12:55               ` Jeff Layton [this message]
2022-07-21 23:12                 ` Dave Chinner
2022-07-21 20:47         ` Josef Bacik
2022-07-22 12:45           ` Anna Schumaker
2022-07-22 13:32             ` Josef Bacik
2022-07-22 13:43               ` Anna Schumaker
