public inbox for linux-nfs@vger.kernel.org
From: Josef Bacik <josef@toxicpanda.com>
To: Anna Schumaker <anna@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
	Chuck Lever III <chuck.lever@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation
Date: Fri, 22 Jul 2022 09:32:37 -0400	[thread overview]
Message-ID: <Ytqm9azcQfrC4vpG@localhost.localdomain> (raw)
In-Reply-To: <CAFX2JfkLW1RC9T45dN5pzfENQ+LXqF=cxDS7hxGUzaheuH07kQ@mail.gmail.com>

On Fri, Jul 22, 2022 at 08:45:13AM -0400, Anna Schumaker wrote:
> On Thu, Jul 21, 2022 at 4:48 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote:
> > > On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
> > > > > > On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@kernel.org> wrote:
> > > > > >
> > > > > > From: Anna Schumaker <Anna.Schumaker@Netapp.com>
> > > > > >
> > > > > > Rather than relying on the underlying filesystem to tell us where hole
> > > > > > and data segments are through vfs_llseek(), let's instead do the hole
> > > > > > compression ourselves. This has a few advantages over the old
> > > > > > implementation:
> > > > > >
> > > > > > 1) A single call to the underlying filesystem through nfsd_readv() means
> > > > > >   the file can't change from underneath us in the middle of encoding.
> > > >
> > > > Hi Anna,
> > > >
> > > > I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start
> > > > of nfsd4_encode_read_plus_data() that is used to trim the data that
> > > > has already been read out of the file?
> > >
> > > There is also the vfs_llseek(SEEK_DATA) call at the start of
> > > nfsd4_encode_read_plus_hole(). They are used to determine the length
> > > of the current hole or data segment.
> > >
> > > >
> > > > What's the problem with racing with a hole punch here? All it does
> > > > is shorten the read data returned to match the new hole, so all it's
> > > > doing is making the returned data "more correct".
> > >
> > > The problem is we call vfs_llseek() potentially many times when
> > > encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a
> > > loop where we alternate between hole and data segments until we've
> > > encoded the requested number of bytes. My attempts at locking the file
> > > have resulted in a deadlock since vfs_llseek() also locks the file, so
> > > the file could change from underneath us during each iteration of the
> > > loop.
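[The alternating walk described above can be sketched from userspace. This is an illustration only, using os.lseek() with SEEK_HOLE/SEEK_DATA; the function name and structure are hypothetical and the actual nfsd4_encode_read_plus() loop uses vfs_llseek() in the kernel. The point it demonstrates: each segment boundary is a separate probe, so the file can change between iterations.]

```python
import os

def walk_segments(fd, offset, count):
    """Yield ("data" | "hole", start, length) segments covering
    [offset, offset + count), alternating SEEK_HOLE and SEEK_DATA.
    Each probe is a separate call, so the file can change between
    iterations -- the race being discussed in this thread."""
    end = offset + count
    pos = offset
    while pos < end:
        try:
            # Next hole at or after pos; equals pos if pos is inside a hole.
            hole = os.lseek(fd, pos, os.SEEK_HOLE)
        except OSError:
            hole = end  # ENXIO: pos is at or past EOF
        if hole > pos:
            yield ("data", pos, min(hole, end) - pos)
            pos = min(hole, end)
            continue
        try:
            # pos is inside a hole; find where data resumes.
            data = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError:
            data = end  # ENXIO: no more data before EOF
        yield ("hole", pos, min(data, end) - pos)
        pos = min(data, end)
```

Note that a filesystem with no hole support degrades gracefully here (the whole range comes back as one data segment), which matches SEEK_HOLE's documented fallback of reporting a single virtual hole at EOF.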
> > >
> > > >
> > > > OTOH, if something allocates over a hole that the read filled with
> > > > zeros, what's the problem with occasionally returning zeros as data?
> > > > Regardless, if this has raced with a write to the file that filled
> > > > that hole, we're already returning stale data/hole information to
> > > > the client regardless of whether we trim it or not....
> > > >
> > > > i.e. I can't see a correctness or data integrity problem here that
> > > > doesn't already exist, and I have my doubts that hole
> > > > punching/filling racing with reads happens often enough to create a
> > > > performance or bandwidth problem OTW. Hence I've really got no idea
> > > > what the problem that needs to be solved here is.
> > > >
> > > > Can you explain what the symptoms of the problem a user would see
> > > > that this change solves?
> > >
> > > This fixes xfstests generic/091 and generic/263, along with this
> > > reported bug: https://bugzilla.kernel.org/show_bug.cgi?id=215673
> > > >
> > > > > > 2) A single call to the underlying filesystem also means that the
> > > > > >   underlying filesystem only needs to synchronize cached and on-disk
> > > > > >   data one time instead of potentially many times, speeding up the reply.
> > > >
> > > > SEEK_HOLE/DATA doesn't require cached data to be sync'd to disk to
> > > > be coherent - that's only a problem FIEMAP has (and syncing cached
> > > > data doesn't fix the TOCTOU coherency issue!).  i.e. SEEK_HOLE/DATA
> > > > will check the page cache for data if appropriate (e.g. unwritten
> > > > disk extents may have data in memory over the top of them) instead
> > > > of syncing data to disk.
> > >
> > > For some reason, btrfs on virtual hardware has terrible performance
> > > numbers when using vfs_llseek() with files that are already in the
> > > server's cache. I think it had something to do with how they lock
> > > extents, and some extra work that needs to be done if the file already
> > > exists in the server's memory, but it's been a few years since I've
> > > gone into their code to figure out where the slowdown is coming from.
> > > See this section of my performance results wiki page:
> > > https://wiki.linux-nfs.org/wiki/index.php/Read_Plus_May_2022#BTRFS_3
> > >
> >
> > I just did this locally, once in my test vm's and once on my continuous
> > performance testing hardware and I'm not able to reproduce the numbers, so I
> > think I'm doing something wrong.
> >
> > My test is simple: I dd a 5GiB file, then come behind it and punch holes in
> > every other 4k block.  Then I umount and remount, SEEK_DATA+SEEK_HOLE through
> > the whole file, and then do it again so I have both uncached and cached runs.
> > The numbers I'm seeing are equivalent to ext4/xfs.  Granted, on my VM I had to
> > redo the test because I had lockdep and other debugging on, which makes us look
> > bad because of the extent locking stuff, but with it off everything appears
> > normal.
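[Scaled down, the test described above can be reproduced with a short script. This is a sketch under stated assumptions: a small file instead of 5GiB, fallocate(2) with FALLOC_FL_PUNCH_HOLE called via ctypes since Python does not expose hole punching directly, and no umount/remount step, so it only exercises the cached pass.]

```python
import ctypes
import os
import time

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL("libc.so.6", use_errno=True)
# int fallocate(int fd, int mode, off_t offset, off_t len)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_long, ctypes.c_long]

def make_holey_file(path, size, blk=4096):
    """Write `size` bytes, then punch a hole in every other `blk`-sized
    chunk -- a scaled-down stand-in for the dd + hole-punch setup."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    for _ in range(size // blk):
        os.write(fd, b"x" * blk)
    for off in range(0, size, 2 * blk):
        if libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          off, blk) != 0:
            raise OSError(ctypes.get_errno(), "fallocate punch failed")
    return fd

def seek_walk(fd, size):
    """SEEK_DATA+SEEK_HOLE through the whole file, the way the test above
    does; returns (data_segment_count, elapsed_seconds)."""
    t0 = time.perf_counter()
    nsegs, pos = 0, 0
    while pos < size:
        try:
            pos = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError:
            break  # ENXIO: no more data before EOF
        pos = os.lseek(fd, pos, os.SEEK_HOLE)
        nsegs += 1
    return nsegs, time.perf_counter() - t0
```

Running the walk twice (once after dropping caches, once warm) would give the uncached/cached comparison; whether the punched ranges actually report as holes depends on the filesystem backing the file.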
> >
> > Does this more or less mirror your testing?  Looking at our code we're not doing
> > anything inherently stupid, so I'm not entirely sure what could be the problem.
> > Thanks,
> 
> I think that's pretty close to what the server is doing with the
> current code, except the NFS server would also do a read for every
> data segment it found. I've been using `vmtouch` to make sure the file
> doesn't get evicted from the server's page cache for my cached-data runs.
>

I messed with it some more and I can't get it to be slow.  If you trip over
something like this again just give a shout and I'd be happy to dig in.  Thanks,

Josef 

Thread overview: 29+ messages
2022-07-15 18:44 [PATCH v3 0/6] NFSD: Improvements for the NFSv4.2 READ_PLUS operation Anna Schumaker
2022-07-15 18:44 ` [PATCH v3 1/6] SUNRPC: Introduce xdr_stream_move_subsegment() Anna Schumaker
2022-07-15 18:44 ` [PATCH v3 2/6] SUNRPC: Introduce xdr_encode_double() Anna Schumaker
2022-07-15 18:44 ` [PATCH v3 3/6] SUNRPC: Introduce xdr_buf_trim_head() Anna Schumaker
2022-07-15 18:44 ` [PATCH v3 4/6] SUNRPC: Introduce xdr_buf_nth_page_address() Anna Schumaker
2022-07-15 18:44 ` [PATCH v3 5/6] SUNRPC: Export xdr_buf_pagecount() Anna Schumaker
2022-07-15 18:44 ` [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation Anna Schumaker
2022-07-15 19:08   ` Chuck Lever III
2022-07-18  1:15     ` Dave Chinner
2022-07-19 17:21       ` Chuck Lever III
2022-07-19 20:24         ` Anna Schumaker
2022-07-19 20:47           ` Chuck Lever III
2022-07-19 21:10           ` Matthew Wilcox
2022-07-19 23:18         ` Dave Chinner
2022-07-19 20:46       ` Anna Schumaker
2022-07-19 22:44         ` Dave Chinner
2022-07-20  1:26           ` Chuck Lever III
2022-07-20  2:36             ` Dave Chinner
2022-07-20  4:18               ` Chuck Lever III
2022-07-22  0:44                 ` Dave Chinner
2022-07-22 15:09                   ` Chuck Lever III
2022-08-18 18:31                     ` Anna Schumaker
2022-08-19 15:18                       ` Chuck Lever III
2022-07-20 12:55               ` Jeff Layton
2022-07-21 23:12                 ` Dave Chinner
2022-07-21 20:47         ` Josef Bacik
2022-07-22 12:45           ` Anna Schumaker
2022-07-22 13:32             ` Josef Bacik [this message]
2022-07-22 13:43               ` Anna Schumaker
