Re: [PATCH 3/3] nfsd: simplify write verifier handling

Linux NFS development
 help / color / mirror / Atom feed

From: Trond Myklebust <trondmy@kernel.org>
To: Rick Macklem <rick.macklem@gmail.com>, Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	willy@infradead.org, linux-nfs@vger.kernel.org
Subject: Re: [PATCH 3/3] nfsd: simplify write verifier handling
Date: Tue, 14 Feb 2023 18:34:46 -0500	[thread overview]
Message-ID: <813f86d8b2a6788728e1c1b8648fce4cda3c36f1.camel@kernel.org> (raw)
In-Reply-To: <CAM5tNy71pXhzf4wA+KuhF6kCOYspzpfNa8HR=d1XM=G_gng9dQ@mail.gmail.com>

On Tue, 2023-02-14 at 14:57 -0800, Rick Macklem wrote:
> On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <jlayton@kernel.org>
> wrote:
> > 
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton
> > > > <jlayton@kernel.org>
> > > > wrote:
> > > > > 
> > > > > CAUTION: This email originated from outside of the University
> > > > > of
> > > > > Guelph. Do not click links or open attachments unless you
> > > > > recognize
> > > > > the sender and know the content is safe. If in doubt, forward
> > > > > suspicious emails to IThelp@uoguelph.ca
> > > > > 
> > > > > 
> > > > > The write verifier exists to tell the client when the server
> > > > > may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this
> > > > > happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it
> > > > > when
> > > > > there
> > > > > are writeback errors as well.
> > > > > 
> > > > > The way it works today though, we call something like
> > > > > vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset
> > > > > the
> > > > > write
> > > > > verifier.
> > > > > 
> > > > > This is non-optimal for a couple of reasons:
> > > > > 
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write
> > > > > verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > > 
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error
> > > > > on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely
> > > > > separate fs).
> > > > > 
> > > > Here's the snippet from RFC8881:
> > > >    The final portion of the result is the field writeverf. 
> > > > This
> > > > field
> > > >    is the write verifier and is a cookie that the client can
> > > > use to
> > > >    determine whether a server has changed instance state (e.g.,
> > > > server
> > > >    restart) between a call to WRITE and a subsequent call to
> > > > either
> > > >    WRITE or COMMIT.  This cookie MUST be unchanged during a
> > > > single
> > > >    instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > >    of the NFSv4.1 server.  If the cookie changes, then the
> > > > client
> > > > MUST
> > > >    assume that any data written with an UNSTABLE4 value for
> > > > committed
> > > >    and an old writeverf in the reply has been lost and will
> > > > need to
> > > > be
> > > >    recovered.
> > > > 
> > > > I've always interpreted the writeverf as "per-server" and not 
> > > > "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear,
> > > > it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and
> > > > not a
> > > > file or
> > > > file system on the server.
> > > > 
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file
> > > > system)
> > > > the
> > > > writeverf is
> > > > replied with.  (I vaguely recall Solaris does the same?)
> > > > 
> > > > At the very least, I think you should run this past the IETF
> > > > working
> > > > group
> > > > (nfsv4@ietf.org) to see what they say w.r.t. the writeverf
> > > > being
> > > > "per-file" vs
> > > > "per-server".
> > > > 
> > > 
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > > 
> > > 
> > > That's why I kept it a global in the first place.
> > > 
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > > 
> > > 
> > >    The server must vary the value of the write
> > >    verifier at each server event or instantiation that may lead
> > > to a
> > >    loss of uncommitted data.  Most commonly this occurs when the
> > > server
> > >    is restarted; however, other events at the server may result
> > > in
> > >    uncommitted data loss as well.
> > > 
> > > So I feel it is quite OK to use the verifier the way we do now in
> > > order
> > > to signify that a fatal write error has occurred and that clients
> > > must
> > > resend any data that was uncommitted.
> > > 
> > 
> > Thanks, I missed that discussion. I think you guys have convinced
> > me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> > 
> > It's a pity. It would have been a lot more elegant as a per-inode
> > thing!
> > 
> If you think it is worth the effort, you could propose an extension
> to
> 4.2. Something like Write_plus, Commit_plus operations.
> 
> rick
> 
OK. Apparently some further clarification is required here.

The main reason for needing to bump the boot verifier on an error is
NFSv3. Unlike NFSv4, the NFSv3 clients are stateless, and so error
propagation using Jeff's "errseq" mechanism is inherently flawed
because it is ultimately caching the I/O errors in a file descriptor.
The fact that the file cache garbage collector can close these NFSv3
file descriptors at any time without any possibility of coordination
with the clients, and causing loss of the "errseq" cached data, means
that we must have an alternative mechanism to propagate error state for
unstable writes. Hence the use of the boot verifier.

NFSv4 does not have these limitations. The clients are required to use
stateful OPEN/CLOSE/DELEGRETURN/... operations in order to signify to
the NFSv4 server when file descriptors need to be kept open, and hence
"errseq" data needs to be preserved. The only cases where that
assumption is violated are that of a network partition and server
reboot, where it is obvious to all parties that information has been
lost.
Note: The Linux NFSv4 client does not currently assume that information
might be lost during network partition, so we might want to look into
fixing that. However it does obviously recover safely during a server
reboot thanks to the boot verifier mechanism.

IOW: there should be no need for a change in NFSv4 semantics in order
to address this issue. The problem is, as usual, mostly limited to
NFSv3.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

     prev parent reply	other threads:[~2023-02-14 23:35 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-13 21:13 [PATCH 0/3] nfsd: write verifier fixes and optimization Jeff Layton
2023-02-13 21:13 ` [PATCH 1/3] nfsd: copy the whole verifier in nfsd_copy_write_verifier Jeff Layton
2023-02-13 21:13 ` [PATCH 2/3] errseq: add a new errseq_fetch helper Jeff Layton
2023-02-13 21:13 ` [PATCH 3/3] nfsd: simplify write verifier handling Jeff Layton
2023-02-14  0:49   ` Rick Macklem
2023-02-14  3:28     ` Trond Myklebust
2023-02-14 13:53       ` Jeff Layton
2023-02-14 14:58         ` Chuck Lever III
2023-02-14 15:01           ` Jeff Layton
2023-02-14 22:57         ` Rick Macklem
2023-02-14 23:16           ` Jeff Layton
2023-02-14 23:34           ` Trond Myklebust [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=813f86d8b2a6788728e1c1b8648fce4cda3c36f1.camel@kernel.org \
    --to=trondmy@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=rick.macklem@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox