Re: Question about clone_range() metadata stability

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "david@fromorbit.com" <david@fromorbit.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Question about clone_range() metadata stability
Date: Tue, 3 Dec 2019 08:35:26 -0800
Message-ID: <20191203163526.GD7323@magnolia>
In-Reply-To: <52f1afb6e0a2026840da6f4b98a5e01a247447e5.camel@hammerspace.com>

On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > Hi all
> > > > 
> > > > A quick question about clone_range() and guarantees around
> > > > metadata stability.
> > > > 
> > > > Are users required to call fsync/fsync_range() after calling
> > > > clone_range() in order to guarantee that the cloned range
> > > > metadata is persisted?
> > > 
> > > Yes.
> > > 
> > > > I'm assuming that it is required in order to guarantee that
> > > > data is persisted.
> > > 
> > > Data and metadata.  XFS and ocfs2's reflink implementations will
> > > flush the page cache before starting the remap, but they both
> > > require fsync to force the log/journal to disk.
> > 
> > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > post vfs_clone_file_range() completion on the server side, yes?
> > 
> 
> I chose to implement this using a full call to vfs_fsync_range(), since
> we really do want to ensure data stability as well. Consider, for
> instance, the case where client A is running an application, and client
> B runs vfs_clone_file_range() in order to create a point in time
> snapshot of the file for disaster recovery purposes...

Seems reasonable, since (alas) we didn't define the ->remap_file_range
API to guarantee persistence for you.

> > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > word for it.)
> > > 
> > > > I'm asking because knfsd currently just does a call to
> > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation.
> > > > It does not call fsync()/fsync_range() on the destination file,
> > > > and since the NFSv4.2 protocol does not require you to perform
> > > > any other operation in order to persist data/metadata, I'm
> > > > worried that we may be corrupting the cloned file if the NFS
> > > > server crashes at the wrong moment after the client has been
> > > > told the clone completed.
> > 
> > Yup, that's exactly what server side calls to commit_metadata() are
> > supposed to address.
> > 
> > I suspect to be correct, this might require commit_metadata() to be
> > called on both the source and destination inodes, as both of them
> > may have modified metadata as a result of the clone operation. For
> > XFS one of them will be a no-op, but for other filesystems that
> > don't implement ->commit_metadata, we'll need to call
> > sync_inode_metadata() on both inodes...
> > 
> 
> That's interesting. I hadn't considered that a clone might cause the
> source metadata to change as well. What kind of change specifically are
> we talking about? Is it just delayed block allocation, or is there
> more?

In XFS' case, we added a per-inode flag to help us bypass the reference
count lookup during a write if the file has never shared any blocks, so
if you never share anything, you'll never pay any of the runtime costs
of the COW mechanism.  A clone has to set that flag on the source inode
as well as the destination, which dirties the source's metadata.
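
To make the source-side change concrete, here is a toy userspace model
of such a flag.  It is only a sketch, not XFS code: struct toy_inode,
its fields, and the toy_* helpers are hypothetical names made up for
illustration.

```c
#include <stdbool.h>

/* Toy stand-in for an inode; the field names are hypothetical. */
struct toy_inode {
	bool has_shared_blocks;	/* models the per-inode "has shared blocks" flag */
	bool metadata_dirty;	/* inode metadata changed since the last sync */
};

/* Write path: only flagged inodes need the reference count lookup. */
bool toy_write_needs_refcount_lookup(const struct toy_inode *ip)
{
	return ip->has_shared_blocks;
}

/* Setting the flag for the first time dirties the inode. */
static void toy_set_shared(struct toy_inode *ip)
{
	if (!ip->has_shared_blocks) {
		ip->has_shared_blocks = true;
		ip->metadata_dirty = true;
	}
}

/* Clone: both files now share blocks, so both get the flag. */
void toy_clone(struct toy_inode *src, struct toy_inode *dst)
{
	toy_set_shared(src);
	toy_set_shared(dst);
}
```

In this model the first clone of a never-shared file leaves the source
with metadata_dirty set, which is the case a later sync has to cover.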

ocfs2's design has a reference count tree that is shared between groups
of files that have been reflinked from each other.  So if you start with
unshared files A and B and clone A to A1 and A2, and B to B1 and B2,
then A* will share one refcount tree and B* will share another.  The
first clone of an unshared file therefore has to attach a refcount tree
to the source as well.

Either way, NFS has to assume that changes could have been made to the
source file.
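
From userspace, the same rule could be sketched roughly as below: clone,
then fsync both files.  This is an illustration under assumptions, not
the knfsd change; clone_and_persist() is a hypothetical helper, and it
uses the FICLONE ioctl from <linux/fs.h> to clone the whole file.

```c
#include <errno.h>
#include <linux/fs.h>	/* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Clone all of src_fd into dst_fd, then make the result durable.
 * Returns 0 on success, -1 with errno set on failure.
 */
int clone_and_persist(int src_fd, int dst_fd)
{
	/* Share the source's extents with the destination (reflink). */
	if (ioctl(dst_fd, FICLONE, src_fd) < 0)
		return -1;

	/* The remap alone is not durable: flush destination data and metadata. */
	if (fsync(dst_fd) < 0)
		return -1;

	/* The clone may have dirtied the source inode too, so sync it as well. */
	if (fsync(src_fd) < 0)
		return -1;

	return 0;
}
```

Both descriptors must be on the same reflink-capable filesystem, and
dst_fd must be open for writing; otherwise the ioctl fails and the
helper returns -1 with errno set.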

--D

> Thanks
>   Trond
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
>
