From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "david@fromorbit.com" <david@fromorbit.com>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Question about clone_range() metadata stability
Date: Tue, 3 Dec 2019 08:35:26 -0800
Message-ID: <20191203163526.GD7323@magnolia>
In-Reply-To: <52f1afb6e0a2026840da6f4b98a5e01a247447e5.camel@hammerspace.com>

On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > Hi all
> > > >
> > > > A quick question about clone_range() and guarantees around
> > > > metadata
> > > > stability.
> > > >
> > > > Are users required to call fsync/fsync_range() after calling
> > > > clone_range() in order to guarantee that the cloned range
> > > > metadata is
> > > > persisted?
> > >
> > > Yes.
> > >
> > > > I'm assuming that it is required in order to guarantee that
> > > > data is persisted.
> > >
> > > Data and metadata. XFS and ocfs2's reflink implementations will
> > > flush
> > > the page cache before starting the remap, but they both require
> > > fsync to
> > > force the log/journal to disk.
> >
> > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > post vfs_clone_file_range() completion on the server side, yes?
> >
>
> I chose to implement this using a full call to vfs_fsync_range(), since
> we really do want to ensure data stability as well. Consider, for
> instance, the case where client A is running an application, and client
> B runs vfs_clone_file_range() in order to create a point in time
> snapshot of the file for disaster recovery purposes...

Seems reasonable, since (alas) we didn't define the ->remap_file_range
API to guarantee that for you.
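
As a rough userspace analogue of the point above (the clone alone does
not promise persistence, so the caller has to ask for it), here is a
minimal sketch using the FICLONERANGE ioctl followed by fsync() on the
destination. The helper name clone_range_durable() is hypothetical, and
this is not the knfsd change under discussion, only an illustration of
the calling convention from an application's side.

```c
#include <errno.h>
#include <linux/fs.h>   /* FICLONERANGE, struct file_clone_range */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: clone a byte range from src_fd into dst_fd and
 * then fsync the destination so the remapped extents and their metadata
 * are forced to stable storage before success is reported.
 *
 * A len of 0 means "from src_off to the end of the source file".
 * Returns 0 on success or a negative errno value.
 */
int clone_range_durable(int src_fd, int dst_fd,
                        uint64_t src_off, uint64_t len, uint64_t dst_off)
{
    struct file_clone_range fcr = {
        .src_fd      = src_fd,
        .src_offset  = src_off,
        .src_length  = len,
        .dest_offset = dst_off,
    };

    /* The ioctl is issued on the destination file descriptor. */
    if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
        return -errno;

    /* The remap itself is not a persistence guarantee; fsync is. */
    if (fsync(dst_fd) < 0)
        return -errno;

    return 0;
}
```

On filesystems without reflink support the ioctl fails (typically with
EOPNOTSUPP or EINVAL) and the helper passes that error back unchanged.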
> > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > word for
> > > it.)
> > >
> > > > I'm asking because knfsd currently just does a call to
> > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It
> > > > does
> > > > not call fsync()/fsync_range() on the destination file, and since
> > > > the
> > > > NFSv4.2 protocol does not require you to perform any other
> > > > operation in
> > > > order to persist data/metadata, I'm worried that we may be
> > > > corrupting
> > > > the cloned file if the NFS server crashes at the wrong moment
> > > > after the
> > > > client has been told the clone completed.
> >
> > Yup, that's exactly what server side calls to commit_metadata() are
> > supposed to address.
> >
> > I suspect to be correct, this might require commit_metadata() to be
> > called on both the source and destination inodes, as both of them
> > may have modified metadata as a result of the clone operation. For
> > XFS one of them will be a no-op, but for other filesystems that
> > don't implement ->commit_metadata, we'll need to call
> > sync_inode_metadata() on both inodes...
> >
>
> That's interesting. I hadn't considered that a clone might cause the
> source metadata to change as well. What kind of change specifically are
> we talking about? Is it just delayed block allocation, or is there
> more?

In XFS's case, we added a per-inode flag that lets us bypass the
reference count lookup during a write if the file has never shared any
blocks, so if you never share anything, you never pay any of the runtime
costs of the COW mechanism. The first clone has to set that flag on the
source inode as well as on the destination.

ocfs2's design has a reference count tree that is shared between groups
of files that have been reflinked from each other. So if you start with
unshared files A and B and clone A to A1 and A2, and B to B1 and B2,
then the A* files will share one refcount tree and the B* files will
share another.

Either way, NFS has to assume that changes could have been made to the
source file.
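
Seen from userspace, "assume the source changed too" could be sketched
as flushing both files once the clone has returned. The helper name
sync_clone_pair() is hypothetical; the server-side fix would use the
in-kernel equivalents mentioned earlier in the thread rather than
fsync(2).

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: after a successful clone, flush both the
 * destination and the source, since the filesystem may have dirtied
 * metadata on either inode.
 *
 * Both fsync calls are always attempted. Returns 0 if both succeed,
 * otherwise the negative errno of the first failure.
 */
int sync_clone_pair(int src_fd, int dst_fd)
{
    int ret = 0;

    if (fsync(dst_fd) < 0)
        ret = -errno;

    /* Keep the first error, but still try to flush the source. */
    if (fsync(src_fd) < 0 && ret == 0)
        ret = -errno;

    return ret;
}
```

On a filesystem where the clone leaves the source untouched, the second
fsync should find little or nothing to write, so the extra call costs
little.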
--D

> Thanks
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>