From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Catherine Hoang <catherine.hoang@oracle.com>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
Date: Tue, 19 Sep 2023 17:00:58 -0700 [thread overview]
Message-ID: <20230920000058.GF348037@frogsfrogsfrogs> (raw)
In-Reply-To: <ZQk23NIAcY0BDpfI@dread.disaster.area>
On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > Hi all,
> >
> > Darrick and I have been working on designing a new ioctl FICLONERANGE2. The
> > following text attempts to explain our needs and reasoning behind this decision.
> >
> >
> > Contents
> > --------
> > 1. Problem Statement
> > 2. Proof of Concept
> > 3. Proposed Solution
> > 4. User Interface
> > 5. Testing Plan
> >
> >
> > 1. Problem Statement
> > --------------------
> >
> > One of our VM cluster management products needs to snapshot KVM image files
> > so that they can be restored in case of failure. Snapshotting is done by
> > redirecting VM disk writes to a sidecar file and using reflink on the disk
> > image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
> > locks the source and destination files while it operates, which means that
> > reads from the main vm disk image are blocked, causing the vm to stall. When
> > an image file is heavily fragmented, the copy process could take several
> > minutes. Some of the vm image files have 50-100 million extent records, and
> > duplicating that much metadata locks the file for 30 minutes or more. Having
> > activities suspended for such a long time in a cluster node could result in
> > node eviction. A node eviction occurs when the cluster manager determines
> > that the vm is unresponsive. One of the criteria for determining that a VM
> > is unresponsive is the failure of filesystems in the guest to respond for an
> > unacceptably long time. In order to solve this problem, we need to provide a
> > variant of FICLONE that releases the file locks periodically to allow reads
> > to occur as vmbackup runs. The purpose of this feature is to allow vmbackup
> > to run without causing downtime.
>
> Interesting problem to have - let me see if I understand it
> properly.
>
> Writes are redirected away from the file being cloned, but reads go
> directly to the source file being cloned?
>
> But cloning can take a long time, so breaking up the clone operation
> into multiple discrete ranges will allow reads through
> to the file being cloned with minimal latency. However, you don't
> want writes to the source file because that results in the
> atomicity of the clone operation being violated and corrupting the
> snapshot.
>
> Hence the redirected writes ensure that the file being cloned does
> not change from syscall to syscall. This means the time interrupted
> clone operation can restart from where it left off and you still get
> an consistent image clone for the snapshot.
>
> Did I get that right?
Right.
> If so, I'm wondering about the general usefulness of this
> multi-syscall construct - having to ensure that it isn't written to
> between syscalls is quite the constraint.
Write isolation is not that much of a constraint. Qemu can set up the
sidecar internally and commit the sidecar back into the original image.
libvirt wraps this functionality.
> I wonder if we can do better than that and not need a new syscall;
> shared read + clone seems more like an inode extent list access
> serialisation problem than anything else...
>
> <thinks for a bit>
>
> Ok. a clone does not change any data in the source file.
Right. The only modifications it does is to fsync the range, and that's
only an implementation detail of ocfs2 & xfs.
> Neither do read IO operations.
>
> Hence from a data integrity perspective, there's no reason why read
> IO and FICLONE can't run concurrently on the source file.
<nod>
> Writes we still need to block so that the clone is an atomic
> point in time image of the file, but reads could be allowed.
<nod>
> The XFS clone implementation takes the IOLOCK_EXCL high up, and
> then lower down it iterates one extent doing the sharing operation.
> It holds the ILOCK_EXCL while it is modifying the extent in both the
> source and destination files, then commits the transaction and drops
> the ILOCKs.
>
> OK, so we have fine-grained ILOCK serialisation during the clone for
> access/modification to the extent list. Excellent, I think we can
> make this work.
>
> So:
>
> 1. take IOLOCK_EXCL like we already do on the source and destination
> files.
>
> 2. Once all the pre work is done, set a "clone in progress" flag on
> the in-memory source inode.
>
> 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
>
> 4. read IO and the clone serialise access to the extent list via the
> ILOCK. We know this works fine, because that's how the extent list
> access serialisation for concurrent read and write direct IO works.
>
> 5. buffered writes take the IOLOCK_EXCL, so they block until the
> clone completes. Same behaviour as right now, all good.
I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right? So
once reflink breaks the layouts, we're good there too?
> 6. direct IO writes need to be modified to check the "clone in
> progress" flag after taking the IOLOCK_SHARED. If it is set, we have
> to drop the IOLOCK_SHARED and take it IOLOCK_EXCL. This will block
> until the clone completes.
>
> 7. when the clone completes, we clear the "clone in progress" flag
> and drop all the IOLOCKs that are held.
>
> AFAICT, this will give us shared clone vs read and exclusive clone
> vs write IO semantics for all clone operations. And if I've
> understood the problem statement correctly, this will avoid the
> read IO latency problems that long running clone operations cause
> without needing a new syscall.
>
> Thoughts?
I think that'll work.
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2023-09-20 0:01 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-19 2:43 proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange Catherine Hoang
2023-09-19 3:31 ` Bagas Sanjaya
2023-09-19 5:51 ` Dave Chinner
2023-09-20 0:00 ` Darrick J. Wong [this message]
2023-09-20 1:07 ` Dave Chinner
2023-09-21 22:26 ` Darrick J. Wong
2023-09-21 23:18 ` Dave Chinner
2023-09-25 21:28 ` Darrick J. Wong
2023-09-26 21:18 ` Catherine Hoang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230920000058.GF348037@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=catherine.hoang@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.