proposal: enhance 'cp --reflink' to expose ioctl

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
@ 2023-09-19  2:43 Catherine Hoang
  2023-09-19  3:31 ` Bagas Sanjaya
  2023-09-19  5:51 ` Dave Chinner
  0 siblings, 2 replies; 9+ messages in thread
From: Catherine Hoang @ 2023-09-19  2:43 UTC (permalink / raw)
  To: linux-xfs@vger.kernel.org; +Cc: Darrick J. Wong

Hi all,

Darrick and I have been working on designing a new ioctl FICLONERANGE2. The
following text attempts to explain our needs and reasoning behind this decision. 

Contents
--------
1. Problem Statement
2. Proof of Concept
3. Proposed Solution
4. User Interface
5. Testing Plan

1. Problem Statement
--------------------

One of our VM cluster management products needs to snapshot KVM image files
so that they can be restored in case of failure. Snapshotting is done by
redirecting VM disk writes to a sidecar file and using reflink on the disk
image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
locks the source and destination files while it operates, which means that
reads from the main vm disk image are blocked, causing the vm to stall. When
an image file is heavily fragmented, the copy process could take several
minutes. Some of the vm image files have 50-100 million extent records, and
duplicating that much metadata locks the file for 30 minutes or more. Having
activities suspended for such a long time in a cluster node could result in
node eviction. A node eviction occurs when the cluster manager determines
that the vm is unresponsive. One of the criteria for determining that a VM
is unresponsive is the failure of filesystems in the guest to respond for an
unacceptably long time. In order to solve this problem, we need to provide a
variant of FICLONE that releases the file locks periodically to allow reads
to occur as vmbackup runs. The purpose of this feature is to allow vmbackup
to run without causing downtime.

2. Proof of Concept
-------------------
Doing reflink in chunks enables the kernel to drop the file lock between
chunks, allowing IO to proceed. Here we test this approach using a fixed
chunk size of 1MB. Testing this tool on a heavily fragmented image gives us
the following execution times:

Number of extents in the test file - 419746, size=150GB

command                       Time
cp --reflink                  18s
Fixed chunk copy(1MB chunks)  20s

We also performed these tests while simulating a readwrite workload on the
image using fio. The copy times obtained are shown below.

Using "cp --reflink"
read : io=497732KB, bw=4141.3KB/s, iops=1035, runt=120188msec
   lat (msec): min=41, max=18240, avg=467.09, stdev=1079.23 

write: io=498528KB, bw=4147.1KB/s, iops=1036, runt=120188msec
   lat (msec): min=44, max=17257, avg=520.01, stdev=1144.76 

Using chunk based copy with chunk size 1MB 
read : io=617476KB, bw=5136.8KB/s, iops=1284, runt=120209msec
   lat (msec): min=6, max=3849, avg=385.28, stdev=487.71 

write: io=617252KB, bw=5134.9KB/s, iops=1283, runt=120209msec
   lat (msec): min=7, max=3850, avg=411.95, stdev=512.18

These results demonstrate that periodically dropping the file lock reduces
IO latency on a heavily fragmented file. Our tests show a max IO latency of
17s with regular reflink copy and a max IO latency of 3s with chunk based copy.

3. Proposed Solution
--------------------
The command "cp --reflink" currently uses the FICLONE ioctl, which does not
have an option to provide a chunk size. Using the existing FICLONERANGE ioctl
would allow us to perform a chunk based copy as shown above.

case FICLONE:
	return ioctl_file_clone(filp, arg, 0, 0, 0);

case FICLONERANGE:
	return ioctl_file_clone_range(filp, argp);

However, we can improve on this method by implementing a time based copy, in
which we perform as much work as possible in a given time period. For example,
we could do 15s of work before releasing the file locks (recall a node eviction
occurs after ~30s). In order to implement a time based copy, we will need to
pass additional arguments through the ioctl. Because the struct used by
FICLONERANGE is already full, we are not able to add any new fields. Therefore,
we need to implement a new ioctl.

The proposed solution is to define a new ioctl FICLONERANGE2 which differs
from FICLONERANGE in two ways:

(1) FICLONERANGE2 will implement a time budget. There are two ways we can do this:
	(a) Define a flag that permits kernel exits (with -ERESTARTSYS) on
	regular signals. This is the least invasive to the kernel, since we
	already have mechanisms for queuing and checking for signals. This
	would not replace the current behavior of returning with -EINTR on fatal
	signals.
	(b) Add an explicit time budget field to the FICLONERANGE2 arguments
	structure and plumb that through the kernel calls.
(2) FICLONERANGE2 will return the work completion status. There are two ways we
can do this:
	(a) Add the amount of work done to the pos fields and subtract the
	amount of work done from the length field. This "cursor" like operation
	would set up userspace to call the kernel again if the request was
	only partially filled without having to update anything. This might
	be tricky given the "length==0" trick that means "reflink to the
	source file's EOF".
	(b) Provide an explicit field in the args structure to return the
	amount of work done and require userspace to adjust the pos/length fields.

4. User Interface
-----------------
The current arguments structure for FICLONERANGE is shown below.

struct file_clone_range {
	__s64 src_fd;
	__u64 src_offset;
	__u64 src_length;
	__u64 dest_offset;
};

The new FICLONERANGE2 arguments structure will likely be larger. Depending
on the chosen implementation, we may need several additional fields.

	__u64 flags;
	__u64 time_budget_ms;
	__u64 work_done;

5. Testing Plan
---------------

The fstests suite already contains tests for the existing clone functionality.
These tests can be found under the following groups:

clone - FICLONE/FICLONERANGE ioctls
clone_stress - stress testing FICLONE/FICLONERANGE

We will also need to create additional tests for the new FICLONERANGE2 ioctl.

- Write a test case that performs a FICLONERANGE2 copy with a time budget.
  If our implementation allows FICLONERANGE2 to be restarted after a signal
  interruption, we can test this by creating a loop and setting up a signal
  via alarm(2) or timer_create(2).
- Write a test case that generates a file with many extents and tests that
  the kernel exits with partial completion when given a very short time budget.

Comments and feedback appreciated!

Catherine

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-19  2:43 proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange Catherine Hoang
@ 2023-09-19  3:31 ` Bagas Sanjaya
  2023-09-19  5:51 ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Bagas Sanjaya @ 2023-09-19  3:31 UTC (permalink / raw)
  To: Catherine Hoang, linux-xfs@vger.kernel.org; +Cc: Darrick J. Wong

[-- Attachment #1: Type: text/plain, Size: 660 bytes --]

On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> - Write a test case that performs a FICLONERANGE2 copy with a time budget.
>   If our implementation allows FICLONERANGE2 to be restarted after a signal
>   interruption, we can test this by creating a loop and setting up a signal
>   via alarm(2) or timer_create(2).
> - Write a test case that generates a file with many extents and tests that
>   the kernel exits with partial completion when given a very short time budget.

For the actual tests, what are selected time budgets that are to be
implemented in the test?

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-19  2:43 proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange Catherine Hoang
  2023-09-19  3:31 ` Bagas Sanjaya
@ 2023-09-19  5:51 ` Dave Chinner
  2023-09-20  0:00   ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-09-19  5:51 UTC (permalink / raw)
  To: Catherine Hoang; +Cc: linux-xfs@vger.kernel.org, Darrick J. Wong

On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> Hi all,
> 
> Darrick and I have been working on designing a new ioctl FICLONERANGE2. The
> following text attempts to explain our needs and reasoning behind this decision. 
> 
> 
> Contents
> --------
> 1. Problem Statement
> 2. Proof of Concept
> 3. Proposed Solution
> 4. User Interface
> 5. Testing Plan
> 
> 
> 1. Problem Statement
> --------------------
> 
> One of our VM cluster management products needs to snapshot KVM image files
> so that they can be restored in case of failure. Snapshotting is done by
> redirecting VM disk writes to a sidecar file and using reflink on the disk
> image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
> locks the source and destination files while it operates, which means that
> reads from the main vm disk image are blocked, causing the vm to stall. When
> an image file is heavily fragmented, the copy process could take several
> minutes. Some of the vm image files have 50-100 million extent records, and
> duplicating that much metadata locks the file for 30 minutes or more. Having
> activities suspended for such a long time in a cluster node could result in
> node eviction. A node eviction occurs when the cluster manager determines
> that the vm is unresponsive. One of the criteria for determining that a VM
> is unresponsive is the failure of filesystems in the guest to respond for an
> unacceptably long time. In order to solve this problem, we need to provide a
> variant of FICLONE that releases the file locks periodically to allow reads
> to occur as vmbackup runs. The purpose of this feature is to allow vmbackup
> to run without causing downtime.

Interesting problem to have - let me see if I understand it
properly.

Writes are redirected away from the file being cloned, but reads go
directly to the source file being cloned?

But cloning can take a long time, so breaking up the clone operation
into multiple discrete ranges will allow reads through
to the file being cloned with minimal latency. However, you don't
want writes to the source file because that results in the
atomicity of the clone operation being violated and corrupting the
snapshot.

Hence the redirected writes ensure that the file being cloned does
not change from syscall to syscall. This means the time interrupted
clone operation can restart from where it left off and you still get
an consistent image clone for the snapshot.

Did I get that right?

If so, I'm wondering about the general usefulness of this
multi-syscall construct - having to ensure that it isn't written to
between syscalls is quite the constraint.

I wonder if we can do better than that and not need a new syscall;
shared read + clone seems more like an inode extent list access
serialisation problem than anything else...

<thinks for a bit>

Ok. a clone does not change any data in the source file.

Neither do read IO operations.

Hence from a data integrity perspective, there's no reason why read
IO and FICLONE can't run concurrently on the source file.

Writes we still need to block so that the clone is an atomic
point in time image of the file, but reads could be allowed.

The XFS clone implementation takes the IOLOCK_EXCL high up, and
then lower down it iterates one extent doing the sharing operation.
It holds the ILOCK_EXCL while it is modifying the extent in both the
source and destination files, then commits the transaction and drops
the ILOCKs.

OK, so we have fine-grained ILOCK serialisation during the clone for
access/modification to the extent list. Excellent, I think we can
make this work.

So:

1. take IOLOCK_EXCL like we already do on the source and destination
files.

2. Once all the pre work is done, set a "clone in progress" flag on
the in-memory source inode.

3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.

4. read IO and the clone serialise access to the extent list via the
ILOCK. We know this works fine, because that's how the extent list
access serialisation for concurrent read and write direct IO works.

5. buffered writes take the IOLOCK_EXCL, so they block until the
clone completes. Same behaviour as right now, all good.

6. direct IO writes need to be modified to check the "clone in
progress" flag after taking the IOLOCK_SHARED. If it is set, we have
to drop the IOLOCK_SHARED and take it IOLOCK_EXCL. This will block
until the clone completes.

7. when the clone completes, we clear the "clone in progress" flag
and drop all the IOLOCKs that are held.

AFAICT, this will give us shared clone vs read and exclusive clone
vs write IO semantics for all clone operations. And if I've
understood the problem statement correctly, this will avoid the
read IO latency problems that long running clone operations cause
without needing a new syscall.

Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-19  5:51 ` Dave Chinner
@ 2023-09-20  0:00   ` Darrick J. Wong
  2023-09-20  1:07     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2023-09-20  0:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Catherine Hoang, linux-xfs@vger.kernel.org

On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > Hi all,
> > 
> > Darrick and I have been working on designing a new ioctl FICLONERANGE2. The
> > following text attempts to explain our needs and reasoning behind this decision. 
> > 
> > 
> > Contents
> > --------
> > 1. Problem Statement
> > 2. Proof of Concept
> > 3. Proposed Solution
> > 4. User Interface
> > 5. Testing Plan
> > 
> > 
> > 1. Problem Statement
> > --------------------
> > 
> > One of our VM cluster management products needs to snapshot KVM image files
> > so that they can be restored in case of failure. Snapshotting is done by
> > redirecting VM disk writes to a sidecar file and using reflink on the disk
> > image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
> > locks the source and destination files while it operates, which means that
> > reads from the main vm disk image are blocked, causing the vm to stall. When
> > an image file is heavily fragmented, the copy process could take several
> > minutes. Some of the vm image files have 50-100 million extent records, and
> > duplicating that much metadata locks the file for 30 minutes or more. Having
> > activities suspended for such a long time in a cluster node could result in
> > node eviction. A node eviction occurs when the cluster manager determines
> > that the vm is unresponsive. One of the criteria for determining that a VM
> > is unresponsive is the failure of filesystems in the guest to respond for an
> > unacceptably long time. In order to solve this problem, we need to provide a
> > variant of FICLONE that releases the file locks periodically to allow reads
> > to occur as vmbackup runs. The purpose of this feature is to allow vmbackup
> > to run without causing downtime.
> 
> Interesting problem to have - let me see if I understand it
> properly.
> 
> Writes are redirected away from the file being cloned, but reads go
> directly to the source file being cloned?
> 
> But cloning can take a long time, so breaking up the clone operation
> into multiple discrete ranges will allow reads through
> to the file being cloned with minimal latency. However, you don't
> want writes to the source file because that results in the
> atomicity of the clone operation being violated and corrupting the
> snapshot.
> 
> Hence the redirected writes ensure that the file being cloned does
> not change from syscall to syscall. This means the time interrupted
> clone operation can restart from where it left off and you still get
> an consistent image clone for the snapshot.
> 
> Did I get that right?

Right.

> If so, I'm wondering about the general usefulness of this
> multi-syscall construct - having to ensure that it isn't written to
> between syscalls is quite the constraint.

Write isolation is not that much of a constraint.  Qemu can set up the
sidecar internally and commit the sidecar back into the original image.
libvirt wraps this functionality.

> I wonder if we can do better than that and not need a new syscall;
> shared read + clone seems more like an inode extent list access
> serialisation problem than anything else...
> 
> <thinks for a bit>
> 
> Ok. a clone does not change any data in the source file.

Right.  The only modifications it does is to fsync the range, and that's
only an implementation detail of ocfs2 & xfs.

> Neither do read IO operations.
> 
> Hence from a data integrity perspective, there's no reason why read
> IO and FICLONE can't run concurrently on the source file.

<nod>

> Writes we still need to block so that the clone is an atomic
> point in time image of the file, but reads could be allowed.

<nod>

> The XFS clone implementation takes the IOLOCK_EXCL high up, and
> then lower down it iterates one extent doing the sharing operation.
> It holds the ILOCK_EXCL while it is modifying the extent in both the
> source and destination files, then commits the transaction and drops
> the ILOCKs.
> 
> OK, so we have fine-grained ILOCK serialisation during the clone for
> access/modification to the extent list. Excellent, I think we can
> make this work.
> 
> So:
> 
> 1. take IOLOCK_EXCL like we already do on the source and destination
> files.
> 
> 2. Once all the pre work is done, set a "clone in progress" flag on
> the in-memory source inode.
> 
> 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> 
> 4. read IO and the clone serialise access to the extent list via the
> ILOCK. We know this works fine, because that's how the extent list
> access serialisation for concurrent read and write direct IO works.
> 
> 5. buffered writes take the IOLOCK_EXCL, so they block until the
> clone completes. Same behaviour as right now, all good.

I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
once reflink breaks the layouts, we're good there too?

> 6. direct IO writes need to be modified to check the "clone in
> progress" flag after taking the IOLOCK_SHARED. If it is set, we have
> to drop the IOLOCK_SHARED and take it IOLOCK_EXCL. This will block
> until the clone completes.
> 
> 7. when the clone completes, we clear the "clone in progress" flag
> and drop all the IOLOCKs that are held.
> 
> AFAICT, this will give us shared clone vs read and exclusive clone
> vs write IO semantics for all clone operations. And if I've
> understood the problem statement correctly, this will avoid the
> read IO latency problems that long running clone operations cause
> without needing a new syscall.
> 
> Thoughts?

I think that'll work.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-20  0:00   ` Darrick J. Wong
@ 2023-09-20  1:07     ` Dave Chinner
  2023-09-21 22:26       ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-09-20  1:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Catherine Hoang, linux-xfs@vger.kernel.org

On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > then lower down it iterates one extent doing the sharing operation.
> > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > source and destination files, then commits the transaction and drops
> > the ILOCKs.
> > 
> > OK, so we have fine-grained ILOCK serialisation during the clone for
> > access/modification to the extent list. Excellent, I think we can
> > make this work.
> > 
> > So:
> > 
> > 1. take IOLOCK_EXCL like we already do on the source and destination
> > files.
> > 
> > 2. Once all the pre work is done, set a "clone in progress" flag on
> > the in-memory source inode.
> > 
> > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > 
> > 4. read IO and the clone serialise access to the extent list via the
> > ILOCK. We know this works fine, because that's how the extent list
> > access serialisation for concurrent read and write direct IO works.
> > 
> > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > clone completes. Same behaviour as right now, all good.
> 
> I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> once reflink breaks the layouts, we're good there too?

I think so.

<looks to confirm>

The pnfs code in xfs_fs_map_blocks() will reject mappings on any
inode marked with shared extents, so I think the fact that we
set the inode as having shared extents before we finish
xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
we even take the IOLOCK.

But, regardless of that, both new PNFS mappings and DAX writes use
IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
layouts which will force them to finish what they are doing and sync
data before the clone operation grabs the IOLOCK_EXCL. They'll block
on the clone holding the IOLOCK from that point onwards, so I think
we're good here.

hmmmmm.

<notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>

Sigh.

That will block buffered reads trying to instantiate new pages
in the page cache. However, this isn't why the invalidate lock is
held - that's being held to lock out lock page faults (i.e. mmap()
access) whilst the clone is running.

We really only need to lock out mmap writes, and the only way to do
that is to prevent write faults from making progress whilst the
clone is running.

__xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
faults - I think we need it to look at the "clone in progress" flag
for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.

That would then allow us to demote the invalidate lock on the source
file the same way we do the IOLOCK, allowing buffered reads to
populate the page caceh but have write faults block until the clone
completes (as they do now, same as writes).

Is there anything else I missed?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-20  1:07     ` Dave Chinner
@ 2023-09-21 22:26       ` Darrick J. Wong
  2023-09-21 23:18         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2023-09-21 22:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Catherine Hoang, linux-xfs@vger.kernel.org

On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> > On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > > then lower down it iterates one extent doing the sharing operation.
> > > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > > source and destination files, then commits the transaction and drops
> > > the ILOCKs.
> > > 
> > > OK, so we have fine-grained ILOCK serialisation during the clone for
> > > access/modification to the extent list. Excellent, I think we can
> > > make this work.
> > > 
> > > So:
> > > 
> > > 1. take IOLOCK_EXCL like we already do on the source and destination
> > > files.
> > > 
> > > 2. Once all the pre work is done, set a "clone in progress" flag on
> > > the in-memory source inode.
> > > 
> > > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > > 
> > > 4. read IO and the clone serialise access to the extent list via the
> > > ILOCK. We know this works fine, because that's how the extent list
> > > access serialisation for concurrent read and write direct IO works.
> > > 
> > > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > > clone completes. Same behaviour as right now, all good.
> > 
> > I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> > once reflink breaks the layouts, we're good there too?
> 
> I think so.
> 
> <looks to confirm>
> 
> The pnfs code in xfs_fs_map_blocks() will reject mappings on any
> inode marked with shared extents, so I think the fact that we
> set the inode as having shared extents before we finish
> xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
> we even take the IOLOCK.
> 
> But, regardless of that, both new PNFS mappings and DAX writes use
> IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
> layouts which will force them to finish what they are doing and sync
> data before the clone operation grabs the IOLOCK_EXCL. They'll block
> on the clone holding the IOLOCK from that point onwards, so I think
> we're good here.
> 
> hmmmmm.
> 
> <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
> 
> Sigh.
> 
> That will block buffered reads trying to instantiate new pages
> in the page cache. However, this isn't why the invalidate lock is
> held - that's being held to lock out lock page faults (i.e. mmap()
> access) whilst the clone is running.
> 
> 
> We really only need to lock out mmap writes, and the only way to do
> that is to prevent write faults from making progress whilst the
> clone is running.
> 
> __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
> faults - I think we need it to look at the "clone in progress" flag
> for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
> 
> That would then allow us to demote the invalidate lock on the source
> file the same way we do the IOLOCK, allowing buffered reads to
> populate the page caceh but have write faults block until the clone
> completes (as they do now, same as writes).
> 
> Is there anything else I missed?

I think that's it.  I'd wondered how much we really care about reflink
stalling read faults, but yeah, let's fix both.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-21 22:26       ` Darrick J. Wong
@ 2023-09-21 23:18         ` Dave Chinner
  2023-09-25 21:28           ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-09-21 23:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Catherine Hoang, linux-xfs@vger.kernel.org

On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
> > On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> > > On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > > > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > > > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > > > then lower down it iterates one extent doing the sharing operation.
> > > > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > > > source and destination files, then commits the transaction and drops
> > > > the ILOCKs.
> > > > 
> > > > OK, so we have fine-grained ILOCK serialisation during the clone for
> > > > access/modification to the extent list. Excellent, I think we can
> > > > make this work.
> > > > 
> > > > So:
> > > > 
> > > > 1. take IOLOCK_EXCL like we already do on the source and destination
> > > > files.
> > > > 
> > > > 2. Once all the pre work is done, set a "clone in progress" flag on
> > > > the in-memory source inode.
> > > > 
> > > > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > > > 
> > > > 4. read IO and the clone serialise access to the extent list via the
> > > > ILOCK. We know this works fine, because that's how the extent list
> > > > access serialisation for concurrent read and write direct IO works.
> > > > 
> > > > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > > > clone completes. Same behaviour as right now, all good.
> > > 
> > > I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> > > once reflink breaks the layouts, we're good there too?
> > 
> > I think so.
> > 
> > <looks to confirm>
> > 
> > The pnfs code in xfs_fs_map_blocks() will reject mappings on any
> > inode marked with shared extents, so I think the fact that we
> > set the inode as having shared extents before we finish
> > xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
> > we even take the IOLOCK.
> > 
> > But, regardless of that, both new PNFS mappings and DAX writes use
> > IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
> > layouts which will force them to finish what they are doing and sync
> > data before the clone operation grabs the IOLOCK_EXCL. They'll block
> > on the clone holding the IOLOCK from that point onwards, so I think
> > we're good here.
> > 
> > hmmmmm.
> > 
> > <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
> > 
> > Sigh.
> > 
> > That will block buffered reads trying to instantiate new pages
> > in the page cache. However, this isn't why the invalidate lock is
> > held - that's being held to lock out lock page faults (i.e. mmap()
> > access) whilst the clone is running.
> > 
> > 
> > We really only need to lock out mmap writes, and the only way to do
> > that is to prevent write faults from making progress whilst the
> > clone is running.
> > 
> > __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
> > faults - I think we need it to look at the "clone in progress" flag
> > for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
> > 
> > That would then allow us to demote the invalidate lock on the source
> > file the same way we do the IOLOCK, allowing buffered reads to
> > populate the page caceh but have write faults block until the clone
> > completes (as they do now, same as writes).
> > 
> > Is there anything else I missed?
> 
> I think that's it.  I'd wondered how much we really care about reflink
> stalling read faults, but yeah, let's fix both.

Well, it's not so much about mmap as the fact that holding
invalidate lock exclusive prevents adding or removing folios to the
page cache from any path. Hence the change as I originally proposed
would block the buffered read path trying to add pages to the page
cache the same as it will block the read fault path....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-21 23:18         ` Dave Chinner
@ 2023-09-25 21:28           ` Darrick J. Wong
  2023-09-26 21:18             ` Catherine Hoang
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2023-09-25 21:28 UTC (permalink / raw)
  To: Catherine Hoang, Dave Chinner; +Cc: linux-xfs@vger.kernel.org

On Fri, Sep 22, 2023 at 09:18:48AM +1000, Dave Chinner wrote:
> On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > > > > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > > > > then lower down it iterates one extent doing the sharing operation.
> > > > > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > > > > source and destination files, then commits the transaction and drops
> > > > > the ILOCKs.
> > > > > 
> > > > > OK, so we have fine-grained ILOCK serialisation during the clone for
> > > > > access/modification to the extent list. Excellent, I think we can
> > > > > make this work.
> > > > > 
> > > > > So:
> > > > > 
> > > > > 1. take IOLOCK_EXCL like we already do on the source and destination
> > > > > files.
> > > > > 
> > > > > 2. Once all the pre work is done, set a "clone in progress" flag on
> > > > > the in-memory source inode.
> > > > > 
> > > > > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > > > > 
> > > > > 4. read IO and the clone serialise access to the extent list via the
> > > > > ILOCK. We know this works fine, because that's how the extent list
> > > > > access serialisation for concurrent read and write direct IO works.
> > > > > 
> > > > > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > > > > clone completes. Same behaviour as right now, all good.
> > > > 
> > > > I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> > > > once reflink breaks the layouts, we're good there too?
> > > 
> > > I think so.
> > > 
> > > <looks to confirm>
> > > 
> > > The pnfs code in xfs_fs_map_blocks() will reject mappings on any
> > > inode marked with shared extents, so I think the fact that we
> > > set the inode as having shared extents before we finish
> > > xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
> > > we even take the IOLOCK.
> > > 
> > > But, regardless of that, both new PNFS mappings and DAX writes use
> > > IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
> > > layouts which will force them to finish what they are doing and sync
> > > data before the clone operation grabs the IOLOCK_EXCL. They'll block
> > > on the clone holding the IOLOCK from that point onwards, so I think
> > > we're good here.
> > > 
> > > hmmmmm.
> > > 
> > > <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
> > > 
> > > Sigh.
> > > 
> > > That will block buffered reads trying to instantiate new pages
> > > in the page cache. However, this isn't why the invalidate lock is
> > > held - that's being held to lock out lock page faults (i.e. mmap()
> > > access) whilst the clone is running.
> > > 
> > > 
> > > We really only need to lock out mmap writes, and the only way to do
> > > that is to prevent write faults from making progress whilst the
> > > clone is running.
> > > 
> > > __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
> > > faults - I think we need it to look at the "clone in progress" flag
> > > for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
> > > 
> > > That would then allow us to demote the invalidate lock on the source
> > > file the same way we do the IOLOCK, allowing buffered reads to
> > > populate the page caceh but have write faults block until the clone
> > > completes (as they do now, same as writes).
> > > 
> > > Is there anything else I missed?
> > 
> > I think that's it.  I'd wondered how much we really care about reflink
> > stalling read faults, but yeah, let's fix both.
> 
> Well, it's not so much about mmap as the fact that holding
> invalidate lock exclusive prevents adding or removing folios to the
> page cache from any path. Hence the change as I originally proposed
> would block the buffered read path trying to add pages to the page
> cache the same as it will block the read fault path....

Ah, ok.

Catherine: Do you have enough information to get started on a proof of
concept?

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
  2023-09-25 21:28           ` Darrick J. Wong
@ 2023-09-26 21:18             ` Catherine Hoang
  0 siblings, 0 replies; 9+ messages in thread
From: Catherine Hoang @ 2023-09-26 21:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs@vger.kernel.org

> On Sep 25, 2023, at 2:28 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Fri, Sep 22, 2023 at 09:18:48AM +1000, Dave Chinner wrote:
>> On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote:
>>> On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
>>>> On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
>>>>> On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
>>>>>> On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
>>>>>> The XFS clone implementation takes the IOLOCK_EXCL high up, and
>>>>>> then lower down it iterates one extent doing the sharing operation.
>>>>>> It holds the ILOCK_EXCL while it is modifying the extent in both the
>>>>>> source and destination files, then commits the transaction and drops
>>>>>> the ILOCKs.
>>>>>> 
>>>>>> OK, so we have fine-grained ILOCK serialisation during the clone for
>>>>>> access/modification to the extent list. Excellent, I think we can
>>>>>> make this work.
>>>>>> 
>>>>>> So:
>>>>>> 
>>>>>> 1. take IOLOCK_EXCL like we already do on the source and destination
>>>>>> files.
>>>>>> 
>>>>>> 2. Once all the pre work is done, set a "clone in progress" flag on
>>>>>> the in-memory source inode.
>>>>>> 
>>>>>> 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
>>>>>> 
>>>>>> 4. read IO and the clone serialise access to the extent list via the
>>>>>> ILOCK. We know this works fine, because that's how the extent list
>>>>>> access serialisation for concurrent read and write direct IO works.
>>>>>> 
>>>>>> 5. buffered writes take the IOLOCK_EXCL, so they block until the
>>>>>> clone completes. Same behaviour as right now, all good.
>>>>> 
>>>>> I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
>>>>> once reflink breaks the layouts, we're good there too?
>>>> 
>>>> I think so.
>>>> 
>>>> <looks to confirm>
>>>> 
>>>> The pnfs code in xfs_fs_map_blocks() will reject mappings on any
>>>> inode marked with shared extents, so I think the fact that we
>>>> set the inode as having shared extents before we finish
>>>> xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
>>>> we even take the IOLOCK.
>>>> 
>>>> But, regardless of that, both new PNFS mappings and DAX writes use
>>>> IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
>>>> layouts which will force them to finish what they are doing and sync
>>>> data before the clone operation grabs the IOLOCK_EXCL. They'll block
>>>> on the clone holding the IOLOCK from that point onwards, so I think
>>>> we're good here.
>>>> 
>>>> hmmmmm.
>>>> 
>>>> <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
>>>> 
>>>> Sigh.
>>>> 
>>>> That will block buffered reads trying to instantiate new pages
>>>> in the page cache. However, this isn't why the invalidate lock is
>>>> held - that's being held to lock out lock page faults (i.e. mmap()
>>>> access) whilst the clone is running.
>>>> 
>>>> 
>>>> We really only need to lock out mmap writes, and the only way to do
>>>> that is to prevent write faults from making progress whilst the
>>>> clone is running.
>>>> 
>>>> __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
>>>> faults - I think we need it to look at the "clone in progress" flag
>>>> for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
>>>> 
>>>> That would then allow us to demote the invalidate lock on the source
>>>> file the same way we do the IOLOCK, allowing buffered reads to
>>>> populate the page caceh but have write faults block until the clone
>>>> completes (as they do now, same as writes).
>>>> 
>>>> Is there anything else I missed?
>>> 
>>> I think that's it.  I'd wondered how much we really care about reflink
>>> stalling read faults, but yeah, let's fix both.
>> 
>> Well, it's not so much about mmap as the fact that holding
>> invalidate lock exclusive prevents adding or removing folios to the
>> page cache from any path. Hence the change as I originally proposed
>> would block the buffered read path trying to add pages to the page
>> cache the same as it will block the read fault path....
> 
> Ah, ok.
> 
> Catherine: Do you have enough information to get started on a proof of
> concept?

Yup, this solution makes sense to me.

> 
> --D
> 
>> Cheers,
>> 
>> Dave.
>> 
>> -- 
>> Dave Chinner
>> david@fromorbit.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-09-27  0:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-19  2:43 proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange Catherine Hoang
2023-09-19  3:31 ` Bagas Sanjaya
2023-09-19  5:51 ` Dave Chinner
2023-09-20  0:00   ` Darrick J. Wong
2023-09-20  1:07     ` Dave Chinner
2023-09-21 22:26       ` Darrick J. Wong
2023-09-21 23:18         ` Dave Chinner
2023-09-25 21:28           ` Darrick J. Wong
2023-09-26 21:18             ` Catherine Hoang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox