public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Catherine Hoang <catherine.hoang@oracle.com>,
	Dave Chinner <david@fromorbit.com>
Cc: "linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange
Date: Mon, 25 Sep 2023 14:28:35 -0700	[thread overview]
Message-ID: <20230925212835.GB11439@frogsfrogsfrogs> (raw)
In-Reply-To: <ZQzPWJ/iojT0Vumi@dread.disaster.area>

On Fri, Sep 22, 2023 at 09:18:48AM +1000, Dave Chinner wrote:
> On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > > > > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > > > > then lower down it iterates one extent doing the sharing operation.
> > > > > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > > > > source and destination files, then commits the transaction and drops
> > > > > the ILOCKs.
> > > > > 
> > > > > OK, so we have fine-grained ILOCK serialisation during the clone for
> > > > > access/modification to the extent list. Excellent, I think we can
> > > > > make this work.
> > > > > 
> > > > > So:
> > > > > 
> > > > > 1. take IOLOCK_EXCL like we already do on the source and destination
> > > > > files.
> > > > > 
> > > > > 2. Once all the pre work is done, set a "clone in progress" flag on
> > > > > the in-memory source inode.
> > > > > 
> > > > > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > > > > 
> > > > > 4. read IO and the clone serialise access to the extent list via the
> > > > > ILOCK. We know this works fine, because that's how the extent list
> > > > > access serialisation for concurrent read and write direct IO works.
> > > > > 
> > > > > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > > > > clone completes. Same behaviour as right now, all good.
> > > > 
> > > > I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> > > > once reflink breaks the layouts, we're good there too?
> > > 
> > > I think so.
> > > 
> > > <looks to confirm>
> > > 
> > > The pnfs code in xfs_fs_map_blocks() will reject mappings on any
> > > inode marked with shared extents, so I think the fact that we
> > > set the inode as having shared extents before we finish
> > > xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
> > > we even take the IOLOCK.
> > > 
> > > But, regardless of that, both new PNFS mappings and DAX writes use
> > > IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
> > > layouts which will force them to finish what they are doing and sync
> > > data before the clone operation grabs the IOLOCK_EXCL. They'll block
> > > on the clone holding the IOLOCK from that point onwards, so I think
> > > we're good here.
> > > 
> > > hmmmmm.
> > > 
> > > <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
> > > 
> > > Sigh.
> > > 
> > > That will block buffered reads trying to instantiate new pages
> > > in the page cache. However, this isn't why the invalidate lock is
> > > held - that's being held to lock out lock page faults (i.e. mmap()
> > > access) whilst the clone is running.
> > > 
> > > 
> > > We really only need to lock out mmap writes, and the only way to do
> > > that is to prevent write faults from making progress whilst the
> > > clone is running.
> > > 
> > > __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
> > > faults - I think we need it to look at the "clone in progress" flag
> > > for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
> > > 
> > > That would then allow us to demote the invalidate lock on the source
> > > file the same way we do the IOLOCK, allowing buffered reads to
> > > populate the page caceh but have write faults block until the clone
> > > completes (as they do now, same as writes).
> > > 
> > > Is there anything else I missed?
> > 
> > I think that's it.  I'd wondered how much we really care about reflink
> > stalling read faults, but yeah, let's fix both.
> 
> Well, it's not so much about mmap as the fact that holding
> invalidate lock exclusive prevents adding or removing folios to the
> page cache from any path. Hence the change as I originally proposed
> would block the buffered read path trying to add pages to the page
> cache the same as it will block the read fault path....

Ah, ok.

Catherine: Do you have enough information to get started on a proof of
concept?

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2023-09-25 21:28 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-19  2:43 proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange Catherine Hoang
2023-09-19  3:31 ` Bagas Sanjaya
2023-09-19  5:51 ` Dave Chinner
2023-09-20  0:00   ` Darrick J. Wong
2023-09-20  1:07     ` Dave Chinner
2023-09-21 22:26       ` Darrick J. Wong
2023-09-21 23:18         ` Dave Chinner
2023-09-25 21:28           ` Darrick J. Wong [this message]
2023-09-26 21:18             ` Catherine Hoang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230925212835.GB11439@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=catherine.hoang@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox