From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Carlos Maiolino <cem@kernel.org>,
Zorro Lang <zlang@kernel.org>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] xfs: flush inodegc before swapon
Date: Thu, 6 Feb 2025 08:53:59 +1100 [thread overview]
Message-ID: <Z6Pd9wG2sqVZSKjQ@dread.disaster.area> (raw)
In-Reply-To: <20250205211659.GC21808@frogsfrogsfrogs>
On Wed, Feb 05, 2025 at 01:16:59PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 06, 2025 at 08:08:14AM +1100, Dave Chinner wrote:
> > On Wed, Feb 05, 2025 at 05:28:00PM +0100, Christoph Hellwig wrote:
> > > Fix the brand new xfstest that tries to swapon on a recently unshared
> > > file and use the chance to document the other bit of magic in this
> > > function.
> >
> > You haven't documented the magic at all - I have no clue what the
> > bug being fixed is nor how adding an inodegc flush fixes anything
> > to do with swap file activation....
> >
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > > fs/xfs/xfs_aops.c | 18 +++++++++++++++++-
> > > 1 file changed, 17 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 69b8c2d1937d..c792297aa0a3 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -21,6 +21,7 @@
> > > #include "xfs_error.h"
> > > #include "xfs_zone_alloc.h"
> > > #include "xfs_rtgroup.h"
> > > +#include "xfs_icache.h"
> > >
> > > struct xfs_writepage_ctx {
> > > struct iomap_writepage_ctx ctx;
> > > @@ -685,7 +686,22 @@ xfs_iomap_swapfile_activate(
> > > struct file *swap_file,
> > > sector_t *span)
> > > {
> > > - sis->bdev = xfs_inode_buftarg(XFS_I(file_inode(swap_file)))->bt_bdev;
> > > + struct xfs_inode *ip = XFS_I(file_inode(swap_file));
> > > +
> > > + /*
> > > + * Ensure inode GC has finished to remove unmapped extents, as the
> > > + * reflink bit is only cleared once all previously shared extents
> > > + * are unmapped. Otherwise swapon could incorrectly fail on a
> > > + * very recently unshare file.
> > > + */
> > > + xfs_inodegc_flush(ip->i_mount);
> >
> > The comment doesn't explains what this actually fixes. Inodes that
> > are processed by inodegc *must* be unreferenced by the VFS, so it
> > is not clear exactly what this is actually doing.
> >
> > I'm guessing that the test in question is doing something like this:
> >
> > file2 = clone(file1)
> > unlink(file1)
> > swapon(file2)
> >
> > and so the swap file activation is racing with the background
> > inactivation and extent removal of file1?
>
> Yes, I think hch is referring to this:
> https://lore.kernel.org/fstests/2c9ff99c2bcaec4412b0903e03949d5a3ad0d817.1736783467.git.fdmanana@suse.com/
>
> > But in that case, the extents are being removed from file1, and at
> > no time does that remove the reflink bit on file2. i.e. even if the
> > inactivation of file1 results in all the extents in file2 no longer
> > being shared, that only results in refcountbt updates and it does
> > not get propagated back to file2's inode. i.e. file2 will still be
> > marked as a reflink file containing shared extents.
>
> Right, but the (iomap) swapfile activation code only errors out if the
> filesystem gives it a mapping that is marked as shared. So the reflink
> flag isn't relevant here.
>
> How about this for a better comment:
>
> "Ensure inode GC has finished so that unlinked clones of this file have
> been truncated and inactivated fully. This is to ensure that walking
> the swap file does not find any shared extents."
Even talking about it in terms on "inodegc" seems like
misdirection to me. Now I understand what this flush is working
around, it is clear to me that swapon could race the same way with
any other operation that removes extents from cloned files (e.g.
hole punch, truncate, etc).
however, from a user perspective, the only one that matters -right
now- is unlink because of the deferred processing of extent removal.
But even that isn't a guarantee - if something else has that cloned
file open, then the unlinked inode won't be queued for inodegc
and so swapon will still fail regardless of the inodegc flush.
Hence I think this needs to explain the race with extent removal and
cloned files, then explain that the inodegc flush is a workaround
that applies only to a specific corner case w.r.t. unlinking clones
before swapon is run.
Something like:
/*
* Swap file activation is can race against concurrent shared extent
* removal in files that have been cloned. If this happens,
* iomap_swapfile_iter() can fail because it encountered a shared
* extent even though an operation is in progress to remove those
* shared extents.
*
* This race becomes problematic when we defer extent removal
* operations beyond the end of a syscall (i.e. use async background
* processing algorithms). Users think the extents are no longer
* shared, but iomap_swapfile_iter() still sees them as shared
* because the refcountbt entries for the extents being removed have
* not yet been updated. Hence the swapon call fails unexpectedly.
*
* The race condition is currently most obvious from the unlink()
* operation as extent removal is deferred until after the last
* reference to the inode goes away. We then process the extent
* removal asynchronously, hence triggers the "syscall completed but
* work not done" condition mentioned above. To close this race
* window, we need to flush any pending inodegc operations to ensure
* they have updated the refcountbt records before we try to map the
* swapfile.
*/
This explains the race condition we are working around, and it gives
enough information to document that any other refcountbt updates we
defer to background processing (either removals or inserts!) are
going to need to be synchronised here.
-Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2025-02-05 21:54 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-05 16:27 fix swapon for recently unshared blocks Christoph Hellwig
2025-02-05 16:28 ` [PATCH 1/2] xfs: flush inodegc before swapon Christoph Hellwig
2025-02-05 18:23 ` Darrick J. Wong
2025-02-05 21:08 ` Dave Chinner
2025-02-05 21:16 ` Darrick J. Wong
2025-02-05 21:53 ` Dave Chinner [this message]
2025-02-05 22:05 ` Darrick J. Wong
2025-02-05 16:28 ` [PATCH 2/2] xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate Christoph Hellwig
2025-02-05 18:24 ` Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2025-02-06 6:14 fix swapon for recently unshared blocks v2 Christoph Hellwig
2025-02-06 6:15 ` [PATCH 1/2] xfs: flush inodegc before swapon Christoph Hellwig
2025-02-11 8:44 ` Carlos Maiolino
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z6Pd9wG2sqVZSKjQ@dread.disaster.area \
--to=david@fromorbit.com \
--cc=cem@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
--cc=zlang@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox