From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>, Carlos Maiolino <cem@kernel.org>,
Zorro Lang <zlang@kernel.org>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] xfs: flush inodegc before swapon
Date: Wed, 5 Feb 2025 14:05:41 -0800 [thread overview]
Message-ID: <20250205220541.GF21808@frogsfrogsfrogs> (raw)
In-Reply-To: <Z6Pd9wG2sqVZSKjQ@dread.disaster.area>
On Thu, Feb 06, 2025 at 08:53:59AM +1100, Dave Chinner wrote:
> On Wed, Feb 05, 2025 at 01:16:59PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 06, 2025 at 08:08:14AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 05, 2025 at 05:28:00PM +0100, Christoph Hellwig wrote:
> > > > Fix the brand new xfstest that tries to swapon on a recently unshared
> > > > file and use the chance to document the other bit of magic in this
> > > > function.
> > >
> > > You haven't documented the magic at all - I have no clue what the
> > > bug being fixed is nor how adding an inodegc flush fixes anything
> > > to do with swap file activation....
> > >
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > ---
> > > > fs/xfs/xfs_aops.c | 18 +++++++++++++++++-
> > > > 1 file changed, 17 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index 69b8c2d1937d..c792297aa0a3 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -21,6 +21,7 @@
> > > > #include "xfs_error.h"
> > > > #include "xfs_zone_alloc.h"
> > > > #include "xfs_rtgroup.h"
> > > > +#include "xfs_icache.h"
> > > >
> > > > struct xfs_writepage_ctx {
> > > > struct iomap_writepage_ctx ctx;
> > > > @@ -685,7 +686,22 @@ xfs_iomap_swapfile_activate(
> > > > struct file *swap_file,
> > > > sector_t *span)
> > > > {
> > > > - sis->bdev = xfs_inode_buftarg(XFS_I(file_inode(swap_file)))->bt_bdev;
> > > > + struct xfs_inode *ip = XFS_I(file_inode(swap_file));
> > > > +
> > > > + /*
> > > > + * Ensure inode GC has finished to remove unmapped extents, as the
> > > > + * reflink bit is only cleared once all previously shared extents
> > > > + * are unmapped. Otherwise swapon could incorrectly fail on a
> > > > + * very recently unshare file.
> > > > + */
> > > > + xfs_inodegc_flush(ip->i_mount);
> > >
> > > The comment doesn't explains what this actually fixes. Inodes that
> > > are processed by inodegc *must* be unreferenced by the VFS, so it
> > > is not clear exactly what this is actually doing.
> > >
> > > I'm guessing that the test in question is doing something like this:
> > >
> > > file2 = clone(file1)
> > > unlink(file1)
> > > swapon(file2)
> > >
> > > and so the swap file activation is racing with the background
> > > inactivation and extent removal of file1?
> >
> > Yes, I think hch is referring to this:
> > https://lore.kernel.org/fstests/2c9ff99c2bcaec4412b0903e03949d5a3ad0d817.1736783467.git.fdmanana@suse.com/
> >
> > > But in that case, the extents are being removed from file1, and at
> > > no time does that remove the reflink bit on file2. i.e. even if the
> > > inactivation of file1 results in all the extents in file2 no longer
> > > being shared, that only results in refcountbt updates and it does
> > > not get propagated back to file2's inode. i.e. file2 will still be
> > > marked as a reflink file containing shared extents.
> >
> > Right, but the (iomap) swapfile activation code only errors out if the
> > filesystem gives it a mapping that is marked as shared. So the reflink
> > flag isn't relevant here.
> >
> > How about this for a better comment:
> >
> > "Ensure inode GC has finished so that unlinked clones of this file have
> > been truncated and inactivated fully. This is to ensure that walking
> > the swap file does not find any shared extents."
>
> Even talking about it in terms on "inodegc" seems like
> misdirection to me. Now I understand what this flush is working
> around, it is clear to me that swapon could race the same way with
> any other operation that removes extents from cloned files (e.g.
> hole punch, truncate, etc).
>
> however, from a user perspective, the only one that matters -right
> now- is unlink because of the deferred processing of extent removal.
>
> But even that isn't a guarantee - if something else has that cloned
> file open, then the unlinked inode won't be queued for inodegc
> and so swapon will still fail regardless of the inodegc flush.
>
> Hence I think this needs to explain the race with extent removal and
> cloned files, then explain that the inodegc flush is a workaround
> that applies only to a specific corner case w.r.t. unlinking clones
> before swapon is run.
>
> Something like:
>
> /*
> * Swap file activation is can race against concurrent shared extent
"..can race..."
> * removal in files that have been cloned. If this happens,
> * iomap_swapfile_iter() can fail because it encountered a shared
> * extent even though an operation is in progress to remove those
> * shared extents.
> *
> * This race becomes problematic when we defer extent removal
> * operations beyond the end of a syscall (i.e. use async background
> * processing algorithms). Users think the extents are no longer
> * shared, but iomap_swapfile_iter() still sees them as shared
> * because the refcountbt entries for the extents being removed have
> * not yet been updated. Hence the swapon call fails unexpectedly.
> *
> * The race condition is currently most obvious from the unlink()
> * operation as extent removal is deferred until after the last
> * reference to the inode goes away. We then process the extent
> * removal asynchronously, hence triggers the "syscall completed but
> * work not done" condition mentioned above. To close this race
> * window, we need to flush any pending inodegc operations to ensure
> * they have updated the refcountbt records before we try to map the
> * swapfile.
Yes, this is a good explanation.
> */
>
> This explains the race condition we are working around, and it gives
> enough information to document that any other refcountbt updates we
> defer to background processing (either removals or inserts!) are
> going to need to be synchronised here.
There shouldn't be any refcount increments involving the swapfile
because the mm already took IOLOCK_EXCL for us. But yes, there could
someday be more asynchronous decrements elsewhere in the filesystem.
--D
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
next prev parent reply other threads:[~2025-02-05 22:05 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-05 16:27 fix swapon for recently unshared blocks Christoph Hellwig
2025-02-05 16:28 ` [PATCH 1/2] xfs: flush inodegc before swapon Christoph Hellwig
2025-02-05 18:23 ` Darrick J. Wong
2025-02-05 21:08 ` Dave Chinner
2025-02-05 21:16 ` Darrick J. Wong
2025-02-05 21:53 ` Dave Chinner
2025-02-05 22:05 ` Darrick J. Wong [this message]
2025-02-05 16:28 ` [PATCH 2/2] xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate Christoph Hellwig
2025-02-05 18:24 ` Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2025-02-06 6:14 fix swapon for recently unshared blocks v2 Christoph Hellwig
2025-02-06 6:15 ` [PATCH 1/2] xfs: flush inodegc before swapon Christoph Hellwig
2025-02-11 8:44 ` Carlos Maiolino
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250205220541.GF21808@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=cem@kernel.org \
--cc=david@fromorbit.com \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
--cc=zlang@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox