From: Dave Chinner <dgc@kernel.org>
To: Hans Holmberg <Hans.Holmberg@wdc.com>
Cc: Carlos Maiolino <cem@kernel.org>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@lst.de>,
	Damien Le Moal <dlemoal@kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH v2] xfs: handle racing deletions in xfs_zone_gc_iter_irec
Date: Mon, 18 May 2026 19:44:36 +1000
Message-ID: <agrfhOT9ZremS605@dread>
In-Reply-To: <42149407-5702-4f8f-8973-28f3e705bdc5@wdc.com>

On Mon, May 18, 2026 at 08:55:58AM +0000, Hans Holmberg wrote:
> On 18/05/2026 09:48, Dave Chinner wrote:
> > On Mon, May 18, 2026 at 08:52:24AM +0200, Hans Holmberg wrote:
> >> Under heavy garbage collection pressure from RocksDB workloads,
> >> filesystem shutdowns can occur in xfs_zone_gc_iter_irec when
> >> xfs_iget() returns -EINVAL for deleted files.
> >>
> >> Fix this by handling -EINVAL just like we handle -ENOENT, allowing
> >> zone GC to safely ignore stale mappings.
> >>
> >> Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
> >> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> >> ---
> >>
> >> v2:
> >> - handle -EINVAL in the the caller in stead of switching error code
> >>   in xfs_imap_lookup
> >>
> >> v1:
> >> https://lore.kernel.org/linux-xfs/20260513063745.8067-1-hans.holmberg@wdc.com/
> >>
> >>
> >>  fs/xfs/xfs_zone_gc.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> >> index fedcc47048af..96f14d811db2 100644
> >> --- a/fs/xfs/xfs_zone_gc.c
> >> +++ b/fs/xfs/xfs_zone_gc.c
> >> @@ -400,7 +400,7 @@ xfs_zone_gc_iter_irec(
> >>  		/*
> >>  		 * If the inode was already deleted, skip over it.
> >>  		 */
> >> -		if (error == -ENOENT) {
> >> +		if (error == -ENOENT || error == -EINVAL) {
> >>  			iter->rec_idx++;
> >>  			goto retry;
> >>  		}
> > 
> > Why did you choose to do this? I was expecting the updated fix to be
> > dropping XFS_IGET_UNTRUSTED from the xfs_iget() call because the
> > inode numbers are coming from a trusted source (i.e. the rmapbt) and
> > this would then return the expected -ENOENT on unlink races instead
> > of -EINVAL....
> 
> I kept the XFS_IGET_UNTRUSTED because the xfs_rmap_query_range lookup
> that we do in xfs_zone_gc_query may not be valid by the time we call
> xfs_iget in xfs_zone_gc_iter_irec.

I'll repeat this once more: XFS_IGET_UNTRUSTED tells xfs_iget where
the inode number came from, not whether it *might* be valid or not.
xfs_iget() will detect races with unlink and do the right thing
regardless of whether IGET_UNTRUSTED is set.

IGET_UNTRUSTED is intended to be used when the inode number comes
from an external source. That may be userspace or the network
with file handle decoding, bulkstat (user supplied inode number),
corrupt metadata (e.g. during online repair when the inode clusters
cannot be fully trusted), etc.

In these situations, we should not be doing expensive validity
checks before attempting to read the inode from disk, and we should
not be giving any information back to the caller as
to whether the inode cluster is allocated or not on failure (e.g.
potentially giving away other valid inode numbers nearby). IOWs, we
don't want to be giving any indication of why the inode number is
considered invalid - it is just a bad inode number that was
provided, and because we don't trust the provider of the inode
number, that's all we will tell it.
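
To make the difference in error reporting concrete, here is a minimal
userspace toy model. It is not the kernel implementation: toy_iget(),
TOY_IGET_UNTRUSTED and the "allocated" argument are hypothetical names
invented for illustration. It only mirrors the behaviour described in
this thread, i.e. -EINVAL for an untrusted lookup that fails and
-ENOENT for a trusted lookup of an inode that is no longer allocated.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag for this toy model; not the kernel's definition. */
#define TOY_IGET_UNTRUSTED	0x1

/*
 * Toy stand-in for an inode lookup. "allocated" models whether the
 * inode is currently allocated on disk.
 */
static int toy_iget(bool allocated, unsigned int flags)
{
	if (allocated)
		return 0;

	/*
	 * Untrusted source: report only "bad inode number" and give
	 * no hint as to why the lookup failed.
	 */
	if (flags & TOY_IGET_UNTRUSTED)
		return -EINVAL;

	/*
	 * Trusted source: the number was valid when it was recorded,
	 * so tell the caller the inode is no longer allocated.
	 */
	return -ENOENT;
}
```

In this model toy_iget(false, TOY_IGET_UNTRUSTED) yields -EINVAL, while
toy_iget(false, 0) yields -ENOENT.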

However, inode numbers that come from internal metadata (i.e.
trusted sources) are -known to be valid-. They don't get placed into
the internal metadata if they are invalid inode numbers. Hence the
fact that it was in the rmapbt means it is, or very recently was, a
valid inode number, and we can trust it.

If xfs_iget then races with unlink on lookup, that is -just fine-
and it will return -ENOENT. In this case, xfs_iget() is telling the
caller that the inode is no longer allocated, because it is expected
that the caller can use this information and will do the right thing
with it.
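
As a rough sketch of "doing the right thing" on the caller side, the
following userspace toy model mimics the skip-and-retry pattern from
the quoted hunk (iter->rec_idx++; goto retry). The names struct toy_rec
and toy_gc_next_rec() are hypothetical, and the lookup result is
pre-recorded in each record rather than produced by a real xfs_iget()
call. It assumes a trusted lookup, so only -ENOENT is treated as a
benign race and every other error is passed back to the caller.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical record: holds the outcome of looking up its owner. */
struct toy_rec {
	int	iget_error;	/* 0 on success, negative errno otherwise */
};

/*
 * Advance *idx to the next record whose owner still exists.
 *
 * Returns 0 with *idx pointing at a usable record, 1 when the records
 * are exhausted, or a negative errno for an unexpected lookup failure.
 */
static int toy_gc_next_rec(const struct toy_rec *recs, size_t nr,
			   size_t *idx)
{
	while (*idx < nr) {
		int error = recs[*idx].iget_error;

		/* Owner was already deleted: skip this stale mapping. */
		if (error == -ENOENT) {
			(*idx)++;
			continue;
		}

		/* Either a usable record (0) or an unexpected error. */
		return error;
	}
	return 1;
}
```

With records whose lookups return -ENOENT, -ENOENT and 0, the sketch
skips the first two and stops at index 2. A record reporting -EINVAL is
not skipped here and the error is returned as-is.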

i.e. XFS_IGET_UNTRUSTED defines whether the inode number source
could be trusted to provide a valid inode number, not whether the
inode number can be guaranteed to be allocated at this specific
lookup instance.

-Dave.
-- 
Dave Chinner
dgc@kernel.org

Thread overview: 9+ messages
2026-05-18  6:52 [PATCH v2] xfs: handle racing deletions in xfs_zone_gc_iter_irec Hans Holmberg
2026-05-18  7:48 ` Dave Chinner
2026-05-18  8:55   ` Hans Holmberg
2026-05-18  9:17     ` Damien Le Moal
2026-05-18  9:44     ` Dave Chinner [this message]
2026-05-18 12:45       ` Christoph Hellwig
2026-05-18 12:43   ` Christoph Hellwig
2026-05-19  7:15 ` Christoph Hellwig
2026-05-30  6:35 ` Carlos Maiolino
