Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
Date: Thu, 24 Oct 2024 09:20:39 -0400	[thread overview]
Message-ID: <ZxpJp48vi4NpFVqJ@bfoster> (raw)
In-Reply-To: <20241024025142.4082218-1-david@fromorbit.com>

On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> We have had a large number of recent reports about cloud filesystems
> with "corrupt" inode records recently. They are all the same, and
> feature a filesystem that has been grown from a small size to a
> larger size (to 30G or 50G). In all cases, they have a very small
> runt AG at the end of the filesystem.  In the case of the 30GB
> filesystems, this is 1031 blocks long.
> 
> These filesystems start issuing corruption warnings when trying to
> allocate an in a sparse cluster at block 1024 of the runt AG. At
> this point, there shouldn't be a sparse inode cluster because there
> isn't space to fit an entire inode chunk (8 blocks) at block 1024.
> i.e. it is only 7 blocks from the end of the AG.
> 
> Hence the first bug is that we allowed allocation of a sparse inode
> cluster in this location when it should not have occurred. The first
> patch in the series addresses this.
> 
> However, there is actually nothing corrupt in the on-disk sparse
> inode record or inode cluster at agbno 1024. It is a 32 inode
> cluster, which means it is 4 blocks in length, so sits entirely
> within the AG and every inode in the record is addressable and
> accessible. The only thing we can't do is make the sparse inode
> record whole - the inode allocation code cannot allocate another 4
> blocks that span beyond the end of the AG. Hence this inode record
> and cluster remain sparse until all the inodes in it are freed and
> the cluster removed from disk.
> 
> The second bug is that we don't consider inodes beyond inode cluster
> alignment at the end of an AG as being valid. When sparse inode
> alignment is in use, we set the in-memory inode cluster alignment to
> match the inode chunk alignment, and so the maximum valid inode
> number is inode chunk aligned, not inode cluster aligned. Hence when
> we have an inode cluster at the end of the AG - so the max inode
> number is cluster aligned - we reject that entire cluster as being
> invalid. 
> 
> As stated above, there is nothing corrupt about the sparse inode
> cluster at the end of the AG, it just doesn't match an arbitrary
> alignment validation restriction for inodes at the end of the AG.
> Given we have production filesystems out there with sparse inode
> clusters allocated with cluster alignment at the end of the AG, we
> need to consider these inodes as valid and not error out with a
> corruption report.  The second patch addresses this.
> 
> The third issue I found is that we never validate the
> sb->sb_spino_align valid when we mount the filesystem. It could have
> any value and we just blindly use it when calculating inode
> allocation geometry. The third patch adds sb->sb_spino_align range
> validation to the superblock verifier.
> 
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...
> 
> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.
> 
> Thoughts?
> 

This all sounds reasonable on its own if the corruption is essentially
artifical and there is a path for code simplification, etc. That said, I
think there's a potential counter argument for skipping patch 1 though.
A couple things come to mind:

1. When this corrupted inode chunk allocation does occur, is it possible
to actually allocate an inode out of it, or does the error checking
logic prevent that? My sense was the latter, but I could be wrong. This
generally indicates whether user data is impacted or not if repair
resolves by tossing the chunk.

2. Would we recommend a user upgrade to a new kernel with a corruption
present that causes inode allocation failure?

My .02: under no circumstances would I run a distro/package upgrade on a
filesystem in that state before running repair, nor would I recommend
that to anyone else. The caveat to this is that even after a repair,
there's no guarantee an upgrade wouldn't go and realloc the same bad
chunk and end up right back in the same state, and thus fail just the
same.

For that reason, I'm not sure we can achieve a reliable workaround via a
kernel change on its own. I'm wondering if this requires something on
the repair side that either recommends growing the fs by a few blocks,
or perhaps if it finds this "unaligned runt" situation, actively does
something to prevent it.

For example, I assume allocating the last handful of blocks out of the
runt AG would prevent the problem. Of course that technically creates
another corruption by leaking blocks, but as long repair knows to keep
it in place so long as the fs geometry is susceptible, perhaps that
would work..? Hmm.. if those blocks are free then maybe a better option
would be to just truncate the last few blocks off the runt AG (i.e.
effectively reduce the fs size by the size of the sparse chunk
allocation), then the fs could be made consistent and functional. Hm?

Brian

next prev parent reply	other threads:[~2024-10-24 13:19 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-24  2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
2024-10-24  2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
2024-10-24  2:51 ` [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs Dave Chinner
2024-10-24 17:00   ` Darrick J. Wong
2024-10-25  6:43     ` Dave Chinner
2024-10-25 22:19       ` Darrick J. Wong
2024-10-26 21:47         ` Dave Chinner
2024-10-24  2:51 ` [PATCH 3/3] xfs: sb_spino_align is not verified Dave Chinner
2024-10-24 16:55   ` Darrick J. Wong
2024-10-25  6:33     ` Dave Chinner
2024-12-07  0:25   ` Luis Chamberlain
2024-12-07  0:32     ` Darrick J. Wong
2024-12-07  0:36       ` Luis Chamberlain
2024-12-07 11:34         ` Carlos Maiolino
2024-10-24 13:20 ` Brian Foster [this message]
2024-10-25  0:48   ` [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
2024-10-25 19:33     ` Brian Foster
2024-10-24 16:38 ` Darrick J. Wong
2024-10-25  6:29   ` Dave Chinner
2024-10-29 16:14 ` Eric Sandeen
2024-10-31 11:44   ` Carlos Maiolino
2024-10-31 20:45     ` Eric Sandeen
2024-10-31 22:13     ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZxpJp48vi4NpFVqJ@bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox