* [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
@ 2024-10-24 2:51 Dave Chinner
2024-10-24 2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Dave Chinner @ 2024-10-24 2:51 UTC (permalink / raw)
To: linux-xfs
We have had a large number of recent reports about cloud filesystems
with "corrupt" inode records. They are all the same, and
feature a filesystem that has been grown from a small size to a
larger size (to 30G or 50G). In all cases, they have a very small
runt AG at the end of the filesystem. In the case of the 30GB
filesystems, this is 1031 blocks long.
These filesystems start issuing corruption warnings when trying to
allocate an inode in a sparse cluster at block 1024 of the runt AG. At
this point, there shouldn't be a sparse inode cluster because there
isn't space to fit an entire inode chunk (8 blocks) at block 1024.
i.e. it is only 7 blocks from the end of the AG.
Hence the first bug is that we allowed allocation of a sparse inode
cluster in this location when it should not have occurred. The first
patch in the series addresses this.
However, there is actually nothing corrupt in the on-disk sparse
inode record or inode cluster at agbno 1024. It is a 32 inode
cluster, which means it is 4 blocks in length, so sits entirely
within the AG and every inode in the record is addressable and
accessible. The only thing we can't do is make the sparse inode
record whole - the inode allocation code cannot allocate another 4
blocks that span beyond the end of the AG. Hence this inode record
and cluster remain sparse until all the inodes in it are freed and
the cluster removed from disk.
The second bug is that we don't consider inodes beyond inode cluster
alignment at the end of an AG as being valid. When sparse inode
alignment is in use, we set the in-memory inode cluster alignment to
match the inode chunk alignment, and so the maximum valid inode
number is inode chunk aligned, not inode cluster aligned. Hence when
we have an inode cluster at the end of the AG - so the max inode
number is cluster aligned - we reject that entire cluster as being
invalid.
As stated above, there is nothing corrupt about the sparse inode
cluster at the end of the AG, it just doesn't match an arbitrary
alignment validation restriction for inodes at the end of the AG.
Given we have production filesystems out there with sparse inode
clusters allocated with cluster alignment at the end of the AG, we
need to consider these inodes as valid and not error out with a
corruption report. The second patch addresses this.
The third issue I found is that we never validate the
sb->sb_spino_align value when we mount the filesystem. It could have
any value and we just blindly use it when calculating inode
allocation geometry. The third patch adds sb->sb_spino_align range
validation to the superblock verifier.
There is one question that needs to be resolved in this patchset: if
we take patch 2 to allow sparse inodes at the end of the AG, why
would we need the change in patch 1? Indeed, at this point I have to
ask why we even need the min/max agbno guidelines to the inode chunk
allocation as we end up allowing any aligned location in the AG to
be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
unnecessary and now we can remove a bunch of code (min/max_agbno
constraints) from the allocator paths...
I'd prefer that we take the latter path: ignore the first patch.
This results in more flexible behaviour, allows existing filesystems
with this issue to work without needing xfs_repair to fix them, and
we get to remove complexity from the code.
Thoughts?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/3] xfs: fix sparse inode limits on runt AG
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
@ 2024-10-24 2:51 ` Dave Chinner
2024-10-24 2:51 ` [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs Dave Chinner
` (4 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2024-10-24 2:51 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
The runt AG at the end of a filesystem is almost always smaller than
the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
limit for the inode chunk allocation, we do not take this into
account. This means we can allocate a sparse inode chunk that
overlaps beyond the end of an AG. When we go to allocate an inode
from that sparse chunk, the irec fails validation because the
agbno of the start of the irec is beyond valid limits for the runt
AG.
Prevent this from happening by taking into account the size of the
runt AG when allocating inode chunks. Also convert the various
checks for valid inode chunk agbnos to use xfs_ag_block_count()
so that they will also catch such issues in the future.
Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 271855227514..6258527315f2 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -855,7 +855,8 @@ xfs_ialloc_ag_alloc(
* the end of the AG.
*/
args.min_agbno = args.mp->m_sb.sb_inoalignmt;
- args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
+ args.max_agbno = round_down(xfs_ag_block_count(args.mp,
+ pag->pag_agno),
args.mp->m_sb.sb_inoalignmt) -
igeo->ialloc_blks;
@@ -2332,9 +2333,9 @@ xfs_difree(
return -EINVAL;
}
agbno = XFS_AGINO_TO_AGBNO(mp, agino);
- if (agbno >= mp->m_sb.sb_agblocks) {
- xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
- __func__, agbno, mp->m_sb.sb_agblocks);
+ if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
+ xfs_warn(mp, "%s: agbno >= xfs_ag_block_count (%d >= %d).",
+ __func__, agbno, xfs_ag_block_count(mp, pag->pag_agno));
ASSERT(0);
return -EINVAL;
}
@@ -2457,7 +2458,7 @@ xfs_imap(
*/
agino = XFS_INO_TO_AGINO(mp, ino);
agbno = XFS_AGINO_TO_AGBNO(mp, agino);
- if (agbno >= mp->m_sb.sb_agblocks ||
+ if (agbno >= xfs_ag_block_count(mp, pag->pag_agno) ||
ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
error = -EINVAL;
#ifdef DEBUG
@@ -2467,11 +2468,12 @@ xfs_imap(
*/
if (flags & XFS_IGET_UNTRUSTED)
return error;
- if (agbno >= mp->m_sb.sb_agblocks) {
+ if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
xfs_alert(mp,
"%s: agbno (0x%llx) >= mp->m_sb.sb_agblocks (0x%lx)",
__func__, (unsigned long long)agbno,
- (unsigned long)mp->m_sb.sb_agblocks);
+ (unsigned long)xfs_ag_block_count(mp,
+ pag->pag_agno));
}
if (ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
xfs_alert(mp,
--
2.45.2
* [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
2024-10-24 2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
@ 2024-10-24 2:51 ` Dave Chinner
2024-10-24 17:00 ` Darrick J. Wong
2024-10-24 2:51 ` [PATCH 3/3] xfs: sb_spino_align is not verified Dave Chinner
` (3 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2024-10-24 2:51 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
Due to the failure to correctly limit sparse inode chunk allocation
in runt AGs, we now have many production filesystems with sparse
inode chunks allocated across the end of the runt AG. xfs_repair
or a growfs is needed to fix this situation, neither of which is
particularly appealing.
The on disk layout from the metadump shows AG 12 as a runt that is
1031 blocks in length and the last inode chunk allocated on disk at
agino 8192.
$ xfs_db -c "agi 12" -c "p length" -c "p newino" \
> -c "convert agno 12 agino 8192 agbno" \
> -c "a free_root" -c p /mnt/scratch/t.img
length = 1031
newino = 8192
0x400 (1024)
magic = 0x46494233
level = 0
numrecs = 3
leftsib = null
rightsib = null
bno = 62902208
lsn = 0xb5500001849
uuid = e941c927-8697-4c16-a828-bc98e3878f7d
owner = 12
crc = 0xfe0a5c41 (correct)
recs[1-3] = [startino,holemask,count,freecount,free]
1:[128,0,64,11,0xc1ff00]
2:[256,0,64,3,0xb]
3:[8192,0xff00,32,32,0xffffffffffffffff]
The agbno of the inode chunk is 0x400 (1024), but there are only 7
blocks from there to the end of the AG. No inode cluster should have
been allocated there, but the bug fixed in the previous commit
allowed that. We can see from the finobt record #3 that there is a
sparse inode record at agbno 1024 that is for 32 inodes - 4 blocks
worth of inodes. Hence we have a valid inode cluster from agbno
1024-1027 on disk, and we are trying to allocate inodes from it.
This is due to the sparse inode feature requiring sb->sb_spino_align
being set to the inode cluster size, whilst the sb->sb_inoalignmt is
set to the full chunk size. The args.max_agbno bug in sparse inode
allocation allows an inode cluster at the start of the irec that is
sb_spino_align aligned and sized, while the remainder of the irec
extends beyond EOAG.
There is actually nothing wrong with having a sparse inode cluster
that ends up overlapping the end of the runt AG - it just means that
attempts to make it non-sparse will fail because there's no
contiguous space available to fill out the chunk. However, we can't
even get that far because xfs_inobt_get_rec() will validate the
irec->ir_startino and xfs_verify_agino() will fail on an irec that
spans beyond the end of the AG:
XFS (loop0): finobt record corruption in AG 12 detected at xfs_inobt_check_irec+0x44/0xb0!
XFS (loop0): start inode 0x2000, count 0x20, free 0x20 freemask 0xffffffffffffffff, holemask 0xff00
Hence the actual maximum agino we could allocate is the size of the
AG rounded down to the size of an inode cluster, not the size of
a full inode chunk. Modify the __xfs_agino_range() code to take this
sparse inode case into account and hence allow use of the already
allocated sparse inode chunk at the end of a runt AG.
That change, alone, however, is not sufficient, as
xfs_inobt_get_rec() hard codes the maximum inode number in the chunk
and attempts to verify the last inode number in the chunk. This
fails because the end of the sparse inode record is beyond the end of
the AG. Hence we have to look at the hole mask in the sparse inode
record to determine where the highest allocated inode is. We then
use the calculated high inode number to determine if the allocated
sparse inode cluster fits within the AG.
With this, inode allocation on a sparse inode cluster at the end
of a runt AG now succeeds. Hence any filesystem that has allocated a
cluster in this location will no longer fail allocation and issue
corruption warnings.
Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_ag.c | 47 ++++++++++++++++++++++++++++++--------
fs/xfs/libxfs/xfs_ialloc.c | 20 +++++++++++++---
2 files changed, 54 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 5ca8d0106827..33290af6ab01 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -238,15 +238,36 @@ xfs_ag_block_count(
mp->m_sb.sb_dblocks);
}
-/* Calculate the first and last possible inode number in an AG. */
+/*
+ * Calculate the first and last possible inode number in an AG.
+ *
+ * Due to a bug in sparse inode allocation for the runt AG at the end of the
+ * filesystem, we can have a valid sparse inode chunk on disk that spans beyond
+ * the end of the AG. Sparse inode chunks have special alignment - the full
+ * chunk must always be naturally aligned, and the regions that are allocated
+ * sparsely are cluster sized and aligned.
+ *
+ * The result of this is that for sparse inode setups, sb->sb_inoalignmt is
+ * always the size of the chunk, and that means M_IGEO(mp)->cluster_align isn't
+ * actually cluster alignment, it is chunk alignment. That means a sparse inode
+ * cluster that overlaps the end of the AG can never be valid based on "cluster
+ * alignment" even though all the inodes allocated within the sparse chunk are
+ * within the valid bounds of the AG and so can be used.
+ *
+ * Hence for the runt AG, the valid maximum inode number is based on sparse
+ * inode cluster alignment (sb->sb_spino_align) and not the "cluster alignment"
+ * value.
+ */
static void
__xfs_agino_range(
struct xfs_mount *mp,
+ xfs_agnumber_t agno,
xfs_agblock_t eoag,
xfs_agino_t *first,
xfs_agino_t *last)
{
xfs_agblock_t bno;
+ xfs_agblock_t end_align;
/*
* Calculate the first inode, which will be in the first
@@ -259,7 +280,12 @@ __xfs_agino_range(
* Calculate the last inode, which will be at the end of the
* last (aligned) cluster that can be allocated in the AG.
*/
- bno = round_down(eoag, M_IGEO(mp)->cluster_align);
+ if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
+ end_align = mp->m_sb.sb_spino_align;
+ else
+ end_align = M_IGEO(mp)->cluster_align;
+
+ bno = round_down(eoag, end_align);
*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
}
@@ -270,7 +296,8 @@ xfs_agino_range(
xfs_agino_t *first,
xfs_agino_t *last)
{
- return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
+ return __xfs_agino_range(mp, agno, xfs_ag_block_count(mp, agno),
+ first, last);
}
int
@@ -284,7 +311,7 @@ xfs_update_last_ag_size(
return -EFSCORRUPTED;
pag->block_count = __xfs_ag_block_count(mp, prev_agcount - 1,
mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
- __xfs_agino_range(mp, pag->block_count, &pag->agino_min,
+ __xfs_agino_range(mp, pag->pag_agno, pag->block_count, &pag->agino_min,
&pag->agino_max);
xfs_perag_rele(pag);
return 0;
@@ -345,8 +372,8 @@ xfs_initialize_perag(
pag->block_count = __xfs_ag_block_count(mp, index, new_agcount,
dblocks);
pag->min_block = XFS_AGFL_BLOCK(mp);
- __xfs_agino_range(mp, pag->block_count, &pag->agino_min,
- &pag->agino_max);
+ __xfs_agino_range(mp, pag->pag_agno, pag->block_count,
+ &pag->agino_min, &pag->agino_max);
}
index = xfs_set_inode_alloc(mp, new_agcount);
@@ -932,8 +959,8 @@ xfs_ag_shrink_space(
/* Update perag geometry */
pag->block_count -= delta;
- __xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
- &pag->agino_max);
+ __xfs_agino_range(mp, pag->pag_agno, pag->block_count,
+ &pag->agino_min, &pag->agino_max);
xfs_ialloc_log_agi(*tpp, agibp, XFS_AGI_LENGTH);
xfs_alloc_log_agf(*tpp, agfbp, XFS_AGF_LENGTH);
@@ -1003,8 +1030,8 @@ xfs_ag_extend_space(
/* Update perag geometry */
pag->block_count = be32_to_cpu(agf->agf_length);
- __xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
- &pag->agino_max);
+ __xfs_agino_range(pag->pag_mount, pag->pag_agno, pag->block_count,
+ &pag->agino_min, &pag->agino_max);
return 0;
}
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 6258527315f2..d68b53334990 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -108,22 +108,36 @@ xfs_inobt_rec_freecount(
return hweight64(realfree);
}
+/* Compute the highest allocated inode in an incore inode record. */
+static xfs_agino_t
+xfs_inobt_rec_highino(
+ const struct xfs_inobt_rec_incore *irec)
+{
+ if (xfs_inobt_issparse(irec->ir_holemask))
+ return xfs_highbit64(xfs_inobt_irec_to_allocmask(irec));
+ return XFS_INODES_PER_CHUNK;
+}
+
/* Simple checks for inode records. */
xfs_failaddr_t
xfs_inobt_check_irec(
struct xfs_perag *pag,
const struct xfs_inobt_rec_incore *irec)
{
+ xfs_agino_t high_ino = xfs_inobt_rec_highino(irec);
+
/* Record has to be properly aligned within the AG. */
if (!xfs_verify_agino(pag, irec->ir_startino))
return __this_address;
- if (!xfs_verify_agino(pag,
- irec->ir_startino + XFS_INODES_PER_CHUNK - 1))
+
+ if (!xfs_verify_agino(pag, irec->ir_startino + high_ino - 1))
return __this_address;
+
if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
irec->ir_count > XFS_INODES_PER_CHUNK)
return __this_address;
- if (irec->ir_freecount > XFS_INODES_PER_CHUNK)
+
+ if (irec->ir_freecount > irec->ir_count)
return __this_address;
if (xfs_inobt_rec_freecount(irec) != irec->ir_freecount)
--
2.45.2
* [PATCH 3/3] xfs: sb_spino_align is not verified
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
2024-10-24 2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
2024-10-24 2:51 ` [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs Dave Chinner
@ 2024-10-24 2:51 ` Dave Chinner
2024-10-24 16:55 ` Darrick J. Wong
2024-12-07 0:25 ` Luis Chamberlain
2024-10-24 13:20 ` [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Brian Foster
` (2 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2024-10-24 2:51 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
It's just read in from the superblock and used without doing any
validity checks at all on the value.
Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_sb.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d95409f3cba6..0d181bc140f0 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -398,6 +398,20 @@ xfs_validate_sb_common(
sbp->sb_inoalignmt, align);
return -EINVAL;
}
+
+ if (!sbp->sb_spino_align ||
+ sbp->sb_spino_align > sbp->sb_inoalignmt ||
+ (sbp->sb_inoalignmt % sbp->sb_spino_align) != 0) {
+ xfs_warn(mp,
+ "Sparse inode alignment (%u) is invalid.",
+ sbp->sb_spino_align);
+ return -EINVAL;
+ }
+ } else if (sbp->sb_spino_align) {
+ xfs_warn(mp,
+ "Sparse inode alignment (%u) should be zero.",
+ sbp->sb_spino_align);
+ return -EINVAL;
}
} else if (sbp->sb_qflags & (XFS_PQUOTA_ENFD | XFS_GQUOTA_ENFD |
XFS_PQUOTA_CHKD | XFS_GQUOTA_CHKD)) {
--
2.45.2
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
` (2 preceding siblings ...)
2024-10-24 2:51 ` [PATCH 3/3] xfs: sb_spino_align is not verified Dave Chinner
@ 2024-10-24 13:20 ` Brian Foster
2024-10-25 0:48 ` Dave Chinner
2024-10-24 16:38 ` Darrick J. Wong
2024-10-29 16:14 ` Eric Sandeen
5 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2024-10-24 13:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> We have had a large number of recent reports about cloud filesystems
> with "corrupt" inode records. They are all the same, and
> feature a filesystem that has been grown from a small size to a
> larger size (to 30G or 50G). In all cases, they have a very small
> runt AG at the end of the filesystem. In the case of the 30GB
> filesystems, this is 1031 blocks long.
>
> These filesystems start issuing corruption warnings when trying to
> allocate an inode in a sparse cluster at block 1024 of the runt AG. At
> this point, there shouldn't be a sparse inode cluster because there
> isn't space to fit an entire inode chunk (8 blocks) at block 1024.
> i.e. it is only 7 blocks from the end of the AG.
>
> Hence the first bug is that we allowed allocation of a sparse inode
> cluster in this location when it should not have occurred. The first
> patch in the series addresses this.
>
> However, there is actually nothing corrupt in the on-disk sparse
> inode record or inode cluster at agbno 1024. It is a 32 inode
> cluster, which means it is 4 blocks in length, so sits entirely
> within the AG and every inode in the record is addressable and
> accessible. The only thing we can't do is make the sparse inode
> record whole - the inode allocation code cannot allocate another 4
> blocks that span beyond the end of the AG. Hence this inode record
> and cluster remain sparse until all the inodes in it are freed and
> the cluster removed from disk.
>
> The second bug is that we don't consider inodes beyond inode cluster
> alignment at the end of an AG as being valid. When sparse inode
> alignment is in use, we set the in-memory inode cluster alignment to
> match the inode chunk alignment, and so the maximum valid inode
> number is inode chunk aligned, not inode cluster aligned. Hence when
> we have an inode cluster at the end of the AG - so the max inode
> number is cluster aligned - we reject that entire cluster as being
> invalid.
>
> As stated above, there is nothing corrupt about the sparse inode
> cluster at the end of the AG, it just doesn't match an arbitrary
> alignment validation restriction for inodes at the end of the AG.
> Given we have production filesystems out there with sparse inode
> clusters allocated with cluster alignment at the end of the AG, we
> need to consider these inodes as valid and not error out with a
> corruption report. The second patch addresses this.
>
> The third issue I found is that we never validate the
> sb->sb_spino_align value when we mount the filesystem. It could have
> any value and we just blindly use it when calculating inode
> allocation geometry. The third patch adds sb->sb_spino_align range
> validation to the superblock verifier.
>
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...
>
> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.
>
> Thoughts?
>
This all sounds reasonable on its own if the corruption is essentially
artificial and there is a path for code simplification, etc. That said, I
think there's a potential counterargument for skipping patch 1.
A couple things come to mind:
1. When this corrupted inode chunk allocation does occur, is it possible
to actually allocate an inode out of it, or does the error checking
logic prevent that? My sense was the latter, but I could be wrong. This
generally indicates whether user data is impacted or not if repair
resolves by tossing the chunk.
2. Would we recommend a user upgrade to a new kernel with a corruption
present that causes inode allocation failure?
My .02: under no circumstances would I run a distro/package upgrade on a
filesystem in that state before running repair, nor would I recommend
that to anyone else. The caveat to this is that even after a repair,
there's no guarantee an upgrade wouldn't go and realloc the same bad
chunk and end up right back in the same state, and thus fail just the
same.
For that reason, I'm not sure we can achieve a reliable workaround via a
kernel change on its own. I'm wondering if this requires something on
the repair side that either recommends growing the fs by a few blocks,
or perhaps if it finds this "unaligned runt" situation, actively does
something to prevent it.
For example, I assume allocating the last handful of blocks out of the
runt AG would prevent the problem. Of course that technically creates
another corruption by leaking blocks, but as long as repair knows to keep
it in place so long as the fs geometry is susceptible, perhaps that
would work..? Hmm.. if those blocks are free then maybe a better option
would be to just truncate the last few blocks off the runt AG (i.e.
effectively reduce the fs size by the size of the sparse chunk
allocation), then the fs could be made consistent and functional. Hm?
Brian
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
` (3 preceding siblings ...)
2024-10-24 13:20 ` [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Brian Foster
@ 2024-10-24 16:38 ` Darrick J. Wong
2024-10-25 6:29 ` Dave Chinner
2024-10-29 16:14 ` Eric Sandeen
5 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2024-10-24 16:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> We have had a large number of recent reports about cloud filesystems
> with "corrupt" inode records. They are all the same, and
> feature a filesystem that has been grown from a small size to a
> larger size (to 30G or 50G). In all cases, they have a very small
> runt AG at the end of the filesystem. In the case of the 30GB
> filesystems, this is 1031 blocks long.
>
> These filesystems start issuing corruption warnings when trying to
> allocate an inode in a sparse cluster at block 1024 of the runt AG. At
> this point, there shouldn't be a sparse inode cluster because there
> isn't space to fit an entire inode chunk (8 blocks) at block 1024.
> i.e. it is only 7 blocks from the end of the AG.
>
> Hence the first bug is that we allowed allocation of a sparse inode
> cluster in this location when it should not have occurred. The first
> patch in the series addresses this.
>
> However, there is actually nothing corrupt in the on-disk sparse
> inode record or inode cluster at agbno 1024. It is a 32 inode
> cluster, which means it is 4 blocks in length, so sits entirely
> within the AG and every inode in the record is addressable and
> accessible. The only thing we can't do is make the sparse inode
> record whole - the inode allocation code cannot allocate another 4
> blocks that span beyond the end of the AG. Hence this inode record
> and cluster remain sparse until all the inodes in it are freed and
> the cluster removed from disk.
>
> The second bug is that we don't consider inodes beyond inode cluster
> alignment at the end of an AG as being valid. When sparse inode
> alignment is in use, we set the in-memory inode cluster alignment to
> match the inode chunk alignment, and so the maximum valid inode
> number is inode chunk aligned, not inode cluster aligned. Hence when
> we have an inode cluster at the end of the AG - so the max inode
> number is cluster aligned - we reject that entire cluster as being
> invalid.
>
> As stated above, there is nothing corrupt about the sparse inode
> cluster at the end of the AG, it just doesn't match an arbitrary
> alignment validation restriction for inodes at the end of the AG.
> Given we have production filesystems out there with sparse inode
> clusters allocated with cluster alignment at the end of the AG, we
> need to consider these inodes as valid and not error out with a
> corruption report. The second patch addresses this.
>
> The third issue I found is that we never validate the
> sb->sb_spino_align value when we mount the filesystem. It could have
> any value and we just blindly use it when calculating inode
> allocation geometry. The third patch adds sb->sb_spino_align range
> validation to the superblock verifier.
>
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...
Yeah, I think we can only adjust the codebase to allow chunks that cross
EOAG as long as the clusters don't.
> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.
Do xfs_repair/scrub trip over these sparse chunks that cross EOAG,
or are they ok?
--D
> Thoughts?
>
* Re: [PATCH 3/3] xfs: sb_spino_align is not verified
2024-10-24 2:51 ` [PATCH 3/3] xfs: sb_spino_align is not verified Dave Chinner
@ 2024-10-24 16:55 ` Darrick J. Wong
2024-10-25 6:33 ` Dave Chinner
2024-12-07 0:25 ` Luis Chamberlain
1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2024-10-24 16:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 01:51:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> It's just read in from the superblock and used without doing any
> validity checks at all on the value.
>
> Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2
Oof yeah that's quite a gap!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/libxfs/xfs_sb.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index d95409f3cba6..0d181bc140f0 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -398,6 +398,20 @@ xfs_validate_sb_common(
> sbp->sb_inoalignmt, align);
> return -EINVAL;
> }
> +
> + if (!sbp->sb_spino_align ||
> + sbp->sb_spino_align > sbp->sb_inoalignmt ||
> + (sbp->sb_inoalignmt % sbp->sb_spino_align) != 0) {
> + xfs_warn(mp,
> + "Sparse inode alignment (%u) is invalid.",
> + sbp->sb_spino_align);
> + return -EINVAL;
> + }
> + } else if (sbp->sb_spino_align) {
> + xfs_warn(mp,
> + "Sparse inode alignment (%u) should be zero.",
> + sbp->sb_spino_align);
> + return -EINVAL;
> }
> } else if (sbp->sb_qflags & (XFS_PQUOTA_ENFD | XFS_GQUOTA_ENFD |
> XFS_PQUOTA_CHKD | XFS_GQUOTA_CHKD)) {
> --
> 2.45.2
>
>
* Re: [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs
2024-10-24 2:51 ` [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs Dave Chinner
@ 2024-10-24 17:00 ` Darrick J. Wong
2024-10-25 6:43 ` Dave Chinner
0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2024-10-24 17:00 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Due to the failure to correctly limit sparse inode chunk allocation
> in runt AGs, we now have many production filesystems with sparse
> inode chunks allocated across the end of the runt AG. xfs_repair
> or a growfs is needed to fix this situation, neither of which is
> particularly appealing.
>
> The on disk layout from the metadump shows AG 12 as a runt that is
> 1031 blocks in length and the last inode chunk allocated on disk at
> agino 8192.
Does this problem also happen on non-runt AGs? If the only free space
that could be turned into a sparse cluster is unaligned space at the
end of AG 0, would you still get the same corruption error?
--D
> $ xfs_db -c "agi 12" -c "p length" -c "p newino" \
> > -c "convert agno 12 agino 8192 agbno" \
> > -c "a free_root" -c p /mnt/scratch/t.img
> length = 1031
> newino = 8192
> 0x400 (1024)
> magic = 0x46494233
> level = 0
> numrecs = 3
> leftsib = null
> rightsib = null
> bno = 62902208
> lsn = 0xb5500001849
> uuid = e941c927-8697-4c16-a828-bc98e3878f7d
> owner = 12
> crc = 0xfe0a5c41 (correct)
> recs[1-3] = [startino,holemask,count,freecount,free]
> 1:[128,0,64,11,0xc1ff00]
> 2:[256,0,64,3,0xb]
> 3:[8192,0xff00,32,32,0xffffffffffffffff]
>
> The agbno of the inode chunk is 0x400 (1024), but there are only 7
> blocks from there to the end of the AG. No inode cluster should have
> been allocated there, but the bug fixed in the previous commit
> allowed that. We can see from the finobt record #3 that there is a
> sparse inode record at agbno 1024 that is for 32 inodes - 4 blocks
> worth of inodes. Hence we have a valid inode cluster from agbno
> 1024-1027 on disk, and we are trying to allocate inodes from it.
>
> This is due to the sparse inode feature requiring sb->sb_spino_align
> being set to the inode cluster size, whilst the sb->sb_inoalignmt is
> set to the full chunk size. The args.max_agbno bug in sparse inode
> alignment allows an inode cluster at the start of the irec which is
> sb_spino_align aligned and sized, but leaves the remainder of the
> irec beyond EOAG.
>
> There is actually nothing wrong with having a sparse inode cluster
> that ends up overlapping the end of the runt AG - it just means that
> attempts to make it non-sparse will fail because there's no
> contiguous space available to fill out the chunk. However, we can't
> even get that far because xfs_inobt_get_rec() will validate the
> irec->ir_startino and xfs_verify_agino() will fail on an irec that
> spans beyond the end of the AG:
>
> XFS (loop0): finobt record corruption in AG 12 detected at xfs_inobt_check_irec+0x44/0xb0!
> XFS (loop0): start inode 0x2000, count 0x20, free 0x20 freemask 0xffffffffffffffff, holemask 0xff00
>
> Hence the actual maximum agino we could allocate is the size of the
> AG rounded down by the size of an inode cluster, not the size of
> a full inode chunk. Modify the __xfs_agino_range() code to take this
> sparse inode case into account and hence allow use of the already
> allocated sparse inode chunk at the end of a runt AG.
>
> That change, alone, however, is not sufficient, as
> xfs_inobt_get_rec() hard codes the maximum inode number in the chunk
> and attempts to verify the last inode number in the chunk. This
> fails because the end of the sparse inode record is beyond the end
> of the AG. Hence we have to look at the hole mask in the sparse inode
> record to determine where the highest allocated inode is. We then
> use the calculated high inode number to determine if the allocated
> sparse inode cluster fits within the AG.
>
> With this, inode allocation on a sparse inode cluster at the end
> of a runt AG now succeeds. Hence any filesystem that has allocated a
> cluster in this location will no longer fail allocation and issue
> corruption warnings.
>
> Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/libxfs/xfs_ag.c | 47 ++++++++++++++++++++++++++++++--------
> fs/xfs/libxfs/xfs_ialloc.c | 20 +++++++++++++---
> 2 files changed, 54 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 5ca8d0106827..33290af6ab01 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -238,15 +238,36 @@ xfs_ag_block_count(
> mp->m_sb.sb_dblocks);
> }
>
> -/* Calculate the first and last possible inode number in an AG. */
> +/*
> + * Calculate the first and last possible inode number in an AG.
> + *
> + * Due to a bug in sparse inode allocation for the runt AG at the end of the
> + * filesystem, we can have a valid sparse inode chunk on disk that spans beyond
> + * the end of the AG. Sparse inode chunks have special alignment - the full
> + * chunk must always be naturally aligned, and the regions that are allocated
> + * sparsely are cluster sized and aligned.
> + *
> + * The result of this is that for sparse inode setups, sb->sb_inoalignmt is
> + * always the size of the chunk, and that means M_IGEO(mp)->cluster_align isn't
> + * actually cluster alignment, it is chunk alignment. That means a sparse inode
> + * cluster that overlaps the end of the AG can never be valid based on "cluster
> + * alignment" even though all the inodes allocated within the sparse chunk are
> + * within the valid bounds of the AG and so can be used.
> + *
> + * Hence for the runt AG, the valid maximum inode number is based on sparse
> + * inode cluster alignment (sb->sb_spino_align) and not the "cluster alignment"
> + * value.
> + */
> static void
> __xfs_agino_range(
> struct xfs_mount *mp,
> + xfs_agnumber_t agno,
> xfs_agblock_t eoag,
> xfs_agino_t *first,
> xfs_agino_t *last)
> {
> xfs_agblock_t bno;
> + xfs_agblock_t end_align;
>
> /*
> * Calculate the first inode, which will be in the first
> @@ -259,7 +280,12 @@ __xfs_agino_range(
> * Calculate the last inode, which will be at the end of the
> * last (aligned) cluster that can be allocated in the AG.
> */
> - bno = round_down(eoag, M_IGEO(mp)->cluster_align);
> + if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
> + end_align = mp->m_sb.sb_spino_align;
> + else
> + end_align = M_IGEO(mp)->cluster_align;
> +
> + bno = round_down(eoag, end_align);
> *last = XFS_AGB_TO_AGINO(mp, bno) - 1;
> }
>
> @@ -270,7 +296,8 @@ xfs_agino_range(
> xfs_agino_t *first,
> xfs_agino_t *last)
> {
> - return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
> + return __xfs_agino_range(mp, agno, xfs_ag_block_count(mp, agno),
> + first, last);
> }
>
> int
> @@ -284,7 +311,7 @@ xfs_update_last_ag_size(
> return -EFSCORRUPTED;
> pag->block_count = __xfs_ag_block_count(mp, prev_agcount - 1,
> mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
> - __xfs_agino_range(mp, pag->block_count, &pag->agino_min,
> + __xfs_agino_range(mp, pag->pag_agno, pag->block_count, &pag->agino_min,
> &pag->agino_max);
> xfs_perag_rele(pag);
> return 0;
> @@ -345,8 +372,8 @@ xfs_initialize_perag(
> pag->block_count = __xfs_ag_block_count(mp, index, new_agcount,
> dblocks);
> pag->min_block = XFS_AGFL_BLOCK(mp);
> - __xfs_agino_range(mp, pag->block_count, &pag->agino_min,
> - &pag->agino_max);
> + __xfs_agino_range(mp, pag->pag_agno, pag->block_count,
> + &pag->agino_min, &pag->agino_max);
> }
>
> index = xfs_set_inode_alloc(mp, new_agcount);
> @@ -932,8 +959,8 @@ xfs_ag_shrink_space(
>
> /* Update perag geometry */
> pag->block_count -= delta;
> - __xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
> - &pag->agino_max);
> + __xfs_agino_range(mp, pag->pag_agno, pag->block_count,
> + &pag->agino_min, &pag->agino_max);
>
> xfs_ialloc_log_agi(*tpp, agibp, XFS_AGI_LENGTH);
> xfs_alloc_log_agf(*tpp, agfbp, XFS_AGF_LENGTH);
> @@ -1003,8 +1030,8 @@ xfs_ag_extend_space(
>
> /* Update perag geometry */
> pag->block_count = be32_to_cpu(agf->agf_length);
> - __xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
> - &pag->agino_max);
> + __xfs_agino_range(pag->pag_mount, pag->pag_agno, pag->block_count,
> + &pag->agino_min, &pag->agino_max);
> return 0;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 6258527315f2..d68b53334990 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -108,22 +108,36 @@ xfs_inobt_rec_freecount(
> return hweight64(realfree);
> }
>
> +/* Compute the highest allocated inode in an incore inode record. */
> +static xfs_agino_t
> +xfs_inobt_rec_highino(
> + const struct xfs_inobt_rec_incore *irec)
> +{
> + if (xfs_inobt_issparse(irec->ir_holemask))
> + return xfs_highbit64(xfs_inobt_irec_to_allocmask(irec));
> + return XFS_INODES_PER_CHUNK;
> +}
> +
> /* Simple checks for inode records. */
> xfs_failaddr_t
> xfs_inobt_check_irec(
> struct xfs_perag *pag,
> const struct xfs_inobt_rec_incore *irec)
> {
> + xfs_agino_t high_ino = xfs_inobt_rec_highino(irec);
> +
> /* Record has to be properly aligned within the AG. */
> if (!xfs_verify_agino(pag, irec->ir_startino))
> return __this_address;
> - if (!xfs_verify_agino(pag,
> - irec->ir_startino + XFS_INODES_PER_CHUNK - 1))
> +
> + if (!xfs_verify_agino(pag, irec->ir_startino + high_ino - 1))
> return __this_address;
> +
> if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
> irec->ir_count > XFS_INODES_PER_CHUNK)
> return __this_address;
> - if (irec->ir_freecount > XFS_INODES_PER_CHUNK)
> +
> + if (irec->ir_freecount > irec->ir_count)
> return __this_address;
>
> if (xfs_inobt_rec_freecount(irec) != irec->ir_freecount)
> --
> 2.45.2
>
>
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-24 13:20 ` [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Brian Foster
@ 2024-10-25 0:48 ` Dave Chinner
2024-10-25 19:33 ` Brian Foster
0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2024-10-25 0:48 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 09:20:39AM -0400, Brian Foster wrote:
> On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> > We have had a large number of recent reports about cloud filesystems
> > with "corrupt" inode records recently. They are all the same, and
> > feature a filesystem that has been grown from a small size to a
> > larger size (to 30G or 50G). In all cases, they have a very small
> > runt AG at the end of the filesystem. In the case of the 30GB
> > filesystems, this is 1031 blocks long.
> >
> > These filesystems start issuing corruption warnings when trying to
> > allocate an inode in a sparse cluster at block 1024 of the runt AG. At
> > this point, there shouldn't be a sparse inode cluster because there
> > isn't space to fit an entire inode chunk (8 blocks) at block 1024.
> > i.e. it is only 7 blocks from the end of the AG.
> >
> > Hence the first bug is that we allowed allocation of a sparse inode
> > cluster in this location when it should not have occurred. The first
> > patch in the series addresses this.
> >
> > However, there is actually nothing corrupt in the on-disk sparse
> > inode record or inode cluster at agbno 1024. It is a 32 inode
> > cluster, which means it is 4 blocks in length, so sits entirely
> > within the AG and every inode in the record is addressable and
> > accessible. The only thing we can't do is make the sparse inode
> > record whole - the inode allocation code cannot allocate another 4
> > blocks that span beyond the end of the AG. Hence this inode record
> > and cluster remain sparse until all the inodes in it are freed and
> > the cluster removed from disk.
> >
> > The second bug is that we don't consider inodes beyond inode cluster
> > alignment at the end of an AG as being valid. When sparse inode
> > alignment is in use, we set the in-memory inode cluster alignment to
> > match the inode chunk alignment, and so the maximum valid inode
> > number is inode chunk aligned, not inode cluster aligned. Hence when
> > we have an inode cluster at the end of the AG - so the max inode
> > number is cluster aligned - we reject that entire cluster as being
> > invalid.
> >
> > As stated above, there is nothing corrupt about the sparse inode
> > cluster at the end of the AG, it just doesn't match an arbitrary
> > alignment validation restriction for inodes at the end of the AG.
> > Given we have production filesystems out there with sparse inode
> > clusters allocated with cluster alignment at the end of the AG, we
> > need to consider these inodes as valid and not error out with a
> > corruption report. The second patch addresses this.
> >
> > The third issue I found is that we never validate the
> > sb->sb_spino_align value when we mount the filesystem. It could have
> > any value and we just blindly use it when calculating inode
> > allocation geometry. The third patch adds sb->sb_spino_align range
> > validation to the superblock verifier.
> >
> > There is one question that needs to be resolved in this patchset: if
> > we take patch 2 to allow sparse inodes at the end of the AG, why
> > would we need the change in patch 1? Indeed, at this point I have to
> > ask why we even need the min/max agbno guidelines for inode chunk
> > allocation, as we end up allowing any aligned location in the AG to
> > be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> > unnecessary and now we can remove a bunch of code (min/max_agbno
> > constraints) from the allocator paths...
> >
> > I'd prefer that we take the latter path: ignore the first patch.
> > This results in more flexible behaviour, allows existing filesystems
> > with this issue to work without needing xfs_repair to fix them, and
> > we get to remove complexity from the code.
> >
> > Thoughts?
> >
>
> This all sounds reasonable on its own if the corruption is essentially
> artificial and there is a path for code simplification, etc. That said, I
> think there's a potential counter argument for skipping patch 1 though.
> A couple things come to mind:
>
> 1. When this corrupted inode chunk allocation does occur, is it possible
> to actually allocate an inode out of it, or does the error checking
> logic prevent that?
The error checking during finobt lookup prevents inode allocation.
i.e. it fails after finobt lookup via the path:
xfs_dialloc_ag_finobt_newino()
xfs_inobt_get_rec()
xfs_inobt_check_irec()
if (!xfs_verify_agino(pag, irec->ir_startino))
return __this_address;
Before any modifications are made. Hence the transaction is clean
when cancelled, and the error is propagated cleanly back out to
userspace.
> 2. Would we recommend a user upgrade to a new kernel with a corruption
> present that causes inode allocation failure?
I'm not making any sort of recommendations on how downstream distros
handle system recovery from this issue. System recovery once the
problem has manifested is a separate problem - what we are concerned
about here is modifying the kernel code to:
a) prevent the issue from happening again; and
b) ensure that existing filesystems with this latent issue on disk
don't throw corruption errors in future.
How we get that kernel onto downstream distro systems is largely a
distro support issue. Downstream distros can:
- run the gauntlet and upgrade in place, relying on the
transactional upgrade behaviour of the package manager to handle
rollbacks when file creates fail during installation; or
- grow the cloud block device and filesystem to put the inode
cluster wholly within the bounds of the AG and so pass the
alignment checks without any kernel modifications, then do the
kernel upgrade; or
- live patch the running kernel to allow the sparse inode cluster to
be considered valid (as per the second patch in this series)
so it won't ever fail whilst installing the kernel upgrade; or
- do some xfs_db magic to remove the finobt record that is
problematic and leak the sparse inode cluster so we never try to
allocate inodes from it, then do the kernel upgrade and run
xfs_repair; or
- do some xfs_db magic to remove the finobt record and shrink the
filesystem down a few blocks to inode chunk align it.
- something else...
IOWs, there are lots of ways that the downstream distros can
mitigate the problem sufficiently to install an updated kernel that
won't have the problem ever again.
> My .02: under no circumstances would I run a distro/package upgrade on a
> filesystem in that state before running repair, nor would I recommend
> that to anyone else.
Keep in mind that preventing distro/package managers from running in
this situation also rules out shipping a new xfsprogs package to address
the issue. i.e. if the distro won't allow kernel package upgrades,
then they won't allow xfsprogs package upgrades, either.
There aren't any filesystem level problems with allowing operation
to continue on the filesystem with a sparse inode chunk like this in
the runt AG. Yes, file create will fail with an error every so
often, but there aren't any issues beyond that. The "corruption"
can't propagate, it won't shut down the filesystem, and errors are
returned to userspace when it is hit. There is no danger to
other filesystem metadata or user data from this issue.
Hence I don't see any issues with simply installing new packages and
rebooting to make the problem go away...
> The caveat to this is that even after a repair,
> there's no guarantee an upgrade wouldn't go and realloc the same bad
> chunk and end up right back in the same state, and thus fail just the
> same.
Sure, but this can be said about every single sparse inode enabled
filesystem that has an unaligned end of the runt AG whether the
problem has manifested or not. There are going to be *lots* of
filesystems out there with this potential problem just waiting to be
tripped over.
i.e. the scope of this latent issue has the potential to affect a
very large number of filesystems. Hence saying that we can't
upgrade -anything- on <some large subset> of sparse inode enabled
filesystems because they *might* fail with this problem doesn't make
a whole lot of sense to me....
> For example, I assume allocating the last handful of blocks out of the
> runt AG would prevent the problem. Of course that technically creates
> another corruption by leaking blocks, but as long as repair knows to keep
> it in place so long as the fs geometry is susceptible, perhaps that
> would work..?
I think that would become an implicit change of on-disk format.
scrub would need to know about it, as would online repair. growfs
will need to handle the case explicitly (is the trailing runt space
on this filesystem leaked or not?), and so on. I don't see this as a
viable solution.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-24 16:38 ` Darrick J. Wong
@ 2024-10-25 6:29 ` Dave Chinner
0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2024-10-25 6:29 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 09:38:04AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> > I'd prefer that we take the latter path: ignore the first patch.
> > This results in more flexible behaviour, allows existing filesystems
> > with this issue to work without needing xfs_repair to fix them, and
> > we get to remove complexity from the code.
>
> Do xfs_repair/scrub trip over these sparse chunks that cross EOAG,
> or are they ok?
Repair trips over them and removes them (because they
fail alignment checks). I haven't looked to see if it needs anything
more than to consider the alignment of these sparse chunks as being
ok. I suspect the same for scrub - as long as the alignment checks
out, it shouldn't need to fail the cluster...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/3] xfs: sb_spino_align is not verified
2024-10-24 16:55 ` Darrick J. Wong
@ 2024-10-25 6:33 ` Dave Chinner
0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2024-10-25 6:33 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 09:55:44AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 24, 2024 at 01:51:05PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > It's just read in from the superblock and used without doing any
> > validity checks at all on the value.
> >
> > Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> Cc: <stable@vger.kernel.org> # v4.2
Yeah. And probably whatever fix we decide on, too.
> Oof yeah that's quite a gap!
*nod*
What surprises me is that syzbot hasn't found this - it's exactly
the sort of thing that randomised structure fuzzing is supposed to
find.....
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Thanks!
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs
2024-10-24 17:00 ` Darrick J. Wong
@ 2024-10-25 6:43 ` Dave Chinner
2024-10-25 22:19 ` Darrick J. Wong
0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2024-10-25 6:43 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Due to the failure to correctly limit sparse inode chunk allocation
> > in runt AGs, we now have many production filesystems with sparse
> > inode chunks allocated across the end of the runt AG. xfs_repair
> > or a growfs is needed to fix this situation, neither of which are
> > particularly appealing.
> >
> > The on disk layout from the metadump shows AG 12 as a runt that is
> > 1031 blocks in length and the last inode chunk allocated on disk at
> > agino 8192.
>
> Does this problem also happen on non-runt AGs?
No. The highest agbno an inode chunk can be allocated at in a full
size AG is aligned by rounding down from sb_agblocks. Hence
sb_agblocks can be unaligned and nothing will go wrong. The problem
is purely that the runt AG is shorter than sb_agblocks, and so
this highest agbno allocation guard is set beyond the end of the
AG...
> If the only free space
> that could be turned into a sparse cluster is unaligned space at the
> end of AG 0, would you still get the same corruption error?
It will only happen if AG 0 is a runt AG, and then the same error
would occur. We don't currently allow single AG filesystems, nor
when they are set up do we create them as a runt - they are always
full size. So current single AG filesystems made by mkfs won't have
this problem.
That said, the proposed single AG cloud image filesystems that set
AG 0 up as a runt (i.e. dblocks smaller than sb_agblocks) to allow
the AG 0 size to grow along with the size of the filesystem could
definitely have this problem. i.e. sb_dblocks needs to be inode
chunk aligned in this sort of setup, or these filesystems need to
be restricted to fixed kernels....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-25 0:48 ` Dave Chinner
@ 2024-10-25 19:33 ` Brian Foster
0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2024-10-25 19:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Fri, Oct 25, 2024 at 11:48:15AM +1100, Dave Chinner wrote:
> On Thu, Oct 24, 2024 at 09:20:39AM -0400, Brian Foster wrote:
> > On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> > > [ cover letter snipped; quoted in full earlier in the thread ]
> >
> > This all sounds reasonable on its own if the corruption is essentially
> > artificial and there is a path for code simplification, etc. That said, I
> > think there's a potential counter argument for skipping patch 1 though.
> > A couple things come to mind:
> >
> > 1. When this corrupted inode chunk allocation does occur, is it possible
> > to actually allocate an inode out of it, or does the error checking
> > logic prevent that?
>
> The error checking during finobt lookup prevents inode allocation.
> i.e. it fails after finobt lookup via the path:
>
> xfs_dialloc_ag_finobt_newino()
> xfs_inobt_get_rec()
> xfs_inobt_check_irec()
> if (!xfs_verify_agino(pag, irec->ir_startino))
> return __this_address;
>
> Before any modifications are made. Hence the transaction is clean
> when cancelled, and the error is propagated cleanly back out to
> userspace.
>
Ok, so data is not at risk. That is good to know.
> > 2. Would we recommend a user upgrade to a new kernel with a corruption
> > present that causes inode allocation failure?
>
> I'm not making any sort of recommendations on how downstream distros
> handle system recovery from this issue. System recovery once the
> problem has manifested is a separate problem - what we are concerned
> about here is modifying the kernel code to:
>
> a) prevent the issue from happening again; and
> b) ensure that existing filesystems with this latent issue on disk
> don't throw corruption errors in future.
>
> How we get that kernel onto downstream distro systems is largely a
> distro support issue. Downstream distros can:
>
> - run the gauntlet and upgrade in place, relying on the
> transactional upgrade behaviour of the package manager to handle
> rollbacks when file creates fail during installation; or
> - grow the cloud block device and filesystem to put the inode
> cluster wholly within the bounds of the AG and so pass the
> alignment checks without any kernel modifications, then do the
> kernel upgrade; or
> - live patch the running kernel to allow the sparse inode cluster to
> be considered valid (as per the second patch in this series)
> so it won't ever fail whilst installing the kernel upgrade; or
> - do some xfs_db magic to remove the finobt record that is
> problematic and leak the sparse inode cluster so we never try to
> allocate inodes from it, then do the kernel upgrade and run
> xfs_repair; or
> - do some xfs_db magic to remove the finobt record and shrink the
> filesystem down a few blocks to inode chunk align it.
> - something else...
>
> IOWs, there are lots of ways that the downstream distros can
> mitigate the problem sufficiently to install an updated kernel that
> won't have the problem ever again.
>
IOW, we agree that something might be necessary on the userspace side
to fully support users/distros out of this problem.
Given the last couple items on the list don't share some of the
constraints or external dependencies of the others, and pretty much
restate the couple of ideas proposed in my previous mail, they seem like
the most broadly useful options to me.
> > My .02: under no circumstances would I run a distro/package upgrade on a
> > filesystem in that state before running repair, nor would I recommend
> > that to anyone else.
>
> Keep in mind that preventing distro/package managers from running in
> this situation also rules out shipping a new xfsprogs package to address
> the issue. i.e. if the distro won't allow kernel package upgrades,
> then they won't allow xfsprogs package upgrades, either.
>
I don't know why you would expect otherwise. I assume this would require
a boot/recovery image or emergency boot mode with a binary external
to the filesystem that needs repair.
> There aren't any filesystem level problems with allowing operation
> to continue on the filesystem with a sparse inode chunk like this in
> the runt AG. Yes, file create will fail with an error every so
> often, but there aren't any issues beyond that. The "corruption"
> can't propagate, it won't shut down the filesystem, and errors are
> returned to userspace when it is hit. There is no danger to
> other filesystem metadata or user data from this issue.
>
> Hence I don't see any issues with simply installing new packages and
> rebooting to make the problem go away...
>
Nor do I, so long as the upgrade doesn't fail and cancel out on inode
allocation failures.
> > The caveat to this is that even after a repair,
> > there's no guarantee an upgrade wouldn't go and realloc the same bad
> > chunk and end up right back in the same state, and thus fail just the
> > same.
>
> Sure, but this can be said about every single sparse inode enabled
> filesystem that has an unaligned end of the runt AG whether the
> problem has manifested or not. There are going to be *lots* of
> filesystems out there with this potential problem just waiting to be
> tripped over.
>
> i.e. the scope of this latent issue has the potential to affect a
> very large number of filesystems. Hence saying that we can't
> upgrade -anything- on <some large subset> of sparse inode enabled
> filesystems because they *might* fail with this problem doesn't make
> a whole lot of sense to me....
>
That doesn't make sense to me either. I'd just upgrade in that case. If
the problem hasn't manifested already, it seems unlikely to occur before
the fixed kernel is installed.
> > For example, I assume allocating the last handful of blocks out of the
> > runt AG would prevent the problem. Of course that technically creates
> > another corruption by leaking blocks, but as long as repair knows to keep
> > it in place so long as the fs geometry is susceptible, perhaps that
> > would work..?
>
> I think that would become an implicit change of on-disk format.
> scrub would need to know about it, as would online repair. growfs
> will need to handle the case explicitly (is the trailing runt space
> on this filesystem leaked or not?), and so on. I don't see this as a
> viable solution.
>
Then just make it stateless and leak the blocks as a transient means to
facilitate upgrade and recovery, as suggested above.
Brian
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs
2024-10-25 6:43 ` Dave Chinner
@ 2024-10-25 22:19 ` Darrick J. Wong
2024-10-26 21:47 ` Dave Chinner
0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2024-10-25 22:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Fri, Oct 25, 2024 at 05:43:41PM +1100, Dave Chinner wrote:
> On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Due to the failure to correctly limit sparse inode chunk allocation
> > > in runt AGs, we now have many production filesystems with sparse
> > > inode chunks allocated across the end of the runt AG. xfs_repair
> > > or a growfs is needed to fix this situation, neither of which are
> > > particularly appealing.
> > >
> > > The on disk layout from the metadump shows AG 12 as a runt that is
> > > 1031 blocks in length and the last inode chunk allocated on disk at
> > > agino 8192.
> >
> > Does this problem also happen on non-runt AGs?
>
> No. The highest agbno an inode chunk can be allocated at in a full
> size AG is aligned by rounding down from sb_agblocks. Hence
> sb_agblocks can be unaligned and nothing will go wrong. The problem
> is purely that the runt AG is shorter than sb_agblocks, and so
> this highest agbno allocation guard is set beyond the end of the
> AG...
Ah, right, and we don't want sparse inode chunks to cross EOAG because
then you'd have a chunk whose clusters would cross into the next AG, at
least in the linear LBA space. That's why (for sparse inode fses) it
makes sense that we want to round last_agino down by the chunk for
non-last AGs, and round it down by only the cluster for the last AG.
Waitaminute, what if the last AG is less than a chunk but more than a
cluster's worth of blocks short of sb_agblocks? Or what if sb_agblocks
doesn't align with a chunk boundary? I think the new code:
	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
		end_align = mp->m_sb.sb_spino_align;
	else
		end_align = M_IGEO(mp)->cluster_align;
	bno = round_down(eoag, end_align);
	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
will allow a sparse chunk that (erroneously) crosses sb_agblocks, right?
Let's say sb_spino_align == 4, sb_inoalignmt == 8, sb_agcount == 2,
sb_agblocks == 100,007, and sb_dblocks == 200,014.
For AG 0, eoag is 100007, end_align == cluster_align == 8, so bno is
rounded down to 100000. *last is thus set to the inode at the end of
block 99999.
For AG 1, eoag is also 100007, but now end_align == 4. bno is rounded
down to 100,004. *last is set to the inode at the end of block 100003,
not 99999.
But now let's say we growfs another 100007 blocks onto the filesystem.
Now we have 3x AGs, each with 100007 blocks. But now *last for AG 1
becomes 99999 even though we might've allocated an inode in block
100000 before the growfs. That will cause a corruption error too,
right?
IOWs, don't we want something more like this?
	/*
	 * The preferred inode cluster allocation size cannot ever cross
	 * sb_agblocks. cluster_align is one of the following:
	 *
	 * - For sparse inodes, this is an inode chunk.
	 * - For aligned non-sparse inodes, this is an inode cluster.
	 */
	bno = round_down(sb_agblocks, cluster_align);
	if (xfs_has_sparseinodes(mp) &&
	    agno == mp->m_sb.sb_agcount - 1) {
		/*
		 * For a filesystem with sparse inodes, an inode chunk
		 * still cannot cross sb_agblocks, but it can cross eoag
		 * if eoag < agblocks. Inode clusters cannot cross eoag.
		 */
		last_clus_bno = round_down(eoag, sb_spino_align);
		bno = min(bno, last_clus_bno);
	}
	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
This preserves the invariant that inode chunks cannot span sb_agblocks,
while permitting sparse clusters going right up to EOAG so long as the
chunk doesn't cross sb_agblocks.
> > If the only free space
> > that could be turned into a sparse cluster is unaligned space at the
> > end of AG 0, would you still get the same corruption error?
>
> It will only happen if AG 0 is a runt AG, and then the same error
> would occur. We don't currently allow single AG filesystems, nor
> when they are set up do we create them as a runt - the are always
> full size. So current single AG filesystems made by mkfs won't have
> this problem.
Hmm, do you have a quick means to simulate this last-AG unaligned
icluster situation?
> That said, the proposed single AG cloud image filesystems that set
> AG 0 up as a runt (i.e. dblocks smaller than sb_agblocks) to allow
> the AG 0 size to grow along with the size of the filesystem could
> definitely have this problem. i.e. sb_dblocks needs to be inode
> chunk aligned in this sort of setup, or these filesystems need to
> be restricted to fixed kernels....
I wonder if we *should* have a compat flag for these cloud filesystems
just as a warning sign to us all. :)
--D
> -Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs
2024-10-25 22:19 ` Darrick J. Wong
@ 2024-10-26 21:47 ` Dave Chinner
0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2024-10-26 21:47 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Fri, Oct 25, 2024 at 03:19:19PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 25, 2024 at 05:43:41PM +1100, Dave Chinner wrote:
> > On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Due to the failure to correctly limit sparse inode chunk allocation
> > > > in runt AGs, we now have many production filesystems with sparse
> > > > inode chunks allocated across the end of the runt AG. xfs_repair
> > > > or a growfs is needed to fix this situation, neither of which are
> > > > particularly appealing.
> > > >
> > > > The on disk layout from the metadump shows AG 12 as a runt that is
> > > > 1031 blocks in length and the last inode chunk allocated on disk at
> > > > agino 8192.
> > >
> > > Does this problem also happen on non-runt AGs?
> >
> > No. The highest agbno an inode chunk can be allocated at in a full
> > size AG is aligned by rounding down from sb_agblocks. Hence
> > sb_agblocks can be unaligned and nothing will go wrong. The problem
> > is purely that the runt AG is shorter than sb_agblocks, and so
> > this highest agbno allocation guard is set beyond the end of the
> > AG...
>
> Ah, right, and we don't want sparse inode chunks to cross EOAG because
> then you'd have a chunk whose clusters would cross into the next AG, at
> least in the linear LBA space. That's why (for sparse inode fses) it
> makes sense that we want to round last_agino down by the chunk for
> non-last AGs, and round it down by only the cluster for the last AG.
>
> Waitaminute, what if the last AG is less than a chunk but more than a
> cluster's worth of blocks short of sb_agblocks? Or what if sb_agblocks
> doesn't align with a chunk boundary? I think the new code:
>
> 	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
> 		end_align = mp->m_sb.sb_spino_align;
> 	else
> 		end_align = M_IGEO(mp)->cluster_align;
> 	bno = round_down(eoag, end_align);
> 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
>
> will allow a sparse chunk that (erroneously) crosses sb_agblocks, right?
> Let's say sb_spino_align == 4, sb_inoalignmt == 8, sb_agcount == 2,
> sb_agblocks == 100,007, and sb_dblocks == 200,014.
>
> For AG 0, eoag is 100007, end_align == cluster_align == 8, so bno is
> rounded down to 100000. *last is thus set to the inode at the end of
> block 99999.
>
> For AG 1, eoag is also 100007, but now end_align == 4. bno is rounded
> down to 100,004. *last is set to the inode at the end of block 100003,
> not 99999.
>
> But now let's say we growfs another 100007 blocks onto the filesystem.
> Now we have 3x AGs, each with 100007 blocks. But now *last for AG 1
> becomes 99999 even though we might've allocated an inode in block
> 100000 before the growfs. That will cause a corruption error too,
> right?
Yes, I overlooked that case. Good catch.
> IOWs, don't we want something more like this?
>
> 	/*
> 	 * The preferred inode cluster allocation size cannot ever cross
> 	 * sb_agblocks. cluster_align is one of the following:
> 	 *
> 	 * - For sparse inodes, this is an inode chunk.
> 	 * - For aligned non-sparse inodes, this is an inode cluster.
> 	 */
> 	bno = round_down(sb_agblocks, cluster_align);
> 	if (xfs_has_sparseinodes(mp) &&
> 	    agno == mp->m_sb.sb_agcount - 1) {
> 		/*
> 		 * For a filesystem with sparse inodes, an inode chunk
> 		 * still cannot cross sb_agblocks, but it can cross eoag
> 		 * if eoag < agblocks. Inode clusters cannot cross eoag.
> 		 */
> 		last_clus_bno = round_down(eoag, sb_spino_align);
> 		bno = min(bno, last_clus_bno);
> 	}
> 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
Yes, something like that is needed.
> > > If the only free space
> > > that could be turned into a sparse cluster is unaligned space at the
> > > end of AG 0, would you still get the same corruption error?
> >
> > It will only happen if AG 0 is a runt AG, and then the same error
> > would occur. We don't currently allow single AG filesystems, nor
> > when they are set up do we create them as a runt - they are always
> > full size. So current single AG filesystems made by mkfs won't have
> > this problem.
>
> Hmm, do you have a quick means to simulate this last-AG unaligned
> icluster situation?
No, I haven't been able to reproduce it on demand - nothing I've
tried has specifically landed a sparse inode cluster in exactly the
right position to trigger this. I typically get ENOSPC when I think
it should trigger, and it's not immediately obvious what I'm missing
in the way of pre-conditions to trigger it. I've been able to test the
fixes on a metadump that has the sparse chunk already on disk
(which came from one of the production systems hitting this).
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
` (4 preceding siblings ...)
2024-10-24 16:38 ` Darrick J. Wong
@ 2024-10-29 16:14 ` Eric Sandeen
2024-10-31 11:44 ` Carlos Maiolino
5 siblings, 1 reply; 23+ messages in thread
From: Eric Sandeen @ 2024-10-29 16:14 UTC (permalink / raw)
To: Dave Chinner, linux-xfs
On 10/23/24 9:51 PM, Dave Chinner wrote:
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...
>
> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.
>
> Thoughts?
For some reason I'm struggling to grasp some of the details here, so
maybe I can just throw out a "what I think should happen" type response.
A concern is that older xfs_repair binaries will continue to see
inodes in this region as corrupt, and throw them away, IIUC - even
if the kernel is updated to handle them properly.
Older xfs_repair could be encountered on rescue CDs/images, maybe
even in initramfs environments, by virt hosts managing guest filesystems,
etc.
So it seems to me that it would be worth it to prevent any new inode
allocations in this region going forward, even if we *can* make it work,
so that we won't continue to generate what looks like corruption to older
userspace.
That might not be the most "pure" upstream approach, but as a practical
matter I think it might be a better outcome for users and support
orgs... even if distros update kernels & userspace together, that does
not necessarily prevent older userspace from encountering a filesystem
with inodes in this range and trashing them.
Thanks,
-Eric
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-29 16:14 ` Eric Sandeen
@ 2024-10-31 11:44 ` Carlos Maiolino
2024-10-31 20:45 ` Eric Sandeen
2024-10-31 22:13 ` Darrick J. Wong
0 siblings, 2 replies; 23+ messages in thread
From: Carlos Maiolino @ 2024-10-31 11:44 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Dave Chinner, linux-xfs
On Tue, Oct 29, 2024 at 11:14:18AM -0500, Eric Sandeen wrote:
> On 10/23/24 9:51 PM, Dave Chinner wrote:
> > There is one question that needs to be resolved in this patchset: if
> > we take patch 2 to allow sparse inodes at the end of the AG, why
> > would we need the change in patch 1? Indeed, at this point I have to
> > ask why we even need the min/max agbno guidelines to the inode chunk
> > allocation as we end up allowing any aligned location in the AG to
> > be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> > unnecessary and now we can remove a bunch of code (min/max_agbno
> > constraints) from the allocator paths...
> >
> > I'd prefer that we take the latter path: ignore the first patch.
> > This results in more flexible behaviour, allows existing filesystems
> > with this issue to work without needing xfs_repair to fix them, and
> > we get to remove complexity from the code.
> >
> > Thoughts?
>
> For some reason I'm struggling to grasp some of the details here, so
> maybe I can just throw out a "what I think should happen" type response.
>
> A concern is that older xfs_repair binaries will continue to see
> inodes in this region as corrupt, and throw them away, IIUC - even
> if the kernel is updated to handle them properly.
>
> Older xfs_repair could be encountered on rescue CDs/images, maybe
> even in initramfs environments, by virt hosts managing guest filesystems,
> etc.
>
> So it seems to me that it would be worth it to prevent any new inode
> allocations in this region going forward, even if we *can* make it work,
> so that we won't continue to generate what looks like corruption to older
> userspace.
>
> That might not be the most "pure" upstream approach, but as a practical
> matter I think it might be a better outcome for users and support
> orgs... even if distros update kernels & userspace together, that does
> not necessarily prevent older userspace from encountering a filesystem
> with inodes in this range and trashing them.
>
I'm inclined to agree with Eric here, as preventing sparse inodes from being
allocated at the edge of the runt AG sounds like the most reasonable approach to me.
It just seems to me like yet another corner case to deal with for very little
benefit, i.e. to enable a few extra inodes, on an FS that seems to be on life
support regarding space for new inodes, whether it's a distro or upstream kernel.
It kind of seems risky to me to allow users to run a new kernel, allocate inodes
there, and fill those inodes with data, just to then run a not-yet-ready xfs_repair
and have it discard everything there. It just seems like a possible data loss vector.
Unless - and I'm not sure how reasonable it is - we first release a new
xfsprogs, preventing xfs_repair from ripping out those inodes, and later update
the kernel. But this will end up with users hitting -EFSCORRUPTED on every
attempt to allocate inodes from the FS edge.
How feasible would it be to first prevent inodes from being allocated at the
runt AG's edge, let that sink in for a while, and then, once a fixed xfs_repair
has been available for some time, enable inode allocation on the edge, giving
users enough time to pick up a newer xfs_repair?
Again, I'm not sure if it makes sense at all; hopefully it does.
Carlos
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-31 11:44 ` Carlos Maiolino
@ 2024-10-31 20:45 ` Eric Sandeen
2024-10-31 22:13 ` Darrick J. Wong
1 sibling, 0 replies; 23+ messages in thread
From: Eric Sandeen @ 2024-10-31 20:45 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Dave Chinner, linux-xfs
On 10/31/24 6:44 AM, Carlos Maiolino wrote:
> On Tue, Oct 29, 2024 at 11:14:18AM -0500, Eric Sandeen wrote:
>> On 10/23/24 9:51 PM, Dave Chinner wrote:
>>> There is one question that needs to be resolved in this patchset: if
>>> we take patch 2 to allow sparse inodes at the end of the AG, why
>>> would we need the change in patch 1? Indeed, at this point I have to
>>> ask why we even need the min/max agbno guidelines to the inode chunk
>>> allocation as we end up allowing any aligned location in the AG to
>>> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
>>> unnecessary and now we can remove a bunch of code (min/max_agbno
>>> constraints) from the allocator paths...
>>>
>>> I'd prefer that we take the latter path: ignore the first patch.
>>> This results in more flexible behaviour, allows existing filesystems
>>> with this issue to work without needing xfs_repair to fix them, and
>>> we get to remove complexity from the code.
>>>
>>> Thoughts?
>>
>> For some reason I'm struggling to grasp some of the details here, so
>> maybe I can just throw out a "what I think should happen" type response.
>>
>> A concern is that older xfs_repair binaries will continue to see
>> inodes in this region as corrupt, and throw them away, IIUC - even
>> if the kernel is updated to handle them properly.
>>
>> Older xfs_repair could be encountered on rescue CDs/images, maybe
>> even in initramfs environments, by virt hosts managing guest filesystems,
>> etc.
>>
>> So it seems to me that it would be worth it to prevent any new inode
>> allocations in this region going forward, even if we *can* make it work,
>> so that we won't continue to generate what looks like corruption to older
>> userspace.
>>
>> That might not be the most "pure" upstream approach, but as a practical
>> matter I think it might be a better outcome for users and support
>> orgs... even if distros update kernels & userspace together, that does
>> not necessarily prevent older userspace from encountering a filesystem
>> with inodes in this range and trashing them.
>>
>
> I'm inclined to agree with Eric here as preventing the sparse inodes to be
> allocated at the edge of the runt AG sounds the most reasonable approach to me.
>
> It just seems to me yet another corner case to deal with for very little benefit,
> i.e. to enable a few extra inodes, on an FS that seems to be on life support
> regarding space for new inodes, whether it's a distro kernel or upstream kernel.
>
> It kind of seems risky to me, to allow users to run a new kernel, allocate inodes
> there, fill those inodes with data, just to run a not yet ready xfs_repair, and
> discard everything there. Just seems like a possible data loss vector.
>
> Unless - and I'm not sure how reasonable it is -, we first release a new
> xfsprogs, preventing xfs_repair from ripping out those inodes, and later update the
> kernel. But this will end up on users hitting a -EFSCORRUPTED every attempt to
> allocate inodes from the FS edge.
>
> How feasible would it be to first prevent inodes from being allocated at the runt AG's
> edge, let it sink for a while, and once we have a fixed xfs_repair for some
> time, we then enable inode allocation on the edge, giving enough time for users
> to have a newer xfs_repair?
>
> Again, I'm not sure if it makes sense at all; hopefully it does.
I think Dave agrees with all this too, and I may have simply misunderstood
the proposal.
Paraphrasing a side convo w/ Dave, it seems to make sense to have 3 steps for
the fix:
1) Stop allowing inode allocations in these blocks (kernel)
2) Treat already-allocated inodes in these blocks as valid (kernel+userspace)
3) Re-enable inode allocations to these blocks (kernel)
Distros can pick up 1) and 2), and skip 3) to avoid problems with
older userspace, if that seems prudent.
I guess I still worry a little about upstream/non-distro use after applying 3)
but it's the technically correct path forward.
-Eric
* Re: [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
2024-10-31 11:44 ` Carlos Maiolino
2024-10-31 20:45 ` Eric Sandeen
@ 2024-10-31 22:13 ` Darrick J. Wong
1 sibling, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2024-10-31 22:13 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Eric Sandeen, Dave Chinner, linux-xfs
On Thu, Oct 31, 2024 at 12:44:12PM +0100, Carlos Maiolino wrote:
> On Tue, Oct 29, 2024 at 11:14:18AM -0500, Eric Sandeen wrote:
> > On 10/23/24 9:51 PM, Dave Chinner wrote:
> > > There is one question that needs to be resolved in this patchset: if
> > > we take patch 2 to allow sparse inodes at the end of the AG, why
> > > would we need the change in patch 1? Indeed, at this point I have to
> > > ask why we even need the min/max agbno guidelines to the inode chunk
> > > allocation as we end up allowing any aligned location in the AG to
> > > be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> > > unnecessary and now we can remove a bunch of code (min/max_agbno
> > > constraints) from the allocator paths...
> > >
> > > I'd prefer that we take the latter path: ignore the first patch.
> > > This results in more flexible behaviour, allows existing filesystems
> > > with this issue to work without needing xfs_repair to fix them, and
> > > we get to remove complexity from the code.
> > >
> > > Thoughts?
> >
> > For some reason I'm struggling to grasp some of the details here, so
> > maybe I can just throw out a "what I think should happen" type response.
> >
> > A concern is that older xfs_repair binaries will continue to see
> > inodes in this region as corrupt, and throw them away, IIUC - even
> > if the kernel is updated to handle them properly.
> >
> > Older xfs_repair could be encountered on rescue CDs/images, maybe
> > even in initramfs environments, by virt hosts managing guest filesystems,
> > etc.
> >
> > So it seems to me that it would be worth it to prevent any new inode
> > allocations in this region going forward, even if we *can* make it work,
> > so that we won't continue to generate what looks like corruption to older
> > userspace.
> >
> > That might not be the most "pure" upstream approach, but as a practical
> > matter I think it might be a better outcome for users and support
> > orgs... even if distros update kernels & userspace together, that does
> > not necessarily prevent older userspace from encountering a filesystem
> > with inodes in this range and trashing them.
> >
>
> I'm inclined to agree with Eric here as preventing the sparse inodes to be
> allocated at the edge of the runt AG sounds the most reasonable approach to me.
> It just seems to me yet another corner case to deal with for very little benefit,
> i.e. to enable a few extra inodes, on an FS that seems to be on life support
> regarding space for new inodes, whether it's a distro kernel or upstream kernel.
>
> It kind of seems risky to me, to allow users to run a new kernel, allocate inodes
> there, fill those inodes with data, just to run a not yet ready xfs_repair, and
> discard everything there. Just seems like a possible data loss vector.
I agree. I think we have to fix the fsck/validation code to allow
inode clusters that go right up to EOAG on a runt AG because current
kernels write out filesystems that way. I also think we have to take
the other patch that prevents the inode allocator from creating a chunk
that crosses EOAG so that unpatched xfs_repairs won't trip over newer
filesystems.
> Unless - and I'm not sure how reasonable it is -, we first release a new
> xfsprogs, preventing xfs_repair from ripping out those inodes, and later update the
> kernel. But this will end up on users hitting a -EFSCORRUPTED every attempt to
> allocate inodes from the FS edge.
>
> How feasible would it be to first prevent inodes from being allocated at the runt AG's
> edge, let it sink for a while, and once we have a fixed xfs_repair for some
> time, we then enable inode allocation on the edge, giving enough time for users
> to have a newer xfs_repair?
>
> Again, I'm not sure if it makes sense at all; hopefully it does.
I don't think we can ever re-enable inode allocation on the edge no
matter how much time goes by.
--D
>
> Carlos
>
* Re: [PATCH 3/3] xfs: sb_spino_align is not verified
2024-10-24 2:51 ` [PATCH 3/3] xfs: sb_spino_align is not verified Dave Chinner
2024-10-24 16:55 ` Darrick J. Wong
@ 2024-12-07 0:25 ` Luis Chamberlain
2024-12-07 0:32 ` Darrick J. Wong
1 sibling, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2024-12-07 0:25 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, Darrick J. Wong, Eric Sandeen, Ritesh Harjani,
Carlos Maiolino, Brian Foster, Pankaj Raghav, Daniel Gomez,
Kundan Kumar, gost.dev
On Thu, Oct 24, 2024 at 01:51:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> It's just read in from the superblock and used without doing any
> validity checks at all on the value.
>
> Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
This is 59e43f5479cce106d71c0b91a297c7ad1913176c on v6.13-rc1 now.
This commit broke mounting 32k and 64k bs filesystems on 4k page size systems.
Oddly, it does not break 16k or 8k bs. I took a quick glance and I can't
easily identify a fix.
I haven't had a chance yet to find a large page size system to see if
32k page size and 64k page size systems are affected as well.
The CIs in place did not pick up on this because the fstests check script
just bails out right away; we don't annotate this as a failure in fstests,
and the tests don't even get listed as failures in the xunit output. You'd
have to have a trained, curious eye to monitor the CIs and verify that all
hosts were actually chugging along. I suppose we can enhance this by
assuming that hosts which don't report results have failed.
However, if we want to enhance this in fstests so that we pick up on these
failures more easily in the future, now would be a good time to evaluate
that too.
Luis
* Re: [PATCH 3/3] xfs: sb_spino_align is not verified
2024-12-07 0:25 ` Luis Chamberlain
@ 2024-12-07 0:32 ` Darrick J. Wong
2024-12-07 0:36 ` Luis Chamberlain
0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2024-12-07 0:32 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Dave Chinner, linux-xfs, Eric Sandeen, Ritesh Harjani,
Carlos Maiolino, Brian Foster, Pankaj Raghav, Daniel Gomez,
Kundan Kumar, gost.dev
On Fri, Dec 06, 2024 at 04:25:22PM -0800, Luis Chamberlain wrote:
> On Thu, Oct 24, 2024 at 01:51:05PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > It's just read in from the superblock and used without doing any
> > validity checks at all on the value.
> >
> > Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> This is 59e43f5479cce106d71c0b91a297c7ad1913176c on v6.13-rc1 now.
>
> This commit broke mounting 32k and 64k bs filesystems on 4k page size systems.
> Oddly, it does not break 16k or 8k bs. I took a quick glance and I can't
> easily identify a fix.
>
> I haven't had a chance yet to find a large page size system to see if
> 32k page size and 64k page size systems are affected as well.
>
> CIs in place did not pick up on this given fstests check script just
> bails out right away, we don't annotate this as a failure on fstests and
> the tests don't even get listed as failures on xunit. You'd have to have
> a trained curious eye to just monitor CIs and verify that all hosts
> actually were chugging along. I suppose we can enhance this by just
> assuming hosts which don't have results are assumed to be a failure.
>
> However if we want to enhance this on fstests so that in the future we
> pick up on these failures more easily it would be good time to evaluate
> that now too.
Known bug, already patched here:
https://lore.kernel.org/linux-xfs/20241126202619.GO9438@frogsfrogsfrogs/
and PR to the release manager here:
https://lore.kernel.org/linux-xfs/173328206660.1159971.4540485910402305562.stg-ugh@frogsfrogsfrogs/
--D
> Luis
>
* Re: [PATCH 3/3] xfs: sb_spino_align is not verified
2024-12-07 0:32 ` Darrick J. Wong
@ 2024-12-07 0:36 ` Luis Chamberlain
2024-12-07 11:34 ` Carlos Maiolino
0 siblings, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2024-12-07 0:36 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, linux-xfs, Eric Sandeen, Ritesh Harjani,
Carlos Maiolino, Brian Foster, Pankaj Raghav, Daniel Gomez,
Kundan Kumar, gost.dev
On Fri, Dec 06, 2024 at 04:32:04PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 06, 2024 at 04:25:22PM -0800, Luis Chamberlain wrote:
> > On Thu, Oct 24, 2024 at 01:51:05PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > It's just read in from the superblock and used without doing any
> > > validity checks at all on the value.
> > >
> > > Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >
> > This is 59e43f5479cce106d71c0b91a297c7ad1913176c on v6.13-r1 now.
> >
> > This commit broke mounting 32k and 64k bs filesystems on 4k page size systems.
> > Oddly, it does not break 16k or 8k bs. I took a quick glance and I can't
> > easily identify a fix.
> >
> > I haven't had a chance yet to find a large page size system to see if
> > 32k page size and 64k page size systems are affected as well.
> >
> > CIs in place did not pick up on this given fstests check script just
> > bails out right away, we don't annotate this as a failure on fstests and
> > the tests don't even get listed as failures on xunit. You'd have to have
> > a trained curious eye to just monitor CIs and verify that all hosts
> > actually were chugging along. I suppose we can enhance this by just
> > assuming hosts which don't have results are assumed to be a failure.
> >
> > However if we want to enhance this on fstests so that in the future we
> > pick up on these failures more easily it would be good time to evaluate
> > that now too.
>
> Known bug, already patched here:
> https://lore.kernel.org/linux-xfs/20241126202619.GO9438@frogsfrogsfrogs/
>
> and PR to the release manager here:
> https://lore.kernel.org/linux-xfs/173328206660.1159971.4540485910402305562.stg-ugh@frogsfrogsfrogs/
Woot, thanks!
Luis
* Re: [PATCH 3/3] xfs: sb_spino_align is not verified
2024-12-07 0:36 ` Luis Chamberlain
@ 2024-12-07 11:34 ` Carlos Maiolino
0 siblings, 0 replies; 23+ messages in thread
From: Carlos Maiolino @ 2024-12-07 11:34 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Eric Sandeen,
Ritesh Harjani, Brian Foster, Pankaj Raghav, Daniel Gomez,
Kundan Kumar, gost.dev
On Fri, Dec 06, 2024 at 04:36:45PM -0800, Luis Chamberlain wrote:
> On Fri, Dec 06, 2024 at 04:32:04PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 06, 2024 at 04:25:22PM -0800, Luis Chamberlain wrote:
> > > On Thu, Oct 24, 2024 at 01:51:05PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > It's just read in from the superblock and used without doing any
> > > > validity checks at all on the value.
> > > >
> > > > Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field")
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > >
> > > This is 59e43f5479cce106d71c0b91a297c7ad1913176c on v6.13-rc1 now.
> > >
> > > This commit broke mounting 32k and 64k bs filesystems on 4k page size systems.
> > > Oddly, it does not break 16k or 8k bs. I took a quick glance and I can't
> > > easily identify a fix.
> > >
> > > I haven't had a chance yet to find a large page size system to see if
> > > 32k page size and 64k page size systems are affected as well.
> > >
> > > The CIs in place did not pick this up because the fstests check
> > > script just bails out right away; we don't annotate this as a
> > > failure in fstests, and the tests don't even get listed as
> > > failures in xunit. You'd need a trained, curious eye monitoring
> > > the CIs to verify that all hosts were actually chugging along. I
> > > suppose we can enhance this by treating hosts that report no
> > > results as failures.
> > >
> > > However, if we want to enhance this in fstests so that we pick
> > > up on these failures more easily in the future, it would be a
> > > good time to evaluate that now too.
> >
> > Known bug, already patched here:
> > https://lore.kernel.org/linux-xfs/20241126202619.GO9438@frogsfrogsfrogs/
> >
> > and PR to the release manager here:
> > https://lore.kernel.org/linux-xfs/173328206660.1159971.4540485910402305562.stg-ugh@frogsfrogsfrogs/
>
> Woot, thanks!
>
> Luis
>
FWIW, I didn't pull these to -rc2, as I wouldn't have enough time to test them
before sending them to Linus.
I'm picking them up first thing for -rc3.
end of thread, other threads:[~2024-12-07 11:34 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-24 2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
2024-10-24 2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
2024-10-24 2:51 ` [PATCH 2/3] xfs: allow sparse inode records at the end of runt AGs Dave Chinner
2024-10-24 17:00 ` Darrick J. Wong
2024-10-25 6:43 ` Dave Chinner
2024-10-25 22:19 ` Darrick J. Wong
2024-10-26 21:47 ` Dave Chinner
2024-10-24 2:51 ` [PATCH 3/3] xfs: sb_spino_align is not verified Dave Chinner
2024-10-24 16:55 ` Darrick J. Wong
2024-10-25 6:33 ` Dave Chinner
2024-12-07 0:25 ` Luis Chamberlain
2024-12-07 0:32 ` Darrick J. Wong
2024-12-07 0:36 ` Luis Chamberlain
2024-12-07 11:34 ` Carlos Maiolino
2024-10-24 13:20 ` [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Brian Foster
2024-10-25 0:48 ` Dave Chinner
2024-10-25 19:33 ` Brian Foster
2024-10-24 16:38 ` Darrick J. Wong
2024-10-25 6:29 ` Dave Chinner
2024-10-29 16:14 ` Eric Sandeen
2024-10-31 11:44 ` Carlos Maiolino
2024-10-31 20:45 ` Eric Sandeen
2024-10-31 22:13 ` Darrick J. Wong