[PATCH 0/2] xfs: fix recovery corruption on s390 w/ nrext64

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] xfs: fix recovery corruption on s390 w/ nrext64
@ 2023-11-10  4:33 Dave Chinner
  2023-11-10  4:33 ` [PATCH 1/2] xfs: inode recovery does not validate the recovered inode Dave Chinner
  2023-11-10  4:33 ` [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally Dave Chinner
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2023-11-10  4:33 UTC (permalink / raw)
  To: linux-xfs; +Cc: zlang

Hi folks,

The following are two patches to fix the corruption bug that
Zorro recently recently reported on big endian platforms when
nrext64 is enabled.

It is caused by recovery zeroing a v2 inode field even on v3 inodes.
Before nrext64, this wasn't an issue because the field in v3 inodes
was just padding (i.e. zeroes) and so writing zero to part of it was
never noticed. However, nrext64 uses this field for the 64 bit data
fork extent count, so zeroing part of it becomes a problem.

The zeroing is done to the xfs_log_dinode, which is held in host
format. Because of the layout of this field, on little endian
machines the di_flush_iter field overlaps the high 16 bits of
di_big_nextents, and so we didn't notice that recovery was
overwriting the the high bits of the extent count because it's
always going to be zero bytes (needs to go over 2^48 extents before
we'd notice) in testing contexts.

However, on big endian machines, the di_flushiter field overlays the
lower 16 bits of the di_big_extents field, and so zeroing
di_flushiter essentially rounds it down to the nearest 64k. For
typical testing contexts, that's effectively zeroing the extent
count.

The fix is two parts. The first patch adds detection of this
specific corruption to the xfs_dinode verifier, and adds a call to
the verifier after recovery has recovered the inode. This catches
corruptions at recovery time, rather than silently writing them back
and not noticing them until the inode is re-read from disk at
runtime.

The second patch fixes the recovery bug. It simply moves the
resetting of the di_flushiter field into the v2 inode specific
flushiter handling branch. That's always been wrong (i.e. since v3
inodes were introduced) but it hasn't mattered until now, so it
marked as fixing the introduction of the nrext64 feature....

-Dave.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] xfs: inode recovery does not validate the recovered inode
  2023-11-10  4:33 [PATCH 0/2] xfs: fix recovery corruption on s390 w/ nrext64 Dave Chinner
@ 2023-11-10  4:33 ` Dave Chinner
  2023-11-10 19:27   ` Darrick J. Wong
  2023-11-10  4:33 ` [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2023-11-10  4:33 UTC (permalink / raw)
  To: linux-xfs; +Cc: zlang

From: Dave Chinner <dchinner@redhat.com>

Discovered when trying to track down a weird recovery corruption
issue that wasn't detected at recovery time.

The specific corruption was a zero extent count field when big
extent counts are in use, and it turns out the dinode verifier
doesn't detect that specific corruption case, either. So fix it too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c   |  3 +++
 fs/xfs/xfs_inode_item_recover.c | 14 +++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index a35781577cad..0f970a0b3382 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -508,6 +508,9 @@ xfs_dinode_verify(
 	if (mode && nextents + naextents > nblocks)
 		return __this_address;
 
+	if (nextents + naextents == 0 && nblocks != 0)
+		return __this_address;
+
 	if (S_ISDIR(mode) && nextents > mp->m_dir_geo->max_extents)
 		return __this_address;
 
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 6b09e2bf2d74..f4c31c2b60d5 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -286,6 +286,7 @@ xlog_recover_inode_commit_pass2(
 	struct xfs_log_dinode		*ldip;
 	uint				isize;
 	int				need_free = 0;
+	xfs_failaddr_t			fa;
 
 	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
 		in_f = item->ri_buf[0].i_addr;
@@ -529,8 +530,19 @@ xlog_recover_inode_commit_pass2(
 	    (dip->di_mode != 0))
 		error = xfs_recover_inode_owner_change(mp, dip, in_f,
 						       buffer_list);
-	/* re-generate the checksum. */
+	/* re-generate the checksum and validate the recovered inode. */
 	xfs_dinode_calc_crc(log->l_mp, dip);
+	fa = xfs_dinode_verify(log->l_mp, in_f->ilf_ino, dip);
+	if (fa) {
+		XFS_CORRUPTION_ERROR(
+			"Bad dinode after recovery",
+				XFS_ERRLEVEL_LOW, mp, dip, sizeof(*dip));
+		xfs_alert(mp,
+			"Metadata corruption detected at %pS, inode 0x%llx",
+			fa, in_f->ilf_ino);
+		error = -EFSCORRUPTED;
+		goto out_release;
+	}
 
 	ASSERT(bp->b_mount == mp);
 	bp->b_flags |= _XBF_LOGRECOVERY;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] xfs: inode recovery does not validate the recovered inode
  2023-11-10  4:33 ` [PATCH 1/2] xfs: inode recovery does not validate the recovered inode Dave Chinner
@ 2023-11-10 19:27   ` Darrick J. Wong
  2023-11-10 20:40     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2023-11-10 19:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, zlang

On Fri, Nov 10, 2023 at 03:33:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Discovered when trying to track down a weird recovery corruption
> issue that wasn't detected at recovery time.
> 
> The specific corruption was a zero extent count field when big
> extent counts are in use, and it turns out the dinode verifier
> doesn't detect that specific corruption case, either. So fix it too.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +++
>  fs/xfs/xfs_inode_item_recover.c | 14 +++++++++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index a35781577cad..0f970a0b3382 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -508,6 +508,9 @@ xfs_dinode_verify(
>  	if (mode && nextents + naextents > nblocks)
>  		return __this_address;
>  
> +	if (nextents + naextents == 0 && nblocks != 0)
> +		return __this_address;
> +
>  	if (S_ISDIR(mode) && nextents > mp->m_dir_geo->max_extents)
>  		return __this_address;
>  
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 6b09e2bf2d74..f4c31c2b60d5 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -286,6 +286,7 @@ xlog_recover_inode_commit_pass2(
>  	struct xfs_log_dinode		*ldip;
>  	uint				isize;
>  	int				need_free = 0;
> +	xfs_failaddr_t			fa;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -529,8 +530,19 @@ xlog_recover_inode_commit_pass2(
>  	    (dip->di_mode != 0))
>  		error = xfs_recover_inode_owner_change(mp, dip, in_f,
>  						       buffer_list);
> -	/* re-generate the checksum. */
> +	/* re-generate the checksum and validate the recovered inode. */
>  	xfs_dinode_calc_crc(log->l_mp, dip);
> +	fa = xfs_dinode_verify(log->l_mp, in_f->ilf_ino, dip);
> +	if (fa) {

Does xlog_recover_dquot_commit_pass2 need to call xfs_dquot_verify as
well?

This patch looks good though,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +		XFS_CORRUPTION_ERROR(
> +			"Bad dinode after recovery",
> +				XFS_ERRLEVEL_LOW, mp, dip, sizeof(*dip));
> +		xfs_alert(mp,
> +			"Metadata corruption detected at %pS, inode 0x%llx",
> +			fa, in_f->ilf_ino);
> +		error = -EFSCORRUPTED;
> +		goto out_release;
> +	}
>  
>  	ASSERT(bp->b_mount == mp);
>  	bp->b_flags |= _XBF_LOGRECOVERY;
> -- 
> 2.42.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] xfs: inode recovery does not validate the recovered inode
  2023-11-10 19:27   ` Darrick J. Wong
@ 2023-11-10 20:40     ` Dave Chinner
  2023-11-17 22:19       ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2023-11-10 20:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, zlang

On Fri, Nov 10, 2023 at 11:27:52AM -0800, Darrick J. Wong wrote:
> On Fri, Nov 10, 2023 at 03:33:13PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Discovered when trying to track down a weird recovery corruption
> > issue that wasn't detected at recovery time.
> > 
> > The specific corruption was a zero extent count field when big
> > extent counts are in use, and it turns out the dinode verifier
> > doesn't detect that specific corruption case, either. So fix it too.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +++
> >  fs/xfs/xfs_inode_item_recover.c | 14 +++++++++++++-
> >  2 files changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index a35781577cad..0f970a0b3382 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -508,6 +508,9 @@ xfs_dinode_verify(
> >  	if (mode && nextents + naextents > nblocks)
> >  		return __this_address;
> >  
> > +	if (nextents + naextents == 0 && nblocks != 0)
> > +		return __this_address;
> > +
> >  	if (S_ISDIR(mode) && nextents > mp->m_dir_geo->max_extents)
> >  		return __this_address;
> >  
> > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > index 6b09e2bf2d74..f4c31c2b60d5 100644
> > --- a/fs/xfs/xfs_inode_item_recover.c
> > +++ b/fs/xfs/xfs_inode_item_recover.c
> > @@ -286,6 +286,7 @@ xlog_recover_inode_commit_pass2(
> >  	struct xfs_log_dinode		*ldip;
> >  	uint				isize;
> >  	int				need_free = 0;
> > +	xfs_failaddr_t			fa;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -529,8 +530,19 @@ xlog_recover_inode_commit_pass2(
> >  	    (dip->di_mode != 0))
> >  		error = xfs_recover_inode_owner_change(mp, dip, in_f,
> >  						       buffer_list);
> > -	/* re-generate the checksum. */
> > +	/* re-generate the checksum and validate the recovered inode. */
> >  	xfs_dinode_calc_crc(log->l_mp, dip);
> > +	fa = xfs_dinode_verify(log->l_mp, in_f->ilf_ino, dip);
> > +	if (fa) {
> 
> Does xlog_recover_dquot_commit_pass2 need to call xfs_dquot_verify as
> well?

Maybe - I haven't looked closely at that, and it depends what the
dquot buffer verifier does. If it's similar to the inode cluster
buffer verifier (i.e. only checks for dquots, doesn't verify the
dquots) then it should do the same thing. I don't have time to do
this right now because I'm OOO for the next week, so maybe you could
check this and send a patch for it?

> This patch looks good though,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] xfs: inode recovery does not validate the recovered inode
  2023-11-10 20:40     ` Dave Chinner
@ 2023-11-17 22:19       ` Darrick J. Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2023-11-17 22:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, zlang

On Sat, Nov 11, 2023 at 07:40:43AM +1100, Dave Chinner wrote:
> On Fri, Nov 10, 2023 at 11:27:52AM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 10, 2023 at 03:33:13PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Discovered when trying to track down a weird recovery corruption
> > > issue that wasn't detected at recovery time.
> > > 
> > > The specific corruption was a zero extent count field when big
> > > extent counts are in use, and it turns out the dinode verifier
> > > doesn't detect that specific corruption case, either. So fix it too.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +++
> > >  fs/xfs/xfs_inode_item_recover.c | 14 +++++++++++++-
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index a35781577cad..0f970a0b3382 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -508,6 +508,9 @@ xfs_dinode_verify(
> > >  	if (mode && nextents + naextents > nblocks)
> > >  		return __this_address;
> > >  
> > > +	if (nextents + naextents == 0 && nblocks != 0)
> > > +		return __this_address;
> > > +
> > >  	if (S_ISDIR(mode) && nextents > mp->m_dir_geo->max_extents)
> > >  		return __this_address;
> > >  
> > > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > > index 6b09e2bf2d74..f4c31c2b60d5 100644
> > > --- a/fs/xfs/xfs_inode_item_recover.c
> > > +++ b/fs/xfs/xfs_inode_item_recover.c
> > > @@ -286,6 +286,7 @@ xlog_recover_inode_commit_pass2(
> > >  	struct xfs_log_dinode		*ldip;
> > >  	uint				isize;
> > >  	int				need_free = 0;
> > > +	xfs_failaddr_t			fa;
> > >  
> > >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> > >  		in_f = item->ri_buf[0].i_addr;
> > > @@ -529,8 +530,19 @@ xlog_recover_inode_commit_pass2(
> > >  	    (dip->di_mode != 0))
> > >  		error = xfs_recover_inode_owner_change(mp, dip, in_f,
> > >  						       buffer_list);
> > > -	/* re-generate the checksum. */
> > > +	/* re-generate the checksum and validate the recovered inode. */
> > >  	xfs_dinode_calc_crc(log->l_mp, dip);
> > > +	fa = xfs_dinode_verify(log->l_mp, in_f->ilf_ino, dip);
> > > +	if (fa) {
> > 
> > Does xlog_recover_dquot_commit_pass2 need to call xfs_dquot_verify as
> > well?
> 
> Maybe - I haven't looked closely at that, and it depends what the
> dquot buffer verifier does. If it's similar to the inode cluster
> buffer verifier (i.e. only checks for dquots, doesn't verify the
> dquots) then it should do the same thing. I don't have time to do
> this right now because I'm OOO for the next week, so maybe you could
> check this and send a patch for it?

I tossed on a patch to do this and after a couple of days of generic/388
and generic/475 spinning I haven't noticed any failures.

--D

> > This patch looks good though,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Thanks!
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally
  2023-11-10  4:33 [PATCH 0/2] xfs: fix recovery corruption on s390 w/ nrext64 Dave Chinner
  2023-11-10  4:33 ` [PATCH 1/2] xfs: inode recovery does not validate the recovered inode Dave Chinner
@ 2023-11-10  4:33 ` Dave Chinner
  2023-11-10 19:32   ` Darrick J. Wong
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2023-11-10  4:33 UTC (permalink / raw)
  To: linux-xfs; +Cc: zlang

From: Dave Chinner <dchinner@redhat.com>

Because on v3 inodes, di_flushiter doesn't exist. It overlaps with
zero padding in the inode, except when NREXT64=1 configurations are
in use and the zero padding is no longer padding but holds the 64
bit extent counter.

This manifests obviously on big endian platforms (e.g. s390) because
the log dinode is in host order and the overlap is the LSBs of the
extent count field. It is not noticed on little endian machines
because the overlap is at the MSB end of the extent count field and
we need to get more than 2^^48 extents in the inode before it
manifests. i.e. the heat death of the universe will occur before we
see the problem in little endian machines.

This is a zero-day issue for NREXT64=1 configuraitons on big endian
machines. Fix it by only clearing di_flushiter on v2 inodes during
recovery.

Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
cc: stable@kernel.org # 5.19+
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item_recover.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index f4c31c2b60d5..dbdab4ce7c44 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -371,24 +371,26 @@ xlog_recover_inode_commit_pass2(
 	 * superblock flag to determine whether we need to look at di_flushiter
 	 * to skip replay when the on disk inode is newer than the log one
 	 */
-	if (!xfs_has_v3inodes(mp) &&
-	    ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
-		/*
-		 * Deal with the wrap case, DI_MAX_FLUSH is less
-		 * than smaller numbers
-		 */
-		if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
-		    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
-			/* do nothing */
-		} else {
-			trace_xfs_log_recover_inode_skip(log, in_f);
-			error = 0;
-			goto out_release;
+	if (!xfs_has_v3inodes(mp)) {
+		if (ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
+			/*
+			 * Deal with the wrap case, DI_MAX_FLUSH is less
+			 * than smaller numbers
+			 */
+			if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
+			    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
+				/* do nothing */
+			} else {
+				trace_xfs_log_recover_inode_skip(log, in_f);
+				error = 0;
+				goto out_release;
+			}
 		}
+
+		/* Take the opportunity to reset the flush iteration count */
+		ldip->di_flushiter = 0;
 	}
 
-	/* Take the opportunity to reset the flush iteration count */
-	ldip->di_flushiter = 0;
 
 	if (unlikely(S_ISREG(ldip->di_mode))) {
 		if ((ldip->di_format != XFS_DINODE_FMT_EXTENTS) &&
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally
  2023-11-10  4:33 ` [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally Dave Chinner
@ 2023-11-10 19:32   ` Darrick J. Wong
  2023-11-10 20:36     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2023-11-10 19:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, zlang

On Fri, Nov 10, 2023 at 03:33:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because on v3 inodes, di_flushiter doesn't exist. It overlaps with
> zero padding in the inode, except when NREXT64=1 configurations are
> in use and the zero padding is no longer padding but holds the 64
> bit extent counter.
> 
> This manifests obviously on big endian platforms (e.g. s390) because
> the log dinode is in host order and the overlap is the LSBs of the
> extent count field. It is not noticed on little endian machines
> because the overlap is at the MSB end of the extent count field and
> we need to get more than 2^^48 extents in the inode before it
> manifests. i.e. the heat death of the universe will occur before we
> see the problem in little endian machines.
> 
> This is a zero-day issue for NREXT64=1 configuraitons on big endian
> machines. Fix it by only clearing di_flushiter on v2 inodes during
> recovery.
> 
> Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
> cc: stable@kernel.org # 5.19+
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode_item_recover.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index f4c31c2b60d5..dbdab4ce7c44 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -371,24 +371,26 @@ xlog_recover_inode_commit_pass2(
>  	 * superblock flag to determine whether we need to look at di_flushiter
>  	 * to skip replay when the on disk inode is newer than the log one
>  	 */
> -	if (!xfs_has_v3inodes(mp) &&
> -	    ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> -		/*
> -		 * Deal with the wrap case, DI_MAX_FLUSH is less
> -		 * than smaller numbers
> -		 */
> -		if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> -		    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> -			/* do nothing */
> -		} else {
> -			trace_xfs_log_recover_inode_skip(log, in_f);
> -			error = 0;
> -			goto out_release;
> +	if (!xfs_has_v3inodes(mp)) {
> +		if (ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> +			/*
> +			 * Deal with the wrap case, DI_MAX_FLUSH is less
> +			 * than smaller numbers
> +			 */
> +			if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> +			    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> +				/* do nothing */
> +			} else {
> +				trace_xfs_log_recover_inode_skip(log, in_f);
> +				error = 0;
> +				goto out_release;
> +			}
>  		}
> +
> +		/* Take the opportunity to reset the flush iteration count */
> +		ldip->di_flushiter = 0;

Hmm.  Well this fixes the zeroday problem, so thank you for getting the
root of this!

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Though hch did suggest reducing the amount of indenting here by
compressing the if tests together.  I can't decide if it's worth
rearranging that old V4 code since none of it's scheduled for removal
until 2030, but it /is/ legacy code that maybe we just don't care to
touch?

<shrug>

--D

>  	}
>  
> -	/* Take the opportunity to reset the flush iteration count */
> -	ldip->di_flushiter = 0;
>  
>  	if (unlikely(S_ISREG(ldip->di_mode))) {
>  		if ((ldip->di_format != XFS_DINODE_FMT_EXTENTS) &&
> -- 
> 2.42.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally
  2023-11-10 19:32   ` Darrick J. Wong
@ 2023-11-10 20:36     ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2023-11-10 20:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, zlang

On Fri, Nov 10, 2023 at 11:32:55AM -0800, Darrick J. Wong wrote:
> On Fri, Nov 10, 2023 at 03:33:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because on v3 inodes, di_flushiter doesn't exist. It overlaps with
> > zero padding in the inode, except when NREXT64=1 configurations are
> > in use and the zero padding is no longer padding but holds the 64
> > bit extent counter.
> > 
> > This manifests obviously on big endian platforms (e.g. s390) because
> > the log dinode is in host order and the overlap is the LSBs of the
> > extent count field. It is not noticed on little endian machines
> > because the overlap is at the MSB end of the extent count field and
> > we need to get more than 2^^48 extents in the inode before it
> > manifests. i.e. the heat death of the universe will occur before we
> > see the problem in little endian machines.
> > 
> > This is a zero-day issue for NREXT64=1 configuraitons on big endian
> > machines. Fix it by only clearing di_flushiter on v2 inodes during
> > recovery.
> > 
> > Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
> > cc: stable@kernel.org # 5.19+
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode_item_recover.c | 32 +++++++++++++++++---------------
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > index f4c31c2b60d5..dbdab4ce7c44 100644
> > --- a/fs/xfs/xfs_inode_item_recover.c
> > +++ b/fs/xfs/xfs_inode_item_recover.c
> > @@ -371,24 +371,26 @@ xlog_recover_inode_commit_pass2(
> >  	 * superblock flag to determine whether we need to look at di_flushiter
> >  	 * to skip replay when the on disk inode is newer than the log one
> >  	 */
> > -	if (!xfs_has_v3inodes(mp) &&
> > -	    ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> > -		/*
> > -		 * Deal with the wrap case, DI_MAX_FLUSH is less
> > -		 * than smaller numbers
> > -		 */
> > -		if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> > -		    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> > -			/* do nothing */
> > -		} else {
> > -			trace_xfs_log_recover_inode_skip(log, in_f);
> > -			error = 0;
> > -			goto out_release;
> > +	if (!xfs_has_v3inodes(mp)) {
> > +		if (ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> > +			/*
> > +			 * Deal with the wrap case, DI_MAX_FLUSH is less
> > +			 * than smaller numbers
> > +			 */
> > +			if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> > +			    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> > +				/* do nothing */
> > +			} else {
> > +				trace_xfs_log_recover_inode_skip(log, in_f);
> > +				error = 0;
> > +				goto out_release;
> > +			}
> >  		}
> > +
> > +		/* Take the opportunity to reset the flush iteration count */
> > +		ldip->di_flushiter = 0;
> 
> Hmm.  Well this fixes the zeroday problem, so thank you for getting the
> root of this!
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Though hch did suggest reducing the amount of indenting here by
> compressing the if tests together.  I can't decide if it's worth
> rearranging that old V4 code since none of it's scheduled for removal
> until 2030, but it /is/ legacy code that maybe we just don't care to
> touch?

I'd prefer not to touch it. It's now all isolated in a
"!xfs_has_v3inodes()" branch, so I think we can largely ignore the
grot for now. If we have to do a larger rework of this code in
future, then we can look at reworking it.

But right now, I really don't feel like risking breaking something
else by doing unnecessary cleanups on code we haven't needed to
touch in over a decade.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-11-17 22:19 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-10  4:33 [PATCH 0/2] xfs: fix recovery corruption on s390 w/ nrext64 Dave Chinner
2023-11-10  4:33 ` [PATCH 1/2] xfs: inode recovery does not validate the recovered inode Dave Chinner
2023-11-10 19:27   ` Darrick J. Wong
2023-11-10 20:40     ` Dave Chinner
2023-11-17 22:19       ` Darrick J. Wong
2023-11-10  4:33 ` [PATCH 2/2] xfs: recovery should not clear di_flushiter unconditionally Dave Chinner
2023-11-10 19:32   ` Darrick J. Wong
2023-11-10 20:36     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox