XFS disaster recovery

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* XFS disaster recovery
@ 2022-02-01 23:07 Sean Caron
  2022-02-01 23:33 ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-01 23:07 UTC (permalink / raw)
  To: linux-xfs, Sean Caron

Hi all,

Me again with another not-backed-up XFS filesystem that's in a little
trouble. Last time I stopped by to discuss my woes, I was told that I
could check in here and get some help reading the tea leaves before I
do anything drastic so I'm doing that :)

Brief backstory: This is a RAID 60 composed of three 18-drive RAID 6
strings of 8 TB disk drives, around 460 TB total capacity. Last week
we had a disk fail out of the array. We replaced the disk and the
recovery hung at around 70%.

We power cycled the machine and enclosure and got the recovery to run
to completion. Just as it finished up, the same string dropped another
drive.

We replaced that drive and started recovery again. It got a fair bit
into the recovery, then hung just as did the first drive recovery, at
around +/- 70%. We power cycled everything again, then started the
recovery. As the recovery was running again, a third disk started to
throw read errors.

At this point, I decided to just stop trying to recover this array so
it's up with two disks down but otherwise assembled. I figured I would
just try to mount ro,norecovery and try to salvage as much as possible
at this point before going any further.

Trying to mount ro,norecovery, I am getting an error:

metadata I/O error in "xfs_trans_read_buf_map at daddr ... len 8 error 74
Metadata CRC error detected at xfs_agf_read_verify+0xd0/0xf0 [xfs],
xfs_agf block ...

I ran an xfs_repair -L -n just to see what it would spit out. It
completes within 15-20 minutes (which I feel might be a good sign,
from my experience, outcomes are inversely proportional to run time),
but the output is implying that it would unlink over 100,000 files
(I'm not sure how many total files are on the filesystem, in terms of
what proportion of loss this would equate to) and it also says:

"Inode allocation btrees are too corrupted, skipping phases 6 and 7"

which sounds a little ominous.

It would be a huge help if someone could help me get a little insight
into this and determine the best way forward at this point to try and
salvage as much as possible.

Happy to provide any data, dmesg output, etc as needed. Please just
let me know what would be helpful for diagnosis.

Thanks so much,

Sean

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-01 23:07 XFS disaster recovery Sean Caron
@ 2022-02-01 23:33 ` Dave Chinner
  2022-02-02  1:20   ` Sean Caron
  2022-02-07 22:03   ` XFS disaster recovery Sean Caron
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2022-02-01 23:33 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Tue, Feb 01, 2022 at 06:07:18PM -0500, Sean Caron wrote:
> Hi all,
> 
> Me again with another not-backed-up XFS filesystem that's in a little
> trouble. Last time I stopped by to discuss my woes, I was told that I
> could check in here and get some help reading the tea leaves before I
> do anything drastic so I'm doing that :)
> 
> Brief backstory: This is a RAID 60 composed of three 18-drive RAID 6
> strings of 8 TB disk drives, around 460 TB total capacity. Last week
> we had a disk fail out of the array. We replaced the disk and the
> recovery hung at around 70%.
> 
> We power cycled the machine and enclosure and got the recovery to run
> to completion. Just as it finished up, the same string dropped another
> drive.
> 
> We replaced that drive and started recovery again. It got a fair bit
> into the recovery, then hung just as did the first drive recovery, at
> around +/- 70%. We power cycled everything again, then started the
> recovery. As the recovery was running again, a third disk started to
> throw read errors.
> 
> At this point, I decided to just stop trying to recover this array so
> it's up with two disks down but otherwise assembled. I figured I would
> just try to mount ro,norecovery and try to salvage as much as possible
> at this point before going any further.
> 
> Trying to mount ro,norecovery, I am getting an error:

Seeing as you've only lost redundancy at this point in time, this
will simply result in trying to mount the filesystem in an
inconsistent state and so you'll see metadata corruptions because
the log has no be replayed.

> metadata I/O error in "xfs_trans_read_buf_map at daddr ... len 8 error 74
> Metadata CRC error detected at xfs_agf_read_verify+0xd0/0xf0 [xfs],
> xfs_agf block ...
> 
> I ran an xfs_repair -L -n just to see what it would spit out. It
> completes within 15-20 minutes (which I feel might be a good sign,
> from my experience, outcomes are inversely proportional to run time),
> but the output is implying that it would unlink over 100,000 files
> (I'm not sure how many total files are on the filesystem, in terms of
> what proportion of loss this would equate to) and it also says:
> 
> "Inode allocation btrees are too corrupted, skipping phases 6 and 7"

This is expected because 'xfs_repair -n' does not recover the log.
Hence you're running checks on an inconsistent fs and repair is
detecting that the inobts are inconsistent so it can't check the
directory structure connectivity and link counts sanely.

What you want to do here is take a metadump of the filesystem (it's
an offline operation) and restore it to a an image file on a
different system (creates a sparse file so just needs to run on a fs
that supports file sizes > 16TB). You can then mount the image file
via "mount -o loop <fs.img> <mntpt>", and it run log recovery on the
image. Then you can unmount it again and see if the resultant
filesystem image contains any corruption via 'xfs_repair -n'.

If there's no problems found, then the original filesysetm is all
good an all you need to do is mount it and everythign should be
there ready for the migration process to non-failing storage.

If there are warnings/repairs needed then you're probably best to
post the output of 'xfs_reapir -n' so we can review it and determine
the best course of action from there.

IOWs, do all the diagnosis/triage of the filesytem state on the
restored metadump images so that we don't risk further damaging the
real storage. If we screw up a restored filesystem image, no big
deal, we can just return it to the original state by restoring it
from the metadump again to try something different.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-01 23:33 ` Dave Chinner
@ 2022-02-02  1:20   ` Sean Caron
  2022-02-02  2:44     ` Dave Chinner
  2022-02-07 22:03   ` XFS disaster recovery Sean Caron
  1 sibling, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-02  1:20 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Thank you for the detailed response, Dave! I downloaded and built the
latest xfsprogs (5.14.2) and tried to run a metadump with the
parameters:

xfs_metadump -g -o -w /dev/md4 /exports/home/work/md4.metadump

It says:

Metadata CRC error detected at 0x56384b41796e, xfs_agf block 0x4d7fffd948/0x1000
xfs_metadump: cannot init perag data (74). Continuing anyway.

It starts counting up inodes and gets to "Copied 418624 of 83032768
inodes (1 of 350 AGs)"

The it stops with an error:

xfs_metadump: inode 2216156864 has unexpected extents

I don't see any disk read errors or SAS errors logged in dmesg. MD
array is still online and running as far as I can tell.

Where should I go from here?

Thanks,

Sean

On Tue, Feb 1, 2022 at 6:33 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Feb 01, 2022 at 06:07:18PM -0500, Sean Caron wrote:
> > Hi all,
> >
> > Me again with another not-backed-up XFS filesystem that's in a little
> > trouble. Last time I stopped by to discuss my woes, I was told that I
> > could check in here and get some help reading the tea leaves before I
> > do anything drastic so I'm doing that :)
> >
> > Brief backstory: This is a RAID 60 composed of three 18-drive RAID 6
> > strings of 8 TB disk drives, around 460 TB total capacity. Last week
> > we had a disk fail out of the array. We replaced the disk and the
> > recovery hung at around 70%.
> >
> > We power cycled the machine and enclosure and got the recovery to run
> > to completion. Just as it finished up, the same string dropped another
> > drive.
> >
> > We replaced that drive and started recovery again. It got a fair bit
> > into the recovery, then hung just as did the first drive recovery, at
> > around +/- 70%. We power cycled everything again, then started the
> > recovery. As the recovery was running again, a third disk started to
> > throw read errors.
> >
> > At this point, I decided to just stop trying to recover this array so
> > it's up with two disks down but otherwise assembled. I figured I would
> > just try to mount ro,norecovery and try to salvage as much as possible
> > at this point before going any further.
> >
> > Trying to mount ro,norecovery, I am getting an error:
>
> Seeing as you've only lost redundancy at this point in time, this
> will simply result in trying to mount the filesystem in an
> inconsistent state and so you'll see metadata corruptions because
> the log has no be replayed.
>
> > metadata I/O error in "xfs_trans_read_buf_map at daddr ... len 8 error 74
> > Metadata CRC error detected at xfs_agf_read_verify+0xd0/0xf0 [xfs],
> > xfs_agf block ...
> >
> > I ran an xfs_repair -L -n just to see what it would spit out. It
> > completes within 15-20 minutes (which I feel might be a good sign,
> > from my experience, outcomes are inversely proportional to run time),
> > but the output is implying that it would unlink over 100,000 files
> > (I'm not sure how many total files are on the filesystem, in terms of
> > what proportion of loss this would equate to) and it also says:
> >
> > "Inode allocation btrees are too corrupted, skipping phases 6 and 7"
>
> This is expected because 'xfs_repair -n' does not recover the log.
> Hence you're running checks on an inconsistent fs and repair is
> detecting that the inobts are inconsistent so it can't check the
> directory structure connectivity and link counts sanely.
>
> What you want to do here is take a metadump of the filesystem (it's
> an offline operation) and restore it to a an image file on a
> different system (creates a sparse file so just needs to run on a fs
> that supports file sizes > 16TB). You can then mount the image file
> via "mount -o loop <fs.img> <mntpt>", and it run log recovery on the
> image. Then you can unmount it again and see if the resultant
> filesystem image contains any corruption via 'xfs_repair -n'.
>
> If there's no problems found, then the original filesysetm is all
> good an all you need to do is mount it and everythign should be
> there ready for the migration process to non-failing storage.
>
> If there are warnings/repairs needed then you're probably best to
> post the output of 'xfs_reapir -n' so we can review it and determine
> the best course of action from there.
>
> IOWs, do all the diagnosis/triage of the filesytem state on the
> restored metadump images so that we don't risk further damaging the
> real storage. If we screw up a restored filesystem image, no big
> deal, we can just return it to the original state by restoring it
> from the metadump again to try something different.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-02  1:20   ` Sean Caron
@ 2022-02-02  2:44     ` Dave Chinner
  2022-02-02  7:42       ` [PATCH] metadump: handle corruption errors without aborting Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-02  2:44 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Tue, Feb 01, 2022 at 08:20:45PM -0500, Sean Caron wrote:
> Thank you for the detailed response, Dave! I downloaded and built the
> latest xfsprogs (5.14.2) and tried to run a metadump with the
> parameters:
> 
> xfs_metadump -g -o -w /dev/md4 /exports/home/work/md4.metadump
> 
> It says:
> 
> Metadata CRC error detected at 0x56384b41796e, xfs_agf block 0x4d7fffd948/0x1000
> xfs_metadump: cannot init perag data (74). Continuing anyway.
> 
> It starts counting up inodes and gets to "Copied 418624 of 83032768
> inodes (1 of 350 AGs)"
> 
> The it stops with an error:
> 
> xfs_metadump: inode 2216156864 has unexpected extents

Not promising - that's a device inode (blk, chr, fifo or sock) that
appears to have extents in the data fork. That's indicative of the
inode cluster containing garbage, but unfortunately the error
propagation from a bad inode appears to abort the rest of the
metadump.

That looks like a bug in metadump to me.

Let me confirm this and work out how it should behave and I'll
send you a patch to avoid this issue.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] metadump: handle corruption errors without aborting
  2022-02-02  2:44     ` Dave Chinner
@ 2022-02-02  7:42       ` Dave Chinner
  2022-02-02 18:49         ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-02  7:42 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs


From: Dave Chinner <dchinner@redhat.com>

Sean Caron reported that a metadump terminated after givin gthis
warning:

xfs_metadump: inode 2216156864 has unexpected extents

Metadump is supposed to ignore corruptions and continue dumping the
filesystem as best it can. Whilst it warns about many situations
where it can't fully dump structures, it should stop processing that
structure and continue with the next one until the entire filesystem
has been processed.

Unfortunately, some warning conditions also return an "abort" error
status, causing metadump to abort if that condition is hit. Most of
these abort conditions should really be "continue on next object"
conditions so that the we attempt to dump the rest of the
filesystem.

Fix the returns for warnings that incorrectly cause aborts
such that the only abort conditions are read errors when
"stop-on-read-error" semantics are specified. Also make the return
values consistently mean abort/continue rather than returning -errno
to mean "stop because read error" and then trying to infer what
the error means in callers without the context it occurred in.

Reported-by: Sean Caron <scaron@umich.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---

Sean,

Can you please apply this patch to your xfsprogs source tree and
rebuild it? This should let metadump continue past the corrupt
inodes it aborted on and run through to completion.

-Dave.

 db/metadump.c | 94 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 47 insertions(+), 47 deletions(-)

diff --git a/db/metadump.c b/db/metadump.c
index 96b098b0eaca..9b32b88a3c50 100644
--- a/db/metadump.c
+++ b/db/metadump.c
@@ -1645,7 +1645,7 @@ process_symlink_block(
 {
 	struct bbmap	map;
 	char		*link;
-	int		ret = 0;
+	int		rval = 1;
 
 	push_cur();
 	map.nmaps = 1;
@@ -1658,8 +1658,7 @@ process_symlink_block(
 
 		print_warning("cannot read %s block %u/%u (%llu)",
 				typtab[btype].name, agno, agbno, s);
-		if (stop_on_read_error)
-			ret = -1;
+		rval = !stop_on_read_error;
 		goto out_pop;
 	}
 	link = iocur_top->data;
@@ -1682,10 +1681,11 @@ process_symlink_block(
 	}
 
 	iocur_top->need_crc = 1;
-	ret = write_buf(iocur_top);
+	if (write_buf(iocur_top))
+		rval = 0;
 out_pop:
 	pop_cur();
-	return ret;
+	return rval;
 }
 
 #define MAX_REMOTE_VALS		4095
@@ -1843,8 +1843,8 @@ process_single_fsb_objects(
 	typnm_t		btype,
 	xfs_fileoff_t	last)
 {
+	int		rval = 1;
 	char		*dp;
-	int		ret = 0;
 	int		i;
 
 	for (i = 0; i < c; i++) {
@@ -1858,8 +1858,7 @@ process_single_fsb_objects(
 
 			print_warning("cannot read %s block %u/%u (%llu)",
 					typtab[btype].name, agno, agbno, s);
-			if (stop_on_read_error)
-				ret = -EIO;
+			rval = !stop_on_read_error;
 			goto out_pop;
 
 		}
@@ -1925,16 +1924,17 @@ process_single_fsb_objects(
 		}
 
 write:
-		ret = write_buf(iocur_top);
+		if (write_buf(iocur_top))
+			rval = 0;
 out_pop:
 		pop_cur();
-		if (ret)
+		if (!rval)
 			break;
 		o++;
 		s++;
 	}
 
-	return ret;
+	return rval;
 }
 
 /*
@@ -1952,7 +1952,7 @@ process_multi_fsb_dir(
 	xfs_fileoff_t	last)
 {
 	char		*dp;
-	int		ret = 0;
+	int		rval = 1;
 
 	while (c > 0) {
 		unsigned int	bm_len;
@@ -1978,8 +1978,7 @@ process_multi_fsb_dir(
 
 				print_warning("cannot read %s block %u/%u (%llu)",
 						typtab[btype].name, agno, agbno, s);
-				if (stop_on_read_error)
-					ret = -1;
+				rval = !stop_on_read_error;
 				goto out_pop;
 
 			}
@@ -1998,18 +1997,19 @@ process_multi_fsb_dir(
 			}
 			iocur_top->need_crc = 1;
 write:
-			ret = write_buf(iocur_top);
+			if (write_buf(iocur_top))
+				rval = 0;
 out_pop:
 			pop_cur();
 			mfsb_map.nmaps = 0;
-			if (ret)
+			if (!rval)
 				break;
 		}
 		c -= bm_len;
 		s += bm_len;
 	}
 
-	return ret;
+	return rval;
 }
 
 static bool
@@ -2039,15 +2039,15 @@ process_multi_fsb_objects(
 		return process_symlink_block(o, s, c, btype, last);
 	default:
 		print_warning("bad type for multi-fsb object %d", btype);
-		return -EINVAL;
+		return 1;
 	}
 }
 
 /* inode copy routines */
 static int
 process_bmbt_reclist(
-	xfs_bmbt_rec_t 		*rp,
-	int 			numrecs,
+	xfs_bmbt_rec_t		*rp,
+	int			numrecs,
 	typnm_t			btype)
 {
 	int			i;
@@ -2059,7 +2059,7 @@ process_bmbt_reclist(
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
 	bool			is_multi_fsb = is_multi_fsb_object(mp, btype);
-	int			error;
+	int			rval = 1;
 
 	if (btype == TYP_DATA)
 		return 1;
@@ -2123,16 +2123,16 @@ process_bmbt_reclist(
 
 		/* multi-extent blocks require special handling */
 		if (is_multi_fsb)
-			error = process_multi_fsb_objects(o, s, c, btype,
+			rval = process_multi_fsb_objects(o, s, c, btype,
 					last);
 		else
-			error = process_single_fsb_objects(o, s, c, btype,
+			rval = process_single_fsb_objects(o, s, c, btype,
 					last);
-		if (error)
-			return 0;
+		if (!rval)
+			break;
 	}
 
-	return 1;
+	return rval;
 }
 
 static int
@@ -2331,7 +2331,7 @@ process_inode_data(
 	return 1;
 }
 
-static int
+static void
 process_dev_inode(
 	xfs_dinode_t		*dip)
 {
@@ -2339,15 +2339,13 @@ process_dev_inode(
 		if (show_warnings)
 			print_warning("inode %llu has unexpected extents",
 				      (unsigned long long)cur_ino);
-		return 0;
-	} else {
-		if (zero_stale_data) {
-			unsigned int	size = sizeof(xfs_dev_t);
+		return;
+	}
+	if (zero_stale_data) {
+		unsigned int	size = sizeof(xfs_dev_t);
 
-			memset(XFS_DFORK_DPTR(dip) + size, 0,
-					XFS_DFORK_DSIZE(dip, mp) - size);
-		}
-		return 1;
+		memset(XFS_DFORK_DPTR(dip) + size, 0,
+				XFS_DFORK_DSIZE(dip, mp) - size);
 	}
 }
 
@@ -2365,11 +2363,10 @@ process_inode(
 	xfs_dinode_t 		*dip,
 	bool			free_inode)
 {
-	int			success;
+	int			rval = 1;
 	bool			crc_was_ok = false; /* no recalc by default */
 	bool			need_new_crc = false;
 
-	success = 1;
 	cur_ino = XFS_AGINO_TO_INO(mp, agno, agino);
 
 	/* we only care about crc recalculation if we will modify the inode. */
@@ -2390,32 +2387,34 @@ process_inode(
 	/* copy appropriate data fork metadata */
 	switch (be16_to_cpu(dip->di_mode) & S_IFMT) {
 		case S_IFDIR:
-			success = process_inode_data(dip, TYP_DIR2);
+			rval = process_inode_data(dip, TYP_DIR2);
 			if (dip->di_format == XFS_DINODE_FMT_LOCAL)
 				need_new_crc = 1;
 			break;
 		case S_IFLNK:
-			success = process_inode_data(dip, TYP_SYMLINK);
+			rval = process_inode_data(dip, TYP_SYMLINK);
 			if (dip->di_format == XFS_DINODE_FMT_LOCAL)
 				need_new_crc = 1;
 			break;
 		case S_IFREG:
-			success = process_inode_data(dip, TYP_DATA);
+			rval = process_inode_data(dip, TYP_DATA);
 			break;
 		case S_IFIFO:
 		case S_IFCHR:
 		case S_IFBLK:
 		case S_IFSOCK:
-			success = process_dev_inode(dip);
+			process_dev_inode(dip);
 			need_new_crc = 1;
 			break;
 		default:
 			break;
 	}
 	nametable_clear();
+	if (!rval)
+		goto done;
 
 	/* copy extended attributes if they exist and forkoff is valid */
-	if (success && XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
+	if (XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
 		attr_data.remote_val_count = 0;
 		switch (dip->di_aformat) {
 			case XFS_DINODE_FMT_LOCAL:
@@ -2425,11 +2424,11 @@ process_inode(
 				break;
 
 			case XFS_DINODE_FMT_EXTENTS:
-				success = process_exinode(dip, TYP_ATTR);
+				rval = process_exinode(dip, TYP_ATTR);
 				break;
 
 			case XFS_DINODE_FMT_BTREE:
-				success = process_btinode(dip, TYP_ATTR);
+				rval = process_btinode(dip, TYP_ATTR);
 				break;
 		}
 		nametable_clear();
@@ -2442,7 +2441,8 @@ done:
 
 	if (crc_was_ok && need_new_crc)
 		libxfs_dinode_calc_crc(mp, dip);
-	return success;
+
+	return rval;
 }
 
 static uint32_t	inodes_copied;
@@ -2541,7 +2541,7 @@ copy_inode_chunk(
 
 			/* process_inode handles free inodes, too */
 			if (!process_inode(agno, agino + ioff + i, dip,
-			    XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
+					XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
 				goto pop_out;
 
 			inodes_copied++;
@@ -2800,7 +2800,7 @@ copy_ino(
 	xfs_agblock_t		agbno;
 	xfs_agino_t		agino;
 	int			offset;
-	int			rval = 0;
+	int			rval = 1;
 
 	if (ino == 0 || ino == NULLFSINO)
 		return 1;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-02  7:42       ` [PATCH] metadump: handle corruption errors without aborting Dave Chinner
@ 2022-02-02 18:49         ` Sean Caron
  2022-02-02 19:43           ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-02 18:49 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Hi Dave,

Thank you! I tried copying and pasting this into a file and applying it with:

patch < thispatchfile

against both the 5.14.2 release and xfsprogs-dev Git pull and I'm
getting errors from patch.

I also tried using "git patch" and it told me the patch does not apply.

I tried applying the patch by hand to the 5.14.2 metadump.c but I get
a compilation error in process_dev_inode but I'm not sure if that's
because I made a mistake or because the patch is expecting other
content to be there, that isn't.

I'm sorry for the ignorance but how would I make use of this?

Thanks,

Sean

On Wed, Feb 2, 2022 at 2:42 AM Dave Chinner <david@fromorbit.com> wrote:
>
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Sean Caron reported that a metadump terminated after givin gthis
> warning:
>
> xfs_metadump: inode 2216156864 has unexpected extents
>
> Metadump is supposed to ignore corruptions and continue dumping the
> filesystem as best it can. Whilst it warns about many situations
> where it can't fully dump structures, it should stop processing that
> structure and continue with the next one until the entire filesystem
> has been processed.
>
> Unfortunately, some warning conditions also return an "abort" error
> status, causing metadump to abort if that condition is hit. Most of
> these abort conditions should really be "continue on next object"
> conditions so that the we attempt to dump the rest of the
> filesystem.
>
> Fix the returns for warnings that incorrectly cause aborts
> such that the only abort conditions are read errors when
> "stop-on-read-error" semantics are specified. Also make the return
> values consistently mean abort/continue rather than returning -errno
> to mean "stop because read error" and then trying to infer what
> the error means in callers without the context it occurred in.
>
> Reported-by: Sean Caron <scaron@umich.edu>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>
> Sean,
>
> Can you please apply this patch to your xfsprogs source tree and
> rebuild it? This should let metadump continue past the corrupt
> inodes it aborted on and run through to completion.
>
> -Dave.
>
>  db/metadump.c | 94 +++++++++++++++++++++++++++++------------------------------
>  1 file changed, 47 insertions(+), 47 deletions(-)
>
> diff --git a/db/metadump.c b/db/metadump.c
> index 96b098b0eaca..9b32b88a3c50 100644
> --- a/db/metadump.c
> +++ b/db/metadump.c
> @@ -1645,7 +1645,7 @@ process_symlink_block(
>  {
>         struct bbmap    map;
>         char            *link;
> -       int             ret = 0;
> +       int             rval = 1;
>
>         push_cur();
>         map.nmaps = 1;
> @@ -1658,8 +1658,7 @@ process_symlink_block(
>
>                 print_warning("cannot read %s block %u/%u (%llu)",
>                                 typtab[btype].name, agno, agbno, s);
> -               if (stop_on_read_error)
> -                       ret = -1;
> +               rval = !stop_on_read_error;
>                 goto out_pop;
>         }
>         link = iocur_top->data;
> @@ -1682,10 +1681,11 @@ process_symlink_block(
>         }
>
>         iocur_top->need_crc = 1;
> -       ret = write_buf(iocur_top);
> +       if (write_buf(iocur_top))
> +               rval = 0;
>  out_pop:
>         pop_cur();
> -       return ret;
> +       return rval;
>  }
>
>  #define MAX_REMOTE_VALS                4095
> @@ -1843,8 +1843,8 @@ process_single_fsb_objects(
>         typnm_t         btype,
>         xfs_fileoff_t   last)
>  {
> +       int             rval = 1;
>         char            *dp;
> -       int             ret = 0;
>         int             i;
>
>         for (i = 0; i < c; i++) {
> @@ -1858,8 +1858,7 @@ process_single_fsb_objects(
>
>                         print_warning("cannot read %s block %u/%u (%llu)",
>                                         typtab[btype].name, agno, agbno, s);
> -                       if (stop_on_read_error)
> -                               ret = -EIO;
> +                       rval = !stop_on_read_error;
>                         goto out_pop;
>
>                 }
> @@ -1925,16 +1924,17 @@ process_single_fsb_objects(
>                 }
>
>  write:
> -               ret = write_buf(iocur_top);
> +               if (write_buf(iocur_top))
> +                       rval = 0;
>  out_pop:
>                 pop_cur();
> -               if (ret)
> +               if (!rval)
>                         break;
>                 o++;
>                 s++;
>         }
>
> -       return ret;
> +       return rval;
>  }
>
>  /*
> @@ -1952,7 +1952,7 @@ process_multi_fsb_dir(
>         xfs_fileoff_t   last)
>  {
>         char            *dp;
> -       int             ret = 0;
> +       int             rval = 1;
>
>         while (c > 0) {
>                 unsigned int    bm_len;
> @@ -1978,8 +1978,7 @@ process_multi_fsb_dir(
>
>                                 print_warning("cannot read %s block %u/%u (%llu)",
>                                                 typtab[btype].name, agno, agbno, s);
> -                               if (stop_on_read_error)
> -                                       ret = -1;
> +                               rval = !stop_on_read_error;
>                                 goto out_pop;
>
>                         }
> @@ -1998,18 +1997,19 @@ process_multi_fsb_dir(
>                         }
>                         iocur_top->need_crc = 1;
>  write:
> -                       ret = write_buf(iocur_top);
> +                       if (write_buf(iocur_top))
> +                               rval = 0;
>  out_pop:
>                         pop_cur();
>                         mfsb_map.nmaps = 0;
> -                       if (ret)
> +                       if (!rval)
>                                 break;
>                 }
>                 c -= bm_len;
>                 s += bm_len;
>         }
>
> -       return ret;
> +       return rval;
>  }
>
>  static bool
> @@ -2039,15 +2039,15 @@ process_multi_fsb_objects(
>                 return process_symlink_block(o, s, c, btype, last);
>         default:
>                 print_warning("bad type for multi-fsb object %d", btype);
> -               return -EINVAL;
> +               return 1;
>         }
>  }
>
>  /* inode copy routines */
>  static int
>  process_bmbt_reclist(
> -       xfs_bmbt_rec_t          *rp,
> -       int                     numrecs,
> +       xfs_bmbt_rec_t          *rp,
> +       int                     numrecs,
>         typnm_t                 btype)
>  {
>         int                     i;
> @@ -2059,7 +2059,7 @@ process_bmbt_reclist(
>         xfs_agnumber_t          agno;
>         xfs_agblock_t           agbno;
>         bool                    is_multi_fsb = is_multi_fsb_object(mp, btype);
> -       int                     error;
> +       int                     rval = 1;
>
>         if (btype == TYP_DATA)
>                 return 1;
> @@ -2123,16 +2123,16 @@ process_bmbt_reclist(
>
>                 /* multi-extent blocks require special handling */
>                 if (is_multi_fsb)
> -                       error = process_multi_fsb_objects(o, s, c, btype,
> +                       rval = process_multi_fsb_objects(o, s, c, btype,
>                                         last);
>                 else
> -                       error = process_single_fsb_objects(o, s, c, btype,
> +                       rval = process_single_fsb_objects(o, s, c, btype,
>                                         last);
> -               if (error)
> -                       return 0;
> +               if (!rval)
> +                       break;
>         }
>
> -       return 1;
> +       return rval;
>  }
>
>  static int
> @@ -2331,7 +2331,7 @@ process_inode_data(
>         return 1;
>  }
>
> -static int
> +static void
>  process_dev_inode(
>         xfs_dinode_t            *dip)
>  {
> @@ -2339,15 +2339,13 @@ process_dev_inode(
>                 if (show_warnings)
>                         print_warning("inode %llu has unexpected extents",
>                                       (unsigned long long)cur_ino);
> -               return 0;
> -       } else {
> -               if (zero_stale_data) {
> -                       unsigned int    size = sizeof(xfs_dev_t);
> +               return;
> +       }
> +       if (zero_stale_data) {
> +               unsigned int    size = sizeof(xfs_dev_t);
>
> -                       memset(XFS_DFORK_DPTR(dip) + size, 0,
> -                                       XFS_DFORK_DSIZE(dip, mp) - size);
> -               }
> -               return 1;
> +               memset(XFS_DFORK_DPTR(dip) + size, 0,
> +                               XFS_DFORK_DSIZE(dip, mp) - size);
>         }
>  }
>
> @@ -2365,11 +2363,10 @@ process_inode(
>         xfs_dinode_t            *dip,
>         bool                    free_inode)
>  {
> -       int                     success;
> +       int                     rval = 1;
>         bool                    crc_was_ok = false; /* no recalc by default */
>         bool                    need_new_crc = false;
>
> -       success = 1;
>         cur_ino = XFS_AGINO_TO_INO(mp, agno, agino);
>
>         /* we only care about crc recalculation if we will modify the inode. */
> @@ -2390,32 +2387,34 @@ process_inode(
>         /* copy appropriate data fork metadata */
>         switch (be16_to_cpu(dip->di_mode) & S_IFMT) {
>                 case S_IFDIR:
> -                       success = process_inode_data(dip, TYP_DIR2);
> +                       rval = process_inode_data(dip, TYP_DIR2);
>                         if (dip->di_format == XFS_DINODE_FMT_LOCAL)
>                                 need_new_crc = 1;
>                         break;
>                 case S_IFLNK:
> -                       success = process_inode_data(dip, TYP_SYMLINK);
> +                       rval = process_inode_data(dip, TYP_SYMLINK);
>                         if (dip->di_format == XFS_DINODE_FMT_LOCAL)
>                                 need_new_crc = 1;
>                         break;
>                 case S_IFREG:
> -                       success = process_inode_data(dip, TYP_DATA);
> +                       rval = process_inode_data(dip, TYP_DATA);
>                         break;
>                 case S_IFIFO:
>                 case S_IFCHR:
>                 case S_IFBLK:
>                 case S_IFSOCK:
> -                       success = process_dev_inode(dip);
> +                       process_dev_inode(dip);
>                         need_new_crc = 1;
>                         break;
>                 default:
>                         break;
>         }
>         nametable_clear();
> +       if (!rval)
> +               goto done;
>
>         /* copy extended attributes if they exist and forkoff is valid */
> -       if (success && XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
> +       if (XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
>                 attr_data.remote_val_count = 0;
>                 switch (dip->di_aformat) {
>                         case XFS_DINODE_FMT_LOCAL:
> @@ -2425,11 +2424,11 @@ process_inode(
>                                 break;
>
>                         case XFS_DINODE_FMT_EXTENTS:
> -                               success = process_exinode(dip, TYP_ATTR);
> +                               rval = process_exinode(dip, TYP_ATTR);
>                                 break;
>
>                         case XFS_DINODE_FMT_BTREE:
> -                               success = process_btinode(dip, TYP_ATTR);
> +                               rval = process_btinode(dip, TYP_ATTR);
>                                 break;
>                 }
>                 nametable_clear();
> @@ -2442,7 +2441,8 @@ done:
>
>         if (crc_was_ok && need_new_crc)
>                 libxfs_dinode_calc_crc(mp, dip);
> -       return success;
> +
> +       return rval;
>  }
>
>  static uint32_t        inodes_copied;
> @@ -2541,7 +2541,7 @@ copy_inode_chunk(
>
>                         /* process_inode handles free inodes, too */
>                         if (!process_inode(agno, agino + ioff + i, dip,
> -                           XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
> +                                       XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
>                                 goto pop_out;
>
>                         inodes_copied++;
> @@ -2800,7 +2800,7 @@ copy_ino(
>         xfs_agblock_t           agbno;
>         xfs_agino_t             agino;
>         int                     offset;
> -       int                     rval = 0;
> +       int                     rval = 1;
>
>         if (ino == 0 || ino == NULLFSINO)
>                 return 1;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-02 18:49         ` Sean Caron
@ 2022-02-02 19:43           ` Sean Caron
  2022-02-02 20:18             ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-02 19:43 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

OK, I got it applied with:

git apply --ignore-space-change --ignore-whitespace

I've got it running and so far it looks good. It's gotten past the
inode number where exited before and now just prints as an
informational message:

inode xx has unexpected extents

I'll let this run and follow up on this thread if I have any more
issues with xfs_metadump in particular.

Thanks again,

Sean

On Wed, Feb 2, 2022 at 1:49 PM Sean Caron <scaron@umich.edu> wrote:
>
> Hi Dave,
>
> Thank you! I tried copying and pasting this into a file and applying it with:
>
> patch < thispatchfile
>
> against both the 5.14.2 release and xfsprogs-dev Git pull and I'm
> getting errors from patch.
>
> I also tried using "git patch" and it told me the patch does not apply.
>
> I tried applying the patch by hand to the 5.14.2 metadump.c but I get
> a compilation error in process_dev_inode but I'm not sure if that's
> because I made a mistake or because the patch is expecting other
> content to be there, that isn't.
>
> I'm sorry for the ignorance but how would I make use of this?
>
> Thanks,
>
> Sean
>
> On Wed, Feb 2, 2022 at 2:42 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Sean Caron reported that a metadump terminated after givin gthis
> > warning:
> >
> > xfs_metadump: inode 2216156864 has unexpected extents
> >
> > Metadump is supposed to ignore corruptions and continue dumping the
> > filesystem as best it can. Whilst it warns about many situations
> > where it can't fully dump structures, it should stop processing that
> > structure and continue with the next one until the entire filesystem
> > has been processed.
> >
> > Unfortunately, some warning conditions also return an "abort" error
> > status, causing metadump to abort if that condition is hit. Most of
> > these abort conditions should really be "continue on next object"
> > conditions so that the we attempt to dump the rest of the
> > filesystem.
> >
> > Fix the returns for warnings that incorrectly cause aborts
> > such that the only abort conditions are read errors when
> > "stop-on-read-error" semantics are specified. Also make the return
> > values consistently mean abort/continue rather than returning -errno
> > to mean "stop because read error" and then trying to infer what
> > the error means in callers without the context it occurred in.
> >
> > Reported-by: Sean Caron <scaron@umich.edu>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >
> > Sean,
> >
> > Can you please apply this patch to your xfsprogs source tree and
> > rebuild it? This should let metadump continue past the corrupt
> > inodes it aborted on and run through to completion.
> >
> > -Dave.
> >
> >  db/metadump.c | 94 +++++++++++++++++++++++++++++------------------------------
> >  1 file changed, 47 insertions(+), 47 deletions(-)
> >
> > diff --git a/db/metadump.c b/db/metadump.c
> > index 96b098b0eaca..9b32b88a3c50 100644
> > --- a/db/metadump.c
> > +++ b/db/metadump.c
> > @@ -1645,7 +1645,7 @@ process_symlink_block(
> >  {
> >         struct bbmap    map;
> >         char            *link;
> > -       int             ret = 0;
> > +       int             rval = 1;
> >
> >         push_cur();
> >         map.nmaps = 1;
> > @@ -1658,8 +1658,7 @@ process_symlink_block(
> >
> >                 print_warning("cannot read %s block %u/%u (%llu)",
> >                                 typtab[btype].name, agno, agbno, s);
> > -               if (stop_on_read_error)
> > -                       ret = -1;
> > +               rval = !stop_on_read_error;
> >                 goto out_pop;
> >         }
> >         link = iocur_top->data;
> > @@ -1682,10 +1681,11 @@ process_symlink_block(
> >         }
> >
> >         iocur_top->need_crc = 1;
> > -       ret = write_buf(iocur_top);
> > +       if (write_buf(iocur_top))
> > +               rval = 0;
> >  out_pop:
> >         pop_cur();
> > -       return ret;
> > +       return rval;
> >  }
> >
> >  #define MAX_REMOTE_VALS                4095
> > @@ -1843,8 +1843,8 @@ process_single_fsb_objects(
> >         typnm_t         btype,
> >         xfs_fileoff_t   last)
> >  {
> > +       int             rval = 1;
> >         char            *dp;
> > -       int             ret = 0;
> >         int             i;
> >
> >         for (i = 0; i < c; i++) {
> > @@ -1858,8 +1858,7 @@ process_single_fsb_objects(
> >
> >                         print_warning("cannot read %s block %u/%u (%llu)",
> >                                         typtab[btype].name, agno, agbno, s);
> > -                       if (stop_on_read_error)
> > -                               ret = -EIO;
> > +                       rval = !stop_on_read_error;
> >                         goto out_pop;
> >
> >                 }
> > @@ -1925,16 +1924,17 @@ process_single_fsb_objects(
> >                 }
> >
> >  write:
> > -               ret = write_buf(iocur_top);
> > +               if (write_buf(iocur_top))
> > +                       rval = 0;
> >  out_pop:
> >                 pop_cur();
> > -               if (ret)
> > +               if (!rval)
> >                         break;
> >                 o++;
> >                 s++;
> >         }
> >
> > -       return ret;
> > +       return rval;
> >  }
> >
> >  /*
> > @@ -1952,7 +1952,7 @@ process_multi_fsb_dir(
> >         xfs_fileoff_t   last)
> >  {
> >         char            *dp;
> > -       int             ret = 0;
> > +       int             rval = 1;
> >
> >         while (c > 0) {
> >                 unsigned int    bm_len;
> > @@ -1978,8 +1978,7 @@ process_multi_fsb_dir(
> >
> >                                 print_warning("cannot read %s block %u/%u (%llu)",
> >                                                 typtab[btype].name, agno, agbno, s);
> > -                               if (stop_on_read_error)
> > -                                       ret = -1;
> > +                               rval = !stop_on_read_error;
> >                                 goto out_pop;
> >
> >                         }
> > @@ -1998,18 +1997,19 @@ process_multi_fsb_dir(
> >                         }
> >                         iocur_top->need_crc = 1;
> >  write:
> > -                       ret = write_buf(iocur_top);
> > +                       if (write_buf(iocur_top))
> > +                               rval = 0;
> >  out_pop:
> >                         pop_cur();
> >                         mfsb_map.nmaps = 0;
> > -                       if (ret)
> > +                       if (!rval)
> >                                 break;
> >                 }
> >                 c -= bm_len;
> >                 s += bm_len;
> >         }
> >
> > -       return ret;
> > +       return rval;
> >  }
> >
> >  static bool
> > @@ -2039,15 +2039,15 @@ process_multi_fsb_objects(
> >                 return process_symlink_block(o, s, c, btype, last);
> >         default:
> >                 print_warning("bad type for multi-fsb object %d", btype);
> > -               return -EINVAL;
> > +               return 1;
> >         }
> >  }
> >
> >  /* inode copy routines */
> >  static int
> >  process_bmbt_reclist(
> > -       xfs_bmbt_rec_t          *rp,
> > -       int                     numrecs,
> > +       xfs_bmbt_rec_t          *rp,
> > +       int                     numrecs,
> >         typnm_t                 btype)
> >  {
> >         int                     i;
> > @@ -2059,7 +2059,7 @@ process_bmbt_reclist(
> >         xfs_agnumber_t          agno;
> >         xfs_agblock_t           agbno;
> >         bool                    is_multi_fsb = is_multi_fsb_object(mp, btype);
> > -       int                     error;
> > +       int                     rval = 1;
> >
> >         if (btype == TYP_DATA)
> >                 return 1;
> > @@ -2123,16 +2123,16 @@ process_bmbt_reclist(
> >
> >                 /* multi-extent blocks require special handling */
> >                 if (is_multi_fsb)
> > -                       error = process_multi_fsb_objects(o, s, c, btype,
> > +                       rval = process_multi_fsb_objects(o, s, c, btype,
> >                                         last);
> >                 else
> > -                       error = process_single_fsb_objects(o, s, c, btype,
> > +                       rval = process_single_fsb_objects(o, s, c, btype,
> >                                         last);
> > -               if (error)
> > -                       return 0;
> > +               if (!rval)
> > +                       break;
> >         }
> >
> > -       return 1;
> > +       return rval;
> >  }
> >
> >  static int
> > @@ -2331,7 +2331,7 @@ process_inode_data(
> >         return 1;
> >  }
> >
> > -static int
> > +static void
> >  process_dev_inode(
> >         xfs_dinode_t            *dip)
> >  {
> > @@ -2339,15 +2339,13 @@ process_dev_inode(
> >                 if (show_warnings)
> >                         print_warning("inode %llu has unexpected extents",
> >                                       (unsigned long long)cur_ino);
> > -               return 0;
> > -       } else {
> > -               if (zero_stale_data) {
> > -                       unsigned int    size = sizeof(xfs_dev_t);
> > +               return;
> > +       }
> > +       if (zero_stale_data) {
> > +               unsigned int    size = sizeof(xfs_dev_t);
> >
> > -                       memset(XFS_DFORK_DPTR(dip) + size, 0,
> > -                                       XFS_DFORK_DSIZE(dip, mp) - size);
> > -               }
> > -               return 1;
> > +               memset(XFS_DFORK_DPTR(dip) + size, 0,
> > +                               XFS_DFORK_DSIZE(dip, mp) - size);
> >         }
> >  }
> >
> > @@ -2365,11 +2363,10 @@ process_inode(
> >         xfs_dinode_t            *dip,
> >         bool                    free_inode)
> >  {
> > -       int                     success;
> > +       int                     rval = 1;
> >         bool                    crc_was_ok = false; /* no recalc by default */
> >         bool                    need_new_crc = false;
> >
> > -       success = 1;
> >         cur_ino = XFS_AGINO_TO_INO(mp, agno, agino);
> >
> >         /* we only care about crc recalculation if we will modify the inode. */
> > @@ -2390,32 +2387,34 @@ process_inode(
> >         /* copy appropriate data fork metadata */
> >         switch (be16_to_cpu(dip->di_mode) & S_IFMT) {
> >                 case S_IFDIR:
> > -                       success = process_inode_data(dip, TYP_DIR2);
> > +                       rval = process_inode_data(dip, TYP_DIR2);
> >                         if (dip->di_format == XFS_DINODE_FMT_LOCAL)
> >                                 need_new_crc = 1;
> >                         break;
> >                 case S_IFLNK:
> > -                       success = process_inode_data(dip, TYP_SYMLINK);
> > +                       rval = process_inode_data(dip, TYP_SYMLINK);
> >                         if (dip->di_format == XFS_DINODE_FMT_LOCAL)
> >                                 need_new_crc = 1;
> >                         break;
> >                 case S_IFREG:
> > -                       success = process_inode_data(dip, TYP_DATA);
> > +                       rval = process_inode_data(dip, TYP_DATA);
> >                         break;
> >                 case S_IFIFO:
> >                 case S_IFCHR:
> >                 case S_IFBLK:
> >                 case S_IFSOCK:
> > -                       success = process_dev_inode(dip);
> > +                       process_dev_inode(dip);
> >                         need_new_crc = 1;
> >                         break;
> >                 default:
> >                         break;
> >         }
> >         nametable_clear();
> > +       if (!rval)
> > +               goto done;
> >
> >         /* copy extended attributes if they exist and forkoff is valid */
> > -       if (success && XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
> > +       if (XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
> >                 attr_data.remote_val_count = 0;
> >                 switch (dip->di_aformat) {
> >                         case XFS_DINODE_FMT_LOCAL:
> > @@ -2425,11 +2424,11 @@ process_inode(
> >                                 break;
> >
> >                         case XFS_DINODE_FMT_EXTENTS:
> > -                               success = process_exinode(dip, TYP_ATTR);
> > +                               rval = process_exinode(dip, TYP_ATTR);
> >                                 break;
> >
> >                         case XFS_DINODE_FMT_BTREE:
> > -                               success = process_btinode(dip, TYP_ATTR);
> > +                               rval = process_btinode(dip, TYP_ATTR);
> >                                 break;
> >                 }
> >                 nametable_clear();
> > @@ -2442,7 +2441,8 @@ done:
> >
> >         if (crc_was_ok && need_new_crc)
> >                 libxfs_dinode_calc_crc(mp, dip);
> > -       return success;
> > +
> > +       return rval;
> >  }
> >
> >  static uint32_t        inodes_copied;
> > @@ -2541,7 +2541,7 @@ copy_inode_chunk(
> >
> >                         /* process_inode handles free inodes, too */
> >                         if (!process_inode(agno, agino + ioff + i, dip,
> > -                           XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
> > +                                       XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
> >                                 goto pop_out;
> >
> >                         inodes_copied++;
> > @@ -2800,7 +2800,7 @@ copy_ino(
> >         xfs_agblock_t           agbno;
> >         xfs_agino_t             agino;
> >         int                     offset;
> > -       int                     rval = 0;
> > +       int                     rval = 1;
> >
> >         if (ino == 0 || ino == NULLFSINO)
> >                 return 1;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-02 19:43           ` Sean Caron
@ 2022-02-02 20:18             ` Sean Caron
  2022-02-02 22:05               ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-02 20:18 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Hi Dave,

It counted up to inode 13555712 and then crashed with the error:

malloc_consolidate(): invalid chunk size

Immediately before that, it printed:

xfs_metadump: invalid block number 4358190/50414336 (1169892770398976)
in bmap extent 0 in symlink ino 98799839421

Best,

Sean

On Wed, Feb 2, 2022 at 2:43 PM Sean Caron <scaron@umich.edu> wrote:
>
> OK, I got it applied with:
>
> git apply --ignore-space-change --ignore-whitespace
>
> I've got it running and so far it looks good. It's gotten past the
> inode number where exited before and now just prints as an
> informational message:
>
> inode xx has unexpected extents
>
> I'll let this run and follow up on this thread if I have any more
> issues with xfs_metadump in particular.
>
> Thanks again,
>
> Sean
>
> On Wed, Feb 2, 2022 at 1:49 PM Sean Caron <scaron@umich.edu> wrote:
> >
> > Hi Dave,
> >
> > Thank you! I tried copying and pasting this into a file and applying it with:
> >
> > patch < thispatchfile
> >
> > against both the 5.14.2 release and xfsprogs-dev Git pull and I'm
> > getting errors from patch.
> >
> > I also tried using "git patch" and it told me the patch does not apply.
> >
> > I tried applying the patch by hand to the 5.14.2 metadump.c but I get
> > a compilation error in process_dev_inode but I'm not sure if that's
> > because I made a mistake or because the patch is expecting other
> > content to be there, that isn't.
> >
> > I'm sorry for the ignorance but how would I make use of this?
> >
> > Thanks,
> >
> > Sean
> >
> > On Wed, Feb 2, 2022 at 2:42 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Sean Caron reported that a metadump terminated after givin gthis
> > > warning:
> > >
> > > xfs_metadump: inode 2216156864 has unexpected extents
> > >
> > > Metadump is supposed to ignore corruptions and continue dumping the
> > > filesystem as best it can. Whilst it warns about many situations
> > > where it can't fully dump structures, it should stop processing that
> > > structure and continue with the next one until the entire filesystem
> > > has been processed.
> > >
> > > Unfortunately, some warning conditions also return an "abort" error
> > > status, causing metadump to abort if that condition is hit. Most of
> > > these abort conditions should really be "continue on next object"
> > > conditions so that the we attempt to dump the rest of the
> > > filesystem.
> > >
> > > Fix the returns for warnings that incorrectly cause aborts
> > > such that the only abort conditions are read errors when
> > > "stop-on-read-error" semantics are specified. Also make the return
> > > values consistently mean abort/continue rather than returning -errno
> > > to mean "stop because read error" and then trying to infer what
> > > the error means in callers without the context it occurred in.
> > >
> > > Reported-by: Sean Caron <scaron@umich.edu>
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >
> > > Sean,
> > >
> > > Can you please apply this patch to your xfsprogs source tree and
> > > rebuild it? This should let metadump continue past the corrupt
> > > inodes it aborted on and run through to completion.
> > >
> > > -Dave.
> > >
> > >  db/metadump.c | 94 +++++++++++++++++++++++++++++------------------------------
> > >  1 file changed, 47 insertions(+), 47 deletions(-)
> > >
> > > diff --git a/db/metadump.c b/db/metadump.c
> > > index 96b098b0eaca..9b32b88a3c50 100644
> > > --- a/db/metadump.c
> > > +++ b/db/metadump.c
> > > @@ -1645,7 +1645,7 @@ process_symlink_block(
> > >  {
> > >         struct bbmap    map;
> > >         char            *link;
> > > -       int             ret = 0;
> > > +       int             rval = 1;
> > >
> > >         push_cur();
> > >         map.nmaps = 1;
> > > @@ -1658,8 +1658,7 @@ process_symlink_block(
> > >
> > >                 print_warning("cannot read %s block %u/%u (%llu)",
> > >                                 typtab[btype].name, agno, agbno, s);
> > > -               if (stop_on_read_error)
> > > -                       ret = -1;
> > > +               rval = !stop_on_read_error;
> > >                 goto out_pop;
> > >         }
> > >         link = iocur_top->data;
> > > @@ -1682,10 +1681,11 @@ process_symlink_block(
> > >         }
> > >
> > >         iocur_top->need_crc = 1;
> > > -       ret = write_buf(iocur_top);
> > > +       if (write_buf(iocur_top))
> > > +               rval = 0;
> > >  out_pop:
> > >         pop_cur();
> > > -       return ret;
> > > +       return rval;
> > >  }
> > >
> > >  #define MAX_REMOTE_VALS                4095
> > > @@ -1843,8 +1843,8 @@ process_single_fsb_objects(
> > >         typnm_t         btype,
> > >         xfs_fileoff_t   last)
> > >  {
> > > +       int             rval = 1;
> > >         char            *dp;
> > > -       int             ret = 0;
> > >         int             i;
> > >
> > >         for (i = 0; i < c; i++) {
> > > @@ -1858,8 +1858,7 @@ process_single_fsb_objects(
> > >
> > >                         print_warning("cannot read %s block %u/%u (%llu)",
> > >                                         typtab[btype].name, agno, agbno, s);
> > > -                       if (stop_on_read_error)
> > > -                               ret = -EIO;
> > > +                       rval = !stop_on_read_error;
> > >                         goto out_pop;
> > >
> > >                 }
> > > @@ -1925,16 +1924,17 @@ process_single_fsb_objects(
> > >                 }
> > >
> > >  write:
> > > -               ret = write_buf(iocur_top);
> > > +               if (write_buf(iocur_top))
> > > +                       rval = 0;
> > >  out_pop:
> > >                 pop_cur();
> > > -               if (ret)
> > > +               if (!rval)
> > >                         break;
> > >                 o++;
> > >                 s++;
> > >         }
> > >
> > > -       return ret;
> > > +       return rval;
> > >  }
> > >
> > >  /*
> > > @@ -1952,7 +1952,7 @@ process_multi_fsb_dir(
> > >         xfs_fileoff_t   last)
> > >  {
> > >         char            *dp;
> > > -       int             ret = 0;
> > > +       int             rval = 1;
> > >
> > >         while (c > 0) {
> > >                 unsigned int    bm_len;
> > > @@ -1978,8 +1978,7 @@ process_multi_fsb_dir(
> > >
> > >                                 print_warning("cannot read %s block %u/%u (%llu)",
> > >                                                 typtab[btype].name, agno, agbno, s);
> > > -                               if (stop_on_read_error)
> > > -                                       ret = -1;
> > > +                               rval = !stop_on_read_error;
> > >                                 goto out_pop;
> > >
> > >                         }
> > > @@ -1998,18 +1997,19 @@ process_multi_fsb_dir(
> > >                         }
> > >                         iocur_top->need_crc = 1;
> > >  write:
> > > -                       ret = write_buf(iocur_top);
> > > +                       if (write_buf(iocur_top))
> > > +                               rval = 0;
> > >  out_pop:
> > >                         pop_cur();
> > >                         mfsb_map.nmaps = 0;
> > > -                       if (ret)
> > > +                       if (!rval)
> > >                                 break;
> > >                 }
> > >                 c -= bm_len;
> > >                 s += bm_len;
> > >         }
> > >
> > > -       return ret;
> > > +       return rval;
> > >  }
> > >
> > >  static bool
> > > @@ -2039,15 +2039,15 @@ process_multi_fsb_objects(
> > >                 return process_symlink_block(o, s, c, btype, last);
> > >         default:
> > >                 print_warning("bad type for multi-fsb object %d", btype);
> > > -               return -EINVAL;
> > > +               return 1;
> > >         }
> > >  }
> > >
> > >  /* inode copy routines */
> > >  static int
> > >  process_bmbt_reclist(
> > > -       xfs_bmbt_rec_t          *rp,
> > > -       int                     numrecs,
> > > +       xfs_bmbt_rec_t          *rp,
> > > +       int                     numrecs,
> > >         typnm_t                 btype)
> > >  {
> > >         int                     i;
> > > @@ -2059,7 +2059,7 @@ process_bmbt_reclist(
> > >         xfs_agnumber_t          agno;
> > >         xfs_agblock_t           agbno;
> > >         bool                    is_multi_fsb = is_multi_fsb_object(mp, btype);
> > > -       int                     error;
> > > +       int                     rval = 1;
> > >
> > >         if (btype == TYP_DATA)
> > >                 return 1;
> > > @@ -2123,16 +2123,16 @@ process_bmbt_reclist(
> > >
> > >                 /* multi-extent blocks require special handling */
> > >                 if (is_multi_fsb)
> > > -                       error = process_multi_fsb_objects(o, s, c, btype,
> > > +                       rval = process_multi_fsb_objects(o, s, c, btype,
> > >                                         last);
> > >                 else
> > > -                       error = process_single_fsb_objects(o, s, c, btype,
> > > +                       rval = process_single_fsb_objects(o, s, c, btype,
> > >                                         last);
> > > -               if (error)
> > > -                       return 0;
> > > +               if (!rval)
> > > +                       break;
> > >         }
> > >
> > > -       return 1;
> > > +       return rval;
> > >  }
> > >
> > >  static int
> > > @@ -2331,7 +2331,7 @@ process_inode_data(
> > >         return 1;
> > >  }
> > >
> > > -static int
> > > +static void
> > >  process_dev_inode(
> > >         xfs_dinode_t            *dip)
> > >  {
> > > @@ -2339,15 +2339,13 @@ process_dev_inode(
> > >                 if (show_warnings)
> > >                         print_warning("inode %llu has unexpected extents",
> > >                                       (unsigned long long)cur_ino);
> > > -               return 0;
> > > -       } else {
> > > -               if (zero_stale_data) {
> > > -                       unsigned int    size = sizeof(xfs_dev_t);
> > > +               return;
> > > +       }
> > > +       if (zero_stale_data) {
> > > +               unsigned int    size = sizeof(xfs_dev_t);
> > >
> > > -                       memset(XFS_DFORK_DPTR(dip) + size, 0,
> > > -                                       XFS_DFORK_DSIZE(dip, mp) - size);
> > > -               }
> > > -               return 1;
> > > +               memset(XFS_DFORK_DPTR(dip) + size, 0,
> > > +                               XFS_DFORK_DSIZE(dip, mp) - size);
> > >         }
> > >  }
> > >
> > > @@ -2365,11 +2363,10 @@ process_inode(
> > >         xfs_dinode_t            *dip,
> > >         bool                    free_inode)
> > >  {
> > > -       int                     success;
> > > +       int                     rval = 1;
> > >         bool                    crc_was_ok = false; /* no recalc by default */
> > >         bool                    need_new_crc = false;
> > >
> > > -       success = 1;
> > >         cur_ino = XFS_AGINO_TO_INO(mp, agno, agino);
> > >
> > >         /* we only care about crc recalculation if we will modify the inode. */
> > > @@ -2390,32 +2387,34 @@ process_inode(
> > >         /* copy appropriate data fork metadata */
> > >         switch (be16_to_cpu(dip->di_mode) & S_IFMT) {
> > >                 case S_IFDIR:
> > > -                       success = process_inode_data(dip, TYP_DIR2);
> > > +                       rval = process_inode_data(dip, TYP_DIR2);
> > >                         if (dip->di_format == XFS_DINODE_FMT_LOCAL)
> > >                                 need_new_crc = 1;
> > >                         break;
> > >                 case S_IFLNK:
> > > -                       success = process_inode_data(dip, TYP_SYMLINK);
> > > +                       rval = process_inode_data(dip, TYP_SYMLINK);
> > >                         if (dip->di_format == XFS_DINODE_FMT_LOCAL)
> > >                                 need_new_crc = 1;
> > >                         break;
> > >                 case S_IFREG:
> > > -                       success = process_inode_data(dip, TYP_DATA);
> > > +                       rval = process_inode_data(dip, TYP_DATA);
> > >                         break;
> > >                 case S_IFIFO:
> > >                 case S_IFCHR:
> > >                 case S_IFBLK:
> > >                 case S_IFSOCK:
> > > -                       success = process_dev_inode(dip);
> > > +                       process_dev_inode(dip);
> > >                         need_new_crc = 1;
> > >                         break;
> > >                 default:
> > >                         break;
> > >         }
> > >         nametable_clear();
> > > +       if (!rval)
> > > +               goto done;
> > >
> > >         /* copy extended attributes if they exist and forkoff is valid */
> > > -       if (success && XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
> > > +       if (XFS_DFORK_DSIZE(dip, mp) < XFS_LITINO(mp)) {
> > >                 attr_data.remote_val_count = 0;
> > >                 switch (dip->di_aformat) {
> > >                         case XFS_DINODE_FMT_LOCAL:
> > > @@ -2425,11 +2424,11 @@ process_inode(
> > >                                 break;
> > >
> > >                         case XFS_DINODE_FMT_EXTENTS:
> > > -                               success = process_exinode(dip, TYP_ATTR);
> > > +                               rval = process_exinode(dip, TYP_ATTR);
> > >                                 break;
> > >
> > >                         case XFS_DINODE_FMT_BTREE:
> > > -                               success = process_btinode(dip, TYP_ATTR);
> > > +                               rval = process_btinode(dip, TYP_ATTR);
> > >                                 break;
> > >                 }
> > >                 nametable_clear();
> > > @@ -2442,7 +2441,8 @@ done:
> > >
> > >         if (crc_was_ok && need_new_crc)
> > >                 libxfs_dinode_calc_crc(mp, dip);
> > > -       return success;
> > > +
> > > +       return rval;
> > >  }
> > >
> > >  static uint32_t        inodes_copied;
> > > @@ -2541,7 +2541,7 @@ copy_inode_chunk(
> > >
> > >                         /* process_inode handles free inodes, too */
> > >                         if (!process_inode(agno, agino + ioff + i, dip,
> > > -                           XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
> > > +                                       XFS_INOBT_IS_FREE_DISK(rp, ioff + i)))
> > >                                 goto pop_out;
> > >
> > >                         inodes_copied++;
> > > @@ -2800,7 +2800,7 @@ copy_ino(
> > >         xfs_agblock_t           agbno;
> > >         xfs_agino_t             agino;
> > >         int                     offset;
> > > -       int                     rval = 0;
> > > +       int                     rval = 1;
> > >
> > >         if (ino == 0 || ino == NULLFSINO)
> > >                 return 1;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-02 20:18             ` Sean Caron
@ 2022-02-02 22:05               ` Dave Chinner
  2022-02-02 23:45                 ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-02 22:05 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Wed, Feb 02, 2022 at 03:18:34PM -0500, Sean Caron wrote:
> Hi Dave,
> 
> It counted up to inode 13555712 and then crashed with the error:
> 
> malloc_consolidate(): invalid chunk size

That sounds like heap corruption or something similar - that's a
much more difficult problem to track down.

Can you either run gdb on the core file it left and grab a stack
trace of where it crashed, or run metadump again from gdb so that it
can capture the crash and get a stack trace that way?

> Immediately before that, it printed:
> 
> xfs_metadump: invalid block number 4358190/50414336 (1169892770398976)
> in bmap extent 0 in symlink ino 98799839421

I don't think that would cause any problems - it just aborts
processing the extent records in that block and moves on to the next
valid one that is found.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-02 22:05               ` Dave Chinner
@ 2022-02-02 23:45                 ` Sean Caron
  2022-02-06 22:34                   ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-02 23:45 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Sure! Please see gdb backtrace output below.

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f289d5c7921 in __GI_abort () at abort.c:79
#2  0x00007f289d610967 in __libc_message (action=action@entry=do_abort,
    fmt=fmt@entry=0x7f289d73db0d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f289d6179da in malloc_printerr (
    str=str@entry=0x7f289d73f368 "malloc_consolidate(): invalid chunk
size") at malloc.c:5342
#4  0x00007f289d617c7e in malloc_consolidate
(av=av@entry=0x7f289d972c40 <main_arena>)
    at malloc.c:4471
#5  0x00007f289d61b968 in _int_malloc (av=av@entry=0x7f289d972c40 <main_arena>,
    bytes=bytes@entry=33328) at malloc.c:3713
#6  0x00007f289d620275 in _int_memalign (bytes=32768, alignment=512,
av=0x7f289d972c40 <main_arena>)
    at malloc.c:4683
#7  _mid_memalign (address=<optimized out>, bytes=32768,
alignment=<optimized out>) at malloc.c:3315
#8  __GI___libc_memalign (alignment=<optimized out>,
bytes=bytes@entry=32768) at malloc.c:3266
#9  0x000055fe39e2d7ec in __initbuf (bp=bp@entry=0x55fe3c781010,
btp=btp@entry=0x55fe3b88a080,
    bno=bno@entry=98799836480, bytes=bytes@entry=32768) at rdwr.c:239
#10 0x000055fe39e2d8a4 in libxfs_initbuf (bytes=32768,
bno=98799836480, btp=0x55fe3b88a080,
    bp=0x55fe3c781010) at rdwr.c:266
#11 libxfs_getbufr (btp=btp@entry=0x55fe3b88a080, blkno=blkno@entry=98799836480,
    bblen=<optimized out>) at rdwr.c:345
#12 0x000055fe39e2d9ab in libxfs_balloc (key=<optimized out>) at rdwr.c:554
#13 0x000055fe39e77bf8 in cache_node_allocate (key=0x7ffef716dcc0,
cache=0x55fe3b879a70)
    at cache.c:305
#14 cache_node_get (cache=0x55fe3b879a70, key=key@entry=0x7ffef716dcc0,
    nodep=nodep@entry=0x7ffef716dc60) at cache.c:451
#15 0x000055fe39e2d496 in __cache_lookup
(key=key@entry=0x7ffef716dcc0, flags=flags@entry=0,
    bpp=bpp@entry=0x7ffef716dcb8) at rdwr.c:388
#16 0x000055fe39e2e91f in libxfs_getbuf_flags (bpp=0x7ffef716dcb8,
flags=0, len=<optimized out>,
    blkno=98799836480, btp=<optimized out>) at rdwr.c:440
#17 libxfs_buf_read_map (btp=0x55fe3b88a080,
map=map@entry=0x7ffef716dd60, nmaps=nmaps@entry=1,
    flags=flags@entry=2, bpp=bpp@entry=0x7ffef716dd58,
    ops=ops@entry=0x55fe3a0adae0 <xfs_inode_buf_ops>) at rdwr.c:655
#18 0x000055fe39e1bc64 in libxfs_buf_read (ops=0x55fe3a0adae0
<xfs_inode_buf_ops>,
    bpp=0x7ffef716dd58, flags=2, numblks=64, blkno=98799836480,
target=<optimized out>)
    at ../libxfs/libxfs_io.h:173
#19 set_cur (type=0x55fe3a0b11a8 <__typtab_crc+840>, blknum=98799836480, len=64,
    ring_flag=ring_flag@entry=0, bbmap=bbmap@entry=0x0) at io.c:550
#20 0x000055fe39e2155f in copy_inode_chunk (rp=0x55fe420f42a8,
agno=<optimized out>)
    at metadump.c:2527
#21 scanfunc_ino (block=<optimized out>, agno=<optimized out>,
agbno=<optimized out>,
    level=<optimized out>, btype=<optimized out>, arg=<optimized out>)
at metadump.c:2604
#22 0x000055fe39e1d7df in scan_btree (agno=46, agbno=1553279,
level=level@entry=1,
    btype=btype@entry=TYP_INOBT, arg=arg@entry=0x7ffef716e030,
    func=func@entry=0x55fe39e210b0 <scanfunc_ino>) at metadump.c:403
#23 0x000055fe39e2133d in scanfunc_ino (block=<optimized out>,
agno=46, agbno=1197680, level=1,
    btype=TYP_INOBT, arg=0x7ffef716e030) at metadump.c:2627
#24 0x000055fe39e1d7df in scan_btree (agno=agno@entry=46,
agbno=1197680, level=2,
    btype=btype@entry=TYP_INOBT, arg=arg@entry=0x7ffef716e030,
    func=func@entry=0x55fe39e210b0 <scanfunc_ino>) at metadump.c:403
#25 0x000055fe39e20eca in copy_inodes (agi=0x55fe41ca9400, agno=46) at
metadump.c:2660
#26 scan_ag (agno=46) at metadump.c:2784
#27 metadump_f (argc=<optimized out>, argv=<optimized out>) at metadump.c:3086
#28 0x000055fe39e030d1 in main (argc=<optimized out>, argv=<optimized
out>) at init.c:190
(gdb)

Best,

Sean


On Wed, Feb 2, 2022 at 5:06 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Feb 02, 2022 at 03:18:34PM -0500, Sean Caron wrote:
> > Hi Dave,
> >
> > It counted up to inode 13555712 and then crashed with the error:
> >
> > malloc_consolidate(): invalid chunk size
>
> That sounds like heap corruption or something similar - that's a
> much more difficult problem to track down.
>
> Can you either run gdb on the core file it left and grab a stack
> trace of where it crashed, or run metadump again from gdb so that it
> can capture the crash and get a stack trace that way?
>
> > Immediately before that, it printed:
> >
> > xfs_metadump: invalid block number 4358190/50414336 (1169892770398976)
> > in bmap extent 0 in symlink ino 98799839421
>
> I don't think that would cause any problems - it just aborts
> processing the extent records in that block and moves on to the next
> valid one that is found.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-02 23:45                 ` Sean Caron
@ 2022-02-06 22:34                   ` Dave Chinner
  2022-02-07 21:42                     ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-06 22:34 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Wed, Feb 02, 2022 at 06:45:34PM -0500, Sean Caron wrote:
> Sure! Please see gdb backtrace output below.
> 
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x00007f289d5c7921 in __GI_abort () at abort.c:79
> #2  0x00007f289d610967 in __libc_message (action=action@entry=do_abort,
>     fmt=fmt@entry=0x7f289d73db0d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
> #3  0x00007f289d6179da in malloc_printerr (
>     str=str@entry=0x7f289d73f368 "malloc_consolidate(): invalid chunk
> size") at malloc.c:5342
> #4  0x00007f289d617c7e in malloc_consolidate
> (av=av@entry=0x7f289d972c40 <main_arena>)
>     at malloc.c:4471
> #5  0x00007f289d61b968 in _int_malloc (av=av@entry=0x7f289d972c40 <main_arena>,
>     bytes=bytes@entry=33328) at malloc.c:3713
> #6  0x00007f289d620275 in _int_memalign (bytes=32768, alignment=512,
> av=0x7f289d972c40 <main_arena>)
>     at malloc.c:4683

Ok, so there's nothing wrong with the memalign() parameters being
passed to glibc, so something has previously caused heap corruption
that is only now being tripped over trying to allocate memory for a
new inode cluster buffer.

I wonder if it was the zeroing of the "unused" part of the inode
data fork area that did this (perhaps a corrupt inode fork offset?)
so maybe it is worth turning off the stale data zeroing function
(-a to copy all metadata blocks) so that it doesn't try to interpret
corrupt metadata to determine what is unused areas or not...

If that does get past this inode, then we'll need to make the
stale region zeroing a lot more careful and avoid zeroing in the
case of badly broken metadata.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-06 22:34                   ` Dave Chinner
@ 2022-02-07 21:42                     ` Sean Caron
  2022-02-07 22:34                       ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-07 21:42 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Hi Dave,

Your suggestion was right on. I ran xfs_metadump with the "-a"
parameter and it was able to finish without any more showstoppers.

Thanks!

Sean


On Sun, Feb 6, 2022 at 5:34 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Feb 02, 2022 at 06:45:34PM -0500, Sean Caron wrote:
> > Sure! Please see gdb backtrace output below.
> >
> > (gdb) bt
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > #1  0x00007f289d5c7921 in __GI_abort () at abort.c:79
> > #2  0x00007f289d610967 in __libc_message (action=action@entry=do_abort,
> >     fmt=fmt@entry=0x7f289d73db0d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
> > #3  0x00007f289d6179da in malloc_printerr (
> >     str=str@entry=0x7f289d73f368 "malloc_consolidate(): invalid chunk
> > size") at malloc.c:5342
> > #4  0x00007f289d617c7e in malloc_consolidate
> > (av=av@entry=0x7f289d972c40 <main_arena>)
> >     at malloc.c:4471
> > #5  0x00007f289d61b968 in _int_malloc (av=av@entry=0x7f289d972c40 <main_arena>,
> >     bytes=bytes@entry=33328) at malloc.c:3713
> > #6  0x00007f289d620275 in _int_memalign (bytes=32768, alignment=512,
> > av=0x7f289d972c40 <main_arena>)
> >     at malloc.c:4683
>
> Ok, so there's nothing wrong with the memalign() parameters being
> passed to glibc, so something has previously caused heap corruption
> that is only now being tripped over trying to allocate memory for a
> new inode cluster buffer.
>
> I wonder if it was the zeroing of the "unused" part of the inode
> data fork area that did this (perhaps a corrupt inode fork offset?)
> so maybe it is worth turning off the stale data zeroing function
> (-a to copy all metadata blocks) so that it doesn't try to interpret
> corrupt metadata to determine what is unused areas or not...
>
> If that does get past this inode, then we'll need to make the
> stale region zeroing a lot more careful and avoid zeroing in the
> case of badly broken metadata.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] metadump: handle corruption errors without aborting
  2022-02-07 21:42                     ` Sean Caron
@ 2022-02-07 22:34                       ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2022-02-07 22:34 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Mon, Feb 07, 2022 at 04:42:28PM -0500, Sean Caron wrote:
> Hi Dave,
> 
> Your suggestion was right on. I ran xfs_metadump with the "-a"
> parameter and it was able to finish without any more showstoppers.

Thanks for the feedback - I'll have a look at tightening up the
constraints on the zeroing code so that it doesn't do stupid stuff
like this...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-01 23:33 ` Dave Chinner
  2022-02-02  1:20   ` Sean Caron
@ 2022-02-07 22:03   ` Sean Caron
  2022-02-07 22:33     ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-07 22:03 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Hi Dave,

OK! With your patch and help on that other thread pertaining to
xfs_metadump I was able to get a full metadata dump of this
filesystem.

I used xfs_mdrestore to set up a sparse image for this volume using my
dumped metadata:

xfs_mdrestore /exports/home/work/md4.metadump /exports/home/work/md4.img

Then set up a loopback device for it and tried to mount.

losetup --show --find /exports/home/work/md4.img
mount /dev/loop0 /mnt

When I do this, I get a "Structure needs cleaning" error and the
following in dmesg:

[523615.874581] XFS (loop0): Corruption warning: Metadata has LSN
(7095:2330880) ahead of current LSN (7095:2328512). Please unmount and
run xfs_repair (>= v4.3) to resolve.
[523615.874637] XFS (loop0): Metadata corruption detected at
xfs_agi_verify+0xef/0x180 [xfs], xfs_agi block 0x10
[523615.874666] XFS (loop0): Unmount and run xfs_repair
[523615.874679] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[523615.874695] 00000000: 58 41 47 49 00 00 00 01 00 00 00 00 0f ff ff
f8  XAGI............
[523615.874713] 00000010: 00 03 ba 40 00 04 ef 7e 00 00 00 02 00 00 00
34  ...@...~.......4
[523615.874732] 00000020: 00 30 09 40 ff ff ff ff ff ff ff ff ff ff ff
ff  .0.@............
[523615.874750] 00000030: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff  ................
[523615.874768] 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff  ................
[523615.874787] 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff  ................
[523615.874806] 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff  ................
[523615.874824] 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff  ................
[523615.874914] XFS (loop0): metadata I/O error in
"xfs_trans_read_buf_map" at daddr 0x10 len 8 error 117
[523615.874998] XFS (loop0): xfs_imap_lookup: xfs_ialloc_read_agi()
returned error -117, agno 0
[523615.876866] XFS (loop0): Failed to read root inode 0x80, error 117

Seems like the next step is to just run xfs_repair (with or without
log zeroing?) on this image and see what shakes out?

Thanks,

Sean

On Tue, Feb 1, 2022 at 6:33 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Feb 01, 2022 at 06:07:18PM -0500, Sean Caron wrote:
> > Hi all,
> >
> > Me again with another not-backed-up XFS filesystem that's in a little
> > trouble. Last time I stopped by to discuss my woes, I was told that I
> > could check in here and get some help reading the tea leaves before I
> > do anything drastic so I'm doing that :)
> >
> > Brief backstory: This is a RAID 60 composed of three 18-drive RAID 6
> > strings of 8 TB disk drives, around 460 TB total capacity. Last week
> > we had a disk fail out of the array. We replaced the disk and the
> > recovery hung at around 70%.
> >
> > We power cycled the machine and enclosure and got the recovery to run
> > to completion. Just as it finished up, the same string dropped another
> > drive.
> >
> > We replaced that drive and started recovery again. It got a fair bit
> > into the recovery, then hung just as did the first drive recovery, at
> > around +/- 70%. We power cycled everything again, then started the
> > recovery. As the recovery was running again, a third disk started to
> > throw read errors.
> >
> > At this point, I decided to just stop trying to recover this array so
> > it's up with two disks down but otherwise assembled. I figured I would
> > just try to mount ro,norecovery and try to salvage as much as possible
> > at this point before going any further.
> >
> > Trying to mount ro,norecovery, I am getting an error:
>
> Seeing as you've only lost redundancy at this point in time, this
> will simply result in trying to mount the filesystem in an
> inconsistent state and so you'll see metadata corruptions because
> the log has no be replayed.
>
> > metadata I/O error in "xfs_trans_read_buf_map at daddr ... len 8 error 74
> > Metadata CRC error detected at xfs_agf_read_verify+0xd0/0xf0 [xfs],
> > xfs_agf block ...
> >
> > I ran an xfs_repair -L -n just to see what it would spit out. It
> > completes within 15-20 minutes (which I feel might be a good sign,
> > from my experience, outcomes are inversely proportional to run time),
> > but the output is implying that it would unlink over 100,000 files
> > (I'm not sure how many total files are on the filesystem, in terms of
> > what proportion of loss this would equate to) and it also says:
> >
> > "Inode allocation btrees are too corrupted, skipping phases 6 and 7"
>
> This is expected because 'xfs_repair -n' does not recover the log.
> Hence you're running checks on an inconsistent fs and repair is
> detecting that the inobts are inconsistent so it can't check the
> directory structure connectivity and link counts sanely.
>
> What you want to do here is take a metadump of the filesystem (it's
> an offline operation) and restore it to a an image file on a
> different system (creates a sparse file so just needs to run on a fs
> that supports file sizes > 16TB). You can then mount the image file
> via "mount -o loop <fs.img> <mntpt>", and it run log recovery on the
> image. Then you can unmount it again and see if the resultant
> filesystem image contains any corruption via 'xfs_repair -n'.
>
> If there's no problems found, then the original filesysetm is all
> good an all you need to do is mount it and everythign should be
> there ready for the migration process to non-failing storage.
>
> If there are warnings/repairs needed then you're probably best to
> post the output of 'xfs_reapir -n' so we can review it and determine
> the best course of action from there.
>
> IOWs, do all the diagnosis/triage of the filesytem state on the
> restored metadump images so that we don't risk further damaging the
> real storage. If we screw up a restored filesystem image, no big
> deal, we can just return it to the original state by restoring it
> from the metadump again to try something different.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-07 22:03   ` XFS disaster recovery Sean Caron
@ 2022-02-07 22:33     ` Dave Chinner
  2022-02-07 22:56       ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-07 22:33 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Mon, Feb 07, 2022 at 05:03:03PM -0500, Sean Caron wrote:
> Hi Dave,
> 
> OK! With your patch and help on that other thread pertaining to
> xfs_metadump I was able to get a full metadata dump of this
> filesystem.
> 
> I used xfs_mdrestore to set up a sparse image for this volume using my
> dumped metadata:
> 
> xfs_mdrestore /exports/home/work/md4.metadump /exports/home/work/md4.img
> 
> Then set up a loopback device for it and tried to mount.
> 
> losetup --show --find /exports/home/work/md4.img
> mount /dev/loop0 /mnt
> 
> When I do this, I get a "Structure needs cleaning" error and the
> following in dmesg:
> 
> [523615.874581] XFS (loop0): Corruption warning: Metadata has LSN
> (7095:2330880) ahead of current LSN (7095:2328512). Please unmount and
> run xfs_repair (>= v4.3) to resolve.
> [523615.874637] XFS (loop0): Metadata corruption detected at
> xfs_agi_verify+0xef/0x180 [xfs], xfs_agi block 0x10
> [523615.874666] XFS (loop0): Unmount and run xfs_repair
> [523615.874679] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> [523615.874695] 00000000: 58 41 47 49 00 00 00 01 00 00 00 00 0f ff ff
> f8  XAGI............
> [523615.874713] 00000010: 00 03 ba 40 00 04 ef 7e 00 00 00 02 00 00 00
> 34  ...@...~.......4
> [523615.874732] 00000020: 00 30 09 40 ff ff ff ff ff ff ff ff ff ff ff
> ff  .0.@............
> [523615.874750] 00000030: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff  ................
> [523615.874768] 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff  ................
> [523615.874787] 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff  ................
> [523615.874806] 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff  ................
> [523615.874824] 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff  ................
> [523615.874914] XFS (loop0): metadata I/O error in
> "xfs_trans_read_buf_map" at daddr 0x10 len 8 error 117
> [523615.874998] XFS (loop0): xfs_imap_lookup: xfs_ialloc_read_agi()
> returned error -117, agno 0
> [523615.876866] XFS (loop0): Failed to read root inode 0x80, error 117

Hmmm - I think this is after log recovery. The nature of the error
(metadata LSN a few blocks larger than the current recovered LSN)
implies that part of the log was lost during device failure/recovery
and hence not recovered when mounting the filesystem.

> Seems like the next step is to just run xfs_repair (with or without
> log zeroing?) on this image and see what shakes out?

Yup.

You may be able to run it on the image file without log zeroing
after the failed mount if there were no pending intents that needed
replay.  But it doesn't matter if you do zero the log at this point,
as it's already replayed everything it can replay back into the
filesystem and it will be as consistent as it's going to get.

Regardless, you are still likely to get a bunch of "unlinked but not
freed" inode warnings and inconsistent free space because the mount
failed between the initial recovery phase and the final recovery
phase that runs intent replay and processes unlinked inodes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-07 22:33     ` Dave Chinner
@ 2022-02-07 22:56       ` Sean Caron
  2022-02-08  1:51         ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-07 22:56 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Got it. I ran an xfs_repair on the simulated metadata filesystem and
it seems like it almost finished but errored out with the message:

fatal error -- name create failed in lost+found (28), filesystem may
be out of space

However there is plenty of space on the underlying volume where the
metadata dump and sparse image are kept. Even if the sparse image was
actually 384 TB as it shows up in "ls", there's 425 TB free on the
volume where it's kept.

I wonder since this was a fairly large filesystem (~500 TB) it's
hitting some kind of limit somewhere with the loopback device?

Any thoughts on how I might be able to move past this? I guess I will
need to xfs_repair this filesystem one way or the other anyway to get
anything off of it, but it would be nice to run the simulation first
just to see what to expect.

Thanks,

Sean

On Mon, Feb 7, 2022 at 5:34 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Feb 07, 2022 at 05:03:03PM -0500, Sean Caron wrote:
> > Hi Dave,
> >
> > OK! With your patch and help on that other thread pertaining to
> > xfs_metadump I was able to get a full metadata dump of this
> > filesystem.
> >
> > I used xfs_mdrestore to set up a sparse image for this volume using my
> > dumped metadata:
> >
> > xfs_mdrestore /exports/home/work/md4.metadump /exports/home/work/md4.img
> >
> > Then set up a loopback device for it and tried to mount.
> >
> > losetup --show --find /exports/home/work/md4.img
> > mount /dev/loop0 /mnt
> >
> > When I do this, I get a "Structure needs cleaning" error and the
> > following in dmesg:
> >
> > [523615.874581] XFS (loop0): Corruption warning: Metadata has LSN
> > (7095:2330880) ahead of current LSN (7095:2328512). Please unmount and
> > run xfs_repair (>= v4.3) to resolve.
> > [523615.874637] XFS (loop0): Metadata corruption detected at
> > xfs_agi_verify+0xef/0x180 [xfs], xfs_agi block 0x10
> > [523615.874666] XFS (loop0): Unmount and run xfs_repair
> > [523615.874679] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > [523615.874695] 00000000: 58 41 47 49 00 00 00 01 00 00 00 00 0f ff ff
> > f8  XAGI............
> > [523615.874713] 00000010: 00 03 ba 40 00 04 ef 7e 00 00 00 02 00 00 00
> > 34  ...@...~.......4
> > [523615.874732] 00000020: 00 30 09 40 ff ff ff ff ff ff ff ff ff ff ff
> > ff  .0.@............
> > [523615.874750] 00000030: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > ff  ................
> > [523615.874768] 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > ff  ................
> > [523615.874787] 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > ff  ................
> > [523615.874806] 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > ff  ................
> > [523615.874824] 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > ff  ................
> > [523615.874914] XFS (loop0): metadata I/O error in
> > "xfs_trans_read_buf_map" at daddr 0x10 len 8 error 117
> > [523615.874998] XFS (loop0): xfs_imap_lookup: xfs_ialloc_read_agi()
> > returned error -117, agno 0
> > [523615.876866] XFS (loop0): Failed to read root inode 0x80, error 117
>
> Hmmm - I think this is after log recovery. The nature of the error
> (metadata LSN a few blocks larger than the current recovered LSN)
> implies that part of the log was lost during device failure/recovery
> and hence not recovered when mounting the filesystem.
>
> > Seems like the next step is to just run xfs_repair (with or without
> > log zeroing?) on this image and see what shakes out?
>
> Yup.
>
> You may be able to run it on the image file without log zeroing
> after the failed mount if there were no pending intents that needed
> replay.  But it doesn't matter if you do zero the log at this point,
> as it's already replayed everything it can replay back into the
> filesystem and it will be as consistent as it's going to get.
>
> Regardless, you are still likely to get a bunch of "unlinked but not
> freed" inode warnings and inconsistent free space because the mount
> failed between the initial recovery phase and the final recovery
> phase that runs intent replay and processes unlinked inodes.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-07 22:56       ` Sean Caron
@ 2022-02-08  1:51         ` Dave Chinner
  2022-02-08 15:46           ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-08  1:51 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Mon, Feb 07, 2022 at 05:56:21PM -0500, Sean Caron wrote:
> Got it. I ran an xfs_repair on the simulated metadata filesystem and
> it seems like it almost finished but errored out with the message:
> 
> fatal error -- name create failed in lost+found (28), filesystem may
> be out of space

Not a lot to go on there - can you send me the entire reapir output?

> However there is plenty of space on the underlying volume where the
> metadata dump and sparse image are kept. Even if the sparse image was
> actually 384 TB as it shows up in "ls", there's 425 TB free on the
> volume where it's kept.

Hmmm - the sparse image should be the same size as the filesystem
itself. If it's only 384TB and not 500TB, then either the metadump
or the restore may not have completed fully.

> I wonder since this was a fairly large filesystem (~500 TB) it's
> hitting some kind of limit somewhere with the loopback device?

Shouldn't - I've used larger loopback files hostsed on XFS
filesystems in the past.

> Any thoughts on how I might be able to move past this? I guess I will
> need to xfs_repair this filesystem one way or the other anyway to get
> anything off of it, but it would be nice to run the simulation first
> just to see what to expect.

I think that first we need to make sure that the metadump and
restore process was completed successfully (did you check the exit
value was zero?). xfs_db can be used to do that:

# xfs_db -r <image-file>
xfs_db> sb 0
xfs_db> p agcount
<val>
xfs_db> agf <val - 1>
xfs_db> p
.....
(should dump the last AGF in the filesystem)

If that works, then the metadump/restore should have been complete,
and the size of the image file should match the size of the
filesystem that was dumped...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-08  1:51         ` Dave Chinner
@ 2022-02-08 15:46           ` Sean Caron
  2022-02-08 20:56             ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-08 15:46 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Hi Dave,

I'm sorry for some imprecise language. The array is around 450 TB raw
and I will refer to it as roughly half a petabyte but factoring out
RAID parity disks and spare disks it should indeed be around 384 TB
formatted.

I checked over the metadump with xfs_db as you suggest and it looks
like it dumped all AGs that the filesystem had.

# ./xfs_db -r /exports/home/work/md4.img
Metadata CRC error detected at 0x555a5c9dc8fe, xfs_agf block 0x4d7fffd948/0x1000
xfs_db: cannot init perag data (74). Continuing anyway.
xfs_db> sb 0
xfs_db> p agcount
agcount = 350
xfs_db> agf 349
xfs_db> p
magicnum = 0x58414746
versionnum = 1
seqno = 349
length = 82676200
bnoroot = 11
cntroot = 9
rmaproot =
refcntroot =
bnolevel = 2
cntlevel = 2
rmaplevel = 0
refcntlevel = 0
rmapblocks = 0
refcntblocks = 0
flfirst = 576
fllast = 581
flcount = 6
freeblks = 12229503
longest = 55037
btreeblks = 545
uuid = 4f39a900-91fa-4c5d-ba34-b56e77720db3
lsn = 0x1bb700239100
crc = 0xd329ccfc (correct)
xfs_db>

And looking at the image the size is roughly what it should be with
the actual size of the filesystem and the size of the sparse file
image looks sane.

# ls -l /exports/home/work/
total 157159048
-rw-r--r-- 1 root root 384068188372992 Feb  7 22:02 md4.img
-rw-r--r-- 1 root root     53912722432 Feb  7 21:59 md4.metadump
-rw-r--r-- 1 root root     53912722432 Feb  7 16:50 md4.metadump.save
# du -sh /exports/home/work/md4.img
50G     /exports/home/work/md4.img
#

I also apologize, in my last email I accidentally ran the copy of
xfs_repair that was installed from the Ubuntu package manager (old -
4.9.0) instead of the copy that I built from the dev tree.

I took advantage of this test environment to just run a bunch of
experiments and see what happened.

I found that if I ran the dev tree xfs_repair with the -P option, I
could get xfs_repair to complete a run. It exits with return code 130
but the resulting loopback image filesystem is mountable and I see
around 27 TB in lost+found which would represent around 9% loss in
terms of what was actually on the filesystem.

Given where we started I think this is acceptable (more than
acceptable, IMO, I was getting to the point of expecting to have to
write off the majority of the filesystem) and it seems like a way
forward to get the majority of the data off this old filesystem.

Is there anything further I should check or any caveats that I should
bear in mind applying this xfs_repair to the real filesystem? Or does
it seem reasonable to go ahead, repair this and start copying off?

Thanks so much for all your help so far,

Sean

On Mon, Feb 7, 2022 at 8:51 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Feb 07, 2022 at 05:56:21PM -0500, Sean Caron wrote:
> > Got it. I ran an xfs_repair on the simulated metadata filesystem and
> > it seems like it almost finished but errored out with the message:
> >
> > fatal error -- name create failed in lost+found (28), filesystem may
> > be out of space
>
> Not a lot to go on there - can you send me the entire reapir output?
>
> > However there is plenty of space on the underlying volume where the
> > metadata dump and sparse image are kept. Even if the sparse image was
> > actually 384 TB as it shows up in "ls", there's 425 TB free on the
> > volume where it's kept.
>
> Hmmm - the sparse image should be the same size as the filesystem
> itself. If it's only 384TB and not 500TB, then either the metadump
> or the restore may not have completed fully.
>
> > I wonder since this was a fairly large filesystem (~500 TB) it's
> > hitting some kind of limit somewhere with the loopback device?
>
> Shouldn't - I've used larger loopback files hostsed on XFS
> filesystems in the past.
>
> > Any thoughts on how I might be able to move past this? I guess I will
> > need to xfs_repair this filesystem one way or the other anyway to get
> > anything off of it, but it would be nice to run the simulation first
> > just to see what to expect.
>
> I think that first we need to make sure that the metadump and
> restore process was completed successfully (did you check the exit
> value was zero?). xfs_db can be used to do that:
>
> # xfs_db -r <image-file>
> xfs_db> sb 0
> xfs_db> p agcount
> <val>
> xfs_db> agf <val - 1>
> xfs_db> p
> .....
> (should dump the last AGF in the filesystem)
>
> If that works, then the metadump/restore should have been complete,
> and the size of the image file should match the size of the
> filesystem that was dumped...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-08 15:46           ` Sean Caron
@ 2022-02-08 20:56             ` Dave Chinner
  2022-02-08 21:24               ` Sean Caron
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-02-08 20:56 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Tue, Feb 08, 2022 at 10:46:45AM -0500, Sean Caron wrote:
> Hi Dave,
> 
> I'm sorry for some imprecise language. The array is around 450 TB raw
> and I will refer to it as roughly half a petabyte but factoring out
> RAID parity disks and spare disks it should indeed be around 384 TB
> formatted.

Ah, OK, looks like it was a complete dump, then.

> I found that if I ran the dev tree xfs_repair with the -P option, I
> could get xfs_repair to complete a run. It exits with return code 130
> but the resulting loopback image filesystem is mountable and I see
> around 27 TB in lost+found which would represent around 9% loss in
> terms of what was actually on the filesystem.

I'm sure that if that much ended up in lost+found, xfs_repair also
threw away a whole load of metadata which means data will have been
lost. And with this much metadata corruption occurring, it tends to
imply that there will be widespread data corruption, too.  Hence I
think it's worth pointing out (maybe unnecessarily!) that xfs_repair
doesn't tell you about (or fix) data corruption - it just rebuilds
the metadata back into a consistent state.

> Given where we started I think this is acceptable (more than
> acceptable, IMO, I was getting to the point of expecting to have to
> write off the majority of the filesystem) and it seems like a way
> forward to get the majority of the data off this old filesystem.

Yes, but you are still going to have to verify the data you can
still access is not corrupted - random offsets within files could
now contain garbage regardless of whether the file was moved to
lost+found or not.

> Is there anything further I should check or any caveats that I should
> bear in mind applying this xfs_repair to the real filesystem? Or does
> it seem reasonable to go ahead, repair this and start copying off?

Seems reasonable to repeat the process on the real filesystem, but
given the caveat about data corruption above, I suspect that the
entire dataset on the filesystem might still end up being a complete
write-off.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS disaster recovery
  2022-02-08 20:56             ` Dave Chinner
@ 2022-02-08 21:24               ` Sean Caron
  0 siblings, 0 replies; 20+ messages in thread
From: Sean Caron @ 2022-02-08 21:24 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Thank you so much for your expert consultation on this, Dave. I'm
definitely cognizant of the fact that there may still be inter-file
corruption while metadata is OK as well. It sounds like we've moved
the situation along as much as we can by finding a set of parameters
where xfs_repair will finish without hanging or crashing via
nondestructive testing with the metadata dump and sparse image and we
end up with a product that can be mounted.

We'll move ahead with repairing the filesystem on-disk and copy off
what we can, with the caveat that users will want to go back and check
file integrity once the copies are finished and there may be
additional data loss that isn't captured in what's in lost+found.

Best,

Sean


On Tue, Feb 8, 2022 at 3:56 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Feb 08, 2022 at 10:46:45AM -0500, Sean Caron wrote:
> > Hi Dave,
> >
> > I'm sorry for some imprecise language. The array is around 450 TB raw
> > and I will refer to it as roughly half a petabyte but factoring out
> > RAID parity disks and spare disks it should indeed be around 384 TB
> > formatted.
>
> Ah, OK, looks like it was a complete dump, then.
>
> > I found that if I ran the dev tree xfs_repair with the -P option, I
> > could get xfs_repair to complete a run. It exits with return code 130
> > but the resulting loopback image filesystem is mountable and I see
> > around 27 TB in lost+found which would represent around 9% loss in
> > terms of what was actually on the filesystem.
>
> I'm sure that if that much ended up in lost+found, xfs_repair also
> threw away a whole load of metadata which means data will have been
> lost. And with this much metadata corruption occurring, it tends to
> imply that there will be widespread data corruption, too.  Hence I
> think it's worth pointing out (maybe unnecessarily!) that xfs_repair
> doesn't tell you about (or fix) data corruption - it just rebuilds
> the metadata back into a consistent state.
>
> > Given where we started I think this is acceptable (more than
> > acceptable, IMO, I was getting to the point of expecting to have to
> > write off the majority of the filesystem) and it seems like a way
> > forward to get the majority of the data off this old filesystem.
>
> Yes, but you are still going to have to verify the data you can
> still access is not corrupted - random offsets within files could
> now contain garbage regardless of whether the file was moved to
> lost+found or not.
>
> > Is there anything further I should check or any caveats that I should
> > bear in mind applying this xfs_repair to the real filesystem? Or does
> > it seem reasonable to go ahead, repair this and start copying off?
>
> Seems reasonable to repeat the process on the real filesystem, but
> given the caveat about data corruption above, I suspect that the
> entire dataset on the filesystem might still end up being a complete
> write-off.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-02-08 22:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-01 23:07 XFS disaster recovery Sean Caron
2022-02-01 23:33 ` Dave Chinner
2022-02-02  1:20   ` Sean Caron
2022-02-02  2:44     ` Dave Chinner
2022-02-02  7:42       ` [PATCH] metadump: handle corruption errors without aborting Dave Chinner
2022-02-02 18:49         ` Sean Caron
2022-02-02 19:43           ` Sean Caron
2022-02-02 20:18             ` Sean Caron
2022-02-02 22:05               ` Dave Chinner
2022-02-02 23:45                 ` Sean Caron
2022-02-06 22:34                   ` Dave Chinner
2022-02-07 21:42                     ` Sean Caron
2022-02-07 22:34                       ` Dave Chinner
2022-02-07 22:03   ` XFS disaster recovery Sean Caron
2022-02-07 22:33     ` Dave Chinner
2022-02-07 22:56       ` Sean Caron
2022-02-08  1:51         ` Dave Chinner
2022-02-08 15:46           ` Sean Caron
2022-02-08 20:56             ` Dave Chinner
2022-02-08 21:24               ` Sean Caron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox