From: Brian Foster <bfoster@redhat.com>
To: Lucas Stach <l.stach@pengutronix.de>
Cc: linux-xfs@vger.kernel.org
Subject: Re: Regular FS shutdown while rsync is running
Date: Mon, 21 Jan 2019 13:11:51 -0500
Message-ID: <20190121181151.GD14281@bfoster>
In-Reply-To: <1548087824.2465.9.camel@pengutronix.de>
On Mon, Jan 21, 2019 at 05:23:44PM +0100, Lucas Stach wrote:
> On Monday, 21.01.2019 at 08:01 -0500, Brian Foster wrote:
> [...]
> > > root@XXX:/mnt/metadump# xfs_repair /dev/XXX
> > > Phase 1 - find and verify superblock...
> > > - reporting progress in intervals of 15 minutes
> > > Phase 2 - using internal log
> > > - zero log...
> > > - scan filesystem freespace and inode maps...
> > > bad magic # 0x49414233 in inobt block 5/7831662
> >
> > Hmm, so this looks like a very isolated corruption. It's complaining
> > about a magic number (internal filesystem value stamped to metadata
> > blocks for verification/sanity purposes) being wrong on a finobt block
> > and nothing else seems to be wrong in the fs. (I guess repair should
> > probably print out 'finobt' here instead of 'inobt,' but that's a
> > separate issue..).
> >
> > The finobt uses a magic value of XFS_FIBT_CRC_MAGIC (0x46494233, 'FIB3')
> > whereas this block has a magic value of 0x49414233. The latter is
> > 'IAB3' (XFS_IBT_CRC_MAGIC), which is the magic value for regular inode
> > btree blocks.
> >
> > > - 23:06:50: scanning filesystem freespace - 33 of 33 allocation groups done
> > > - found root inode chunk
> > > Phase 3 - for each AG...
> >
> > ...
> > > Phase 7 - verify and correct link counts...
> > > - 22:29:19: verify and correct link counts - 33 of 33 allocation groups done
> > > done
> > >
> > > > Would you be able to provide an xfs_metadump image
> > > > of this filesystem for closer inspection?
> > >
> > > This filesystem is really metadata heavy, so an xfs_metadump ended up
> > > being around 400GB of data. Not sure if this is something you would be
> > > willing to look into?
> > >
> >
> > Ok, it might be difficult to get ahold of that. Does the image happen to
> > compress well?
>
> I'll see how well it compresses, but this might take a while...
>
> > In the meantime, given that the corruption appears to be so isolated you
> > might be able to provide enough information from the metadump without
> > having to transfer it. The first thing is probably to take a look at the
> > block in question..
> >
> > First, restore the metadump somewhere:
> >
> > xfs_mdrestore -g ./md.img <destination>
> >
> > You'll need somewhere with enough space for that 400G or so. Note that
> > you can restore to a file and mount/inspect that file as if it were the
> > original fs. I'd also mount/unmount the restored metadump and run an
> > 'xfs_repair -n' on it just to double check that the corruption was
> > captured properly and there are no other issues with the metadump. -n is
> > important here as otherwise repair will fix the metadump and remove the
> > corruption.
> >
> > Next, use xfs_db to dump the contents of the suspect block. Run 'xfs_db
> > <metadump image>' to open the fs and try the following sequence of
> > commands.
> >
> > - Convert to a global fsb: 'convert agno 5 agbno 7831662 fsb'
> > - Jump to the fsb: 'fsb <output of prev cmd>'
> > - Set the block type: 'type finobt'
> > - Print the block: 'print'
> >
> > ... and copy/paste the output.
>
> So for the moment, here's the output of the above sequence.
>
> xfs_db> convert agno 5 agbno 7831662 fsb
> 0x5077806e (1350008942)
> xfs_db> fsb 0x5077806e
> xfs_db> type finobt
> xfs_db> print
> magic = 0x49414233
> level = 1
> numrecs = 335
> leftsib = 7810856
> rightsib = null
> bno = 7387612016
> lsn = 0x6671003d9700
> uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> owner = 5
> crc = 0xe12b19b2 (correct)
As expected, we have the inobt magic. Interesting that this is a fairly
full intermediate (level > 0) node. There is no right sibling, which
means we're at the far right end of the tree. I wouldn't mind poking
around a bit more at the tree, but that might be easier with access to
the metadump. I also think that xfs_repair would have complained were
something more significant wrong with the tree.
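
(For reference, the left sibling could be spot checked with the same
xfs_db sequence as above, plugging in the leftsib value from this dump:

xfs_db> convert agno 5 agbno 7810856 fsb
xfs_db> fsb <output of prev cmd>
xfs_db> type finobt
xfs_db> print

... presumably that one still carries the expected 'FIB3' magic, given
that repair only flagged the one block.)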
Hmm, I wonder if the (lightly tested) diff below would help us catch
anything. It basically just splits up the currently combined inobt and
finobt I/O verifiers to expect the appropriate magic number (rather than
accepting either magic for both trees). Could you give that a try?
Unless we're doing something like using the wrong type of cursor for a
particular tree, I'd expect this to catch wherever we happen to put a
bad magic on disk. Note that this assumes the underlying filesystem has
already been repaired; the idea is to detect the next time an on-disk
corruption is introduced.
You'll also need to turn up the XFS error level to make sure this prints
out a stack trace if/when a verifier failure triggers:
echo 5 > /proc/sys/fs/xfs/error_level
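
(To double check the setting took:

cat /proc/sys/fs/xfs/error_level

The default is 3; 5 or higher should include the stack trace when a
verifier trips.)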
I guess we also shouldn't rule out hardware issues or whatnot. I did
notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
distro kernel? Has it been modified from upstream in any way? If so, I'd
strongly suggest trying to confirm whether this is reproducible with an
upstream kernel.
Brian
--- 8< ---
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 9b25e7a0df47..c493a37730cb 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -272,13 +272,11 @@ xfs_inobt_verify(
 	 */
 	switch (block->bb_magic) {
 	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
-	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
 		fa = xfs_btree_sblock_v5hdr_verify(bp);
 		if (fa)
 			return fa;
 		/* fall through */
 	case cpu_to_be32(XFS_IBT_MAGIC):
-	case cpu_to_be32(XFS_FIBT_MAGIC):
 		break;
 	default:
 		return __this_address;
@@ -333,6 +331,85 @@ const struct xfs_buf_ops xfs_inobt_buf_ops = {
 	.verify_struct = xfs_inobt_verify,
 };
 
+static xfs_failaddr_t
+xfs_finobt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	xfs_failaddr_t		fa;
+	unsigned int		level;
+
+	/*
+	 * During growfs operations, we can't verify the exact owner as the
+	 * perag is not fully initialised and hence not attached to the buffer.
+	 *
+	 * Similarly, during log recovery we will have a perag structure
+	 * attached, but the agi information will not yet have been initialised
+	 * from the on disk AGI. We don't currently use any of this information,
+	 * but beware of the landmine (i.e. need to check pag->pagi_init) if we
+	 * ever do.
+	 */
+	switch (block->bb_magic) {
+	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
+		fa = xfs_btree_sblock_v5hdr_verify(bp);
+		if (fa)
+			return fa;
+		/* fall through */
+	case cpu_to_be32(XFS_FIBT_MAGIC):
+		break;
+	default:
+		return __this_address;
+	}
+
+	/* level verification */
+	level = be16_to_cpu(block->bb_level);
+	if (level >= mp->m_in_maxlevels)
+		return __this_address;
+
+	return xfs_btree_sblock_verify(bp, mp->m_inobt_mxr[level != 0]);
+}
+
+static void
+xfs_finobt_read_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa;
+
+	if (!xfs_btree_sblock_verify_crc(bp))
+		xfs_verifier_error(bp, -EFSBADCRC, __this_address);
+	else {
+		fa = xfs_finobt_verify(bp);
+		if (fa)
+			xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+	}
+
+	if (bp->b_error)
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+}
+
+static void
+xfs_finobt_write_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa;
+
+	fa = xfs_finobt_verify(bp);
+	if (fa) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+		return;
+	}
+	xfs_btree_sblock_calc_crc(bp);
+}
+
+const struct xfs_buf_ops xfs_finobt_buf_ops = {
+	.name = "xfs_finobt",
+	.verify_read = xfs_finobt_read_verify,
+	.verify_write = xfs_finobt_write_verify,
+	.verify_struct = xfs_finobt_verify,
+};
+
 STATIC int
 xfs_inobt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -389,7 +466,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
-	.buf_ops		= &xfs_inobt_buf_ops,
+	.buf_ops		= &xfs_finobt_buf_ops,
 	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,