From: Dave Chinner <david@fromorbit.com>
To: Olaf Hering <olaf@aepfle.de>
Cc: xfs@oss.sgi.com
Subject: Re: BUG in xfs_trans_binval
Date: Wed, 30 Mar 2016 10:54:15 +1100
Message-ID: <20160329235415.GF30721@dastard>
In-Reply-To: <20160329171553.GA17885@aepfle.de>

On Tue, Mar 29, 2016 at 07:15:53PM +0200, Olaf Hering wrote:
> During receiving a backup stream (netcal -l 12345 | tar xf -) the host
> crashed and rebooted, no idea why.

That's the likely cause of your problems, because....

> After reboot I tried to remove the received directory (rm -rf dir) and
> got this BUG:
> 
> "_xfs_buf_find: Block out of range: block 0x81ffff3f8, EOFS 0x7fffd000"

This will be caused by a corrupt block....

> Kernel is 4.5.0 from openSUSE Tumbleweed.
> dmesg is attached, I just realized it has the backtrace.
> 
> [    1.883626] sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

So, write cache is enabled on the drive.

> [   20.083397] XFS (sdb1): Mounting V5 Filesystem
> [   20.291900] XFS (sdb1): Starting recovery (logdev: internal)
> [   25.285846] XFS (sdb1): Bad dir block magic!
> [   26.448027] XFS (sdb1): Ending recovery (logdev: internal)

And that's a big clue that something went badly wrong at the storage
level. Basically, after recovering a buffer from the log, it had an
invalid magic number for the type of buffer information being
recovered. In this case, the journal entry being recovered was for a
directory block in "single block" format. The magic number found in
the block after recovery of the transaction was not that of a
directory block in single block format.
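
To illustrate the kind of check described above, here is a minimal
sketch in Python. It assumes the single-block directory magic values
from the kernel's XFS format headers (XFS_DIR2_BLOCK_MAGIC for V4,
XFS_DIR3_BLOCK_MAGIC for V5) and that the magic is the first
big-endian 32-bit word of the block. The helper name is hypothetical,
and this is not the kernel's recovery code.

```python
import struct

# Assumed values, as defined in the kernel's XFS on-disk format headers.
XFS_DIR2_BLOCK_MAGIC = 0x58443242  # "XD2B": V4 single-block directory
XFS_DIR3_BLOCK_MAGIC = 0x58444233  # "XDB3": V5 single-block directory


def is_single_block_dir(buf: bytes) -> bool:
    """Return True if buf starts with a single-block directory magic.

    Hypothetical helper: it only inspects the first four bytes, which
    XFS stores big-endian on disk.
    """
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from(">I", buf, 0)
    return magic in (XFS_DIR2_BLOCK_MAGIC, XFS_DIR3_BLOCK_MAGIC)


# A block that was never initialised on disk (all zeroes here) fails
# the check even after the logged changes are replayed on top of it.
stale_block = bytes(4096)
good_block = b"XDB3" + bytes(4092)
```

A block whose initialisation never reached the disk would look like
stale_block to this check, which is the situation the "Bad dir block
magic!" warning points at.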

The only way this can happen is if there is an underlying corruption
in the block prior to recovery starting. Given that the system
crashed and rebooted, it's entirely possible that initialisation of
the block never made it to persistent storage, or it was corrupted
on the way to disk by whatever caused the crash and reboot.

> [  130.489414] XFS (sdb1): _xfs_buf_find: Block out of range: block 0x81ffff3f8, EOFS 0x7fffd000 
> [  130.494271] XFS (sdb1): _xfs_buf_find: Block out of range: block 0x81ffff3f8, EOFS 0x7fffd000 

These occur because a bad sector address is being detected.
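
The range check behind that message can be sketched as follows, using
the two values from the log line. This is an illustrative Python
version with a hypothetical function name, assuming the check is a
simple bounds comparison against the end of the filesystem. It is not
the kernel source.

```python
def block_out_of_range(blkno: int, eofs: int) -> bool:
    """Return True if a disk address falls outside [0, eofs)."""
    return blkno < 0 or blkno >= eofs


# Values taken from the reported log message.
BLOCK = 0x81FFFF3F8
EOFS = 0x7FFFD000

if block_out_of_range(BLOCK, EOFS):
    print(f"Block out of range: block 0x{BLOCK:x}, EOFS 0x{EOFS:x}")
```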

> [  130.489707]  [<ffffffff81395921>] dump_stack+0x63/0x82
> [  130.489715]  [<ffffffff8107d912>] warn_slowpath_common+0x82/0xc0
> [  130.489722]  [<ffffffff8107da0a>] warn_slowpath_null+0x1a/0x20
> [  130.489766]  [<ffffffffa0941b80>] _xfs_buf_find+0x350/0x3b0 [xfs]
> [  130.489824]  [<ffffffffa0941c0a>] xfs_buf_get_map+0x2a/0x2c0 [xfs]
> [  130.489876]  [<ffffffffa097026a>] xfs_trans_get_buf_map+0x11a/0x1c0 [xfs]
> [  130.489923]  [<ffffffffa0919040>] xfs_btree_get_bufs+0x50/0x60 [xfs]
> [  130.489961]  [<ffffffffa090283f>] xfs_alloc_fix_freelist+0x20f/0x3c0 [xfs]

And this location generating the out-of-range disk address indicates
that there may be a bad block number on the AGFL.

Given that none of these have triggered verifier failures on read
from disk, it makes me think that whatever has gone wrong in this
filesystem occurred before the crash and reboot, and smells somewhat
of memory corruption and/or misdirected writes.

Given that xfs_repair didn't warn about blocks on the AGFL being out
of range (which is checked), nor any other metadata linkage in the
filesystem pointing to a block out of range, nor did it warn about
directory blocks being corrupted or having invalid formats, this
is starting to look like an in-memory problem. Perhaps there is
still memory corruption occurring - the block out of range has a
single high bit set that puts it out of range, i.e. when we mask off
the single bit that is out of range, 0x1ffff3f8 is a valid sector
address.
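
The single-bit observation can be checked with a few lines of Python.
This is only a sketch: the function name is hypothetical, and "valid"
here just means "below EOFS", using the two values from the log.

```python
def single_bit_fixes(blkno: int, eofs: int, width: int = 64):
    """List (bit, candidate) pairs where flipping one bit of blkno
    yields an address inside [0, eofs)."""
    fixes = []
    for bit in range(width):
        candidate = blkno ^ (1 << bit)
        if 0 <= candidate < eofs:
            fixes.append((bit, candidate))
    return fixes


BLOCK = 0x81FFFF3F8
EOFS = 0x7FFFD000

for bit, candidate in single_bit_fixes(BLOCK, EOFS):
    print(f"flip bit {bit}: 0x{candidate:x}")
# prints "flip bit 35: 0x1ffff3f8"
```

Bit 35 is the only single-bit flip that brings the address back in
range, which is consistent with a bit flipped in memory rather than a
wholly garbage block number.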

Can you run a memory tester on the machine?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

