From: Dave Chinner <david@fromorbit.com>
To: Emmanuel Florac <eflorac@intellique.com>
Cc: xfs@oss.sgi.com
Subject: Re: easily reproducible filesystem crash on rebuilding array [XFS bug in my book]
Date: Wed, 17 Dec 2014 07:04:10 +1100
Message-ID: <20141216200410.GC15665@dastard>
In-Reply-To: <20141216120821.587cf104@harpe.intellique.com>
On Tue, Dec 16, 2014 at 12:08:21PM +0100, Emmanuel Florac wrote:
> On Mon, 15 Dec 2014 13:25:00 +0100
> Emmanuel Florac <eflorac@intellique.com> wrote:
>
> > Reading the source I see that the error occurred in xfs_buf_read_map, I
> > suppose it's when xfsbufd tries to scan dirty metadata? This is a read
> > error, so it could very well be a simple IO starvation at the
> > controller level (as the controller probably gives priority to
> > whatever writes are pending over reads).
> >
> > Maybe setting xfsbufd_centisecs to the max could help here? Trying
> > right away... Any advice welcome.
> >
>
> Alas, same thing;
>
> dmesg output:
>
>
> ffff8800df1f5020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> XFS (dm-0): Metadata corruption detected at xfs_inode_buf_verify+0x6c/0xb0, block 0xeffffff40
> XFS (dm-0): Unmount and run xfs_repair
> XFS (dm-0): First 64 bytes of corrupted metadata buffer:
> ffff8800df1f5000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> XFS (dm-0): Metadata corruption detected at xfs_inode_buf_verify+0x6c/0xb0, block 0xeffffff40
> XFS (dm-0): Unmount and run xfs_repair
So the underlying storage stack is returning zeros without any IO
errors here. The failing read is probably part of a lookup operation
with no dirty transaction involved, so it simply fails and returns
the error to userspace. Every one of these messages is a separate
read IO, but they are all returning zeros.
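
To illustrate why a zeroed buffer is reported as corruption rather than
as an IO error, here is a minimal sketch of the kind of check an inode
buffer verifier performs. It is not the kernel code: it only tests the
on-disk inode magic (0x494e, ASCII "IN"), while the real verifier checks
more than that. The 256-byte inode size and the helper name
first_bad_inode are assumptions for the example; the 8192-byte buffer
length comes from "numblks 16" in the log, at 512 bytes per basic block.

```python
import struct

XFS_DINODE_MAGIC = 0x494E  # on-disk inode magic, ASCII "IN", big-endian
BBSIZE = 512               # XFS basic block size in bytes


def first_bad_inode(buf, inode_size=256):
    """Return the index of the first inode slot with a wrong magic, or None.

    inode_size=256 is an assumption for this sketch; the real value is a
    mkfs-time property of the filesystem.
    """
    for slot in range(len(buf) // inode_size):
        # The magic is the first two bytes of each on-disk inode.
        (magic,) = struct.unpack_from(">H", buf, slot * inode_size)
        if magic != XFS_DINODE_MAGIC:
            return slot
    return None


# A buffer of 16 basic blocks, all zeros, as in the dumps above.
zeroed = bytes(16 * BBSIZE)

# A synthetic buffer where every slot starts with the expected magic.
good = b"".join(b"IN" + bytes(254) for _ in range(32))

print(first_bad_inode(zeroed))  # slot 0 already fails the magic check
print(first_bad_inode(good))    # None: every slot carries the magic
```

The read itself completed without error; it is the content check that
fails, which is why the message says "Metadata corruption detected"
and not "I/O error".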
....
> XFS (dm-0): First 64 bytes of corrupted metadata buffer:
> ffff8800df1f5000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ffff8800df1f5030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> XFS (dm-0): metadata I/O error: block 0xeffffff40 ("xfs_trans_read_buf_map") error 117 numblks 16
> XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 383 of file fs/xfs/xfs_trans_buf.c. Return address = 0xffffffff8125cc90
> XFS (dm-0): I/O Error Detected. Shutting down filesystem
> XFS (dm-0): Please umount the filesystem and rectify the problem(s)
> XFS (dm-0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
> XFS (dm-0): xfs_log_force: error 5 returned.
> XFS (dm-0): xfs_log_force: error 5 returned.
And here the same read error has occurred in a dirty transaction,
and so the filesystem shut down.
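
For reading the numeric codes in these messages, a small sketch that
pulls the error number out of a log line and names it. On Linux, 117 is
EUCLEAN ("Structure needs cleaning"), which XFS uses as EFSCORRUPTED,
and 5 is EIO. The table is hard-coded so the example does not depend on
the platform's errno module; the function name decode_xfs_error is made
up for this example.

```python
import re

# Linux errno values seen in the log above (hard-coded for portability).
ERRNO_NAMES = {
    5: "EIO",
    117: "EUCLEAN (EFSCORRUPTED in XFS)",
}


def decode_xfs_error(line):
    """Return (code, name) for the first 'error <n>' in a log line, or None."""
    match = re.search(r"error (\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, ERRNO_NAMES.get(code, "unknown")


lines = [
    'XFS (dm-0): metadata I/O error: block 0xeffffff40 '
    '("xfs_trans_read_buf_map") error 117 numblks 16',
    "XFS (dm-0): xfs_log_force: error 5 returned.",
]
for line in lines:
    print(decode_xfs_error(line))
```

So the read failed with a corruption error (117), and the later
xfs_log_force messages (5) are just the already shut down filesystem
refusing further IO.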
> There is no IO error at the RAID controller level, at all. The buffer
> hasn't been overwritten with zeros, I'm pretty sure it actually timed
> out and just read nothing. This is not a case for an IO error IMO, a
> retry would almost certainly succeed; after all the problem occurred
> after more than 8 hours of continuous heavy read/write activity.
What you see above is a persistent corruption. It has been reported
several times because XFS errored out and then re-read the same
block (0xeffffff40) from disk each time, and every read returned
zeros. A retry would most certainly return zeros again.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs