From: Eric Sandeen <sandeen@sandeen.net>
To: John Quigley <jquigley@jquigley.com>
Cc: XFS Development <xfs@oss.sgi.com>
Subject: Re: File system corruption
Date: Thu, 16 Jul 2009 14:20:57 -0500
Message-ID: <4A5F7D99.4010503@sandeen.net>
In-Reply-To: <4A5F6C8C.609@jquigley.com>
John Quigley wrote:
> Hey Folks:
>
> I'm periodically encountering an issue with XFS that you might perhaps be interested in. The environment in which this manifests itself is on a CentOS Linux machine (custom 2.6.28.7 kernel), which is serving the XFS mount point in question with the standard Linux nfsd. The XFS file system lives on an LVM device in a striping configuration (2 wide stripe), with two iSCSI volumes acting as the constituent physical volumes. This configuration is somewhat baroque, I know.
>
> I'm experiencing periodic file system corruption, which manifests in the XFS file system going offline and refusing subsequent mounts. The only way to recover from this has been to run xfs_repair -L, which has resulted in data loss on each occasion, as expected.
The log corruption might be related to data reordering somewhere along
your IO path, though I wouldn't swear to it. But often when write
caches are on, barriers are off, and power is lost, this sort of thing
shows up.
> Now, here's what I witness in the system logs:
>
> <snip>
> kernel: XFS: bad magic number
> kernel: XFS: SB validate failed
That's the first error?
> kernel: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> kernel: Filesystem "dm-0": XFS internal error xfs_ialloc_read_agi at line 1408 of file fs/xfs/xfs_ialloc.c. Caller 0xffffffff8118711a
This means that after you read an agi, it failed a sanity test:
1403 be32_to_cpu(agi->agi_magicnum) == XFS_AGI_MAGIC &&
1404 XFS_AGI_GOOD_VERSION(be32_to_cpu(agi->agi_versionnum));
bad magic number, etc. The "00 00 00 00 ..." is the contents of the
buffer that it thought was the agi, containing all that wonderful magic
- but it's all 0s.
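
For anyone following along, here is a rough user-space sketch of that sanity test. It is not the kernel code: the helper names are made up for illustration, and it assumes the AGI magic is the big-endian ASCII string "XAGI" at offset 0 followed by a 32-bit version of 1 at offset 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed on-disk constants: "XAGI" magic, version 1. */
#define SKETCH_AGI_MAGIC   0x58414749u
#define SKETCH_AGI_VERSION 1u

/* Decode a big-endian 32-bit value, like be32_to_cpu() on a raw buffer. */
uint32_t sketch_read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return 1 if the buffer starts with a plausible AGI header, else 0. */
int sketch_agi_header_ok(const unsigned char *buf, size_t len)
{
    if (len < 8)
        return 0;
    return sketch_read_be32(buf) == SKETCH_AGI_MAGIC &&
           sketch_read_be32(buf + 4) == SKETCH_AGI_VERSION;
}
```

A buffer of all zeros, like the one in the log above, fails the very first comparison, which is why the error fires no matter what the rest of the sector holds.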
...
> The resultant stack trace coming from "XFS internal error xfs_ialloc_read_agi" repeats itself numerous times, at which point, the following is seen:
>
> <snip>
>
> kernel: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> kernel: Filesystem "dm-0": XFS internal error xfs_alloc_read_agf at line 2194 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8115cf09
Similar, but bad info on the AGF:
2184 agf_ok =
2185 be32_to_cpu(agf->agf_magicnum) == XFS_AGF_MAGIC &&
2186 XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
2187 be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
2188 be32_to_cpu(agf->agf_flfirst) < XFS_AGFL_SIZE(mp) &&
2189 be32_to_cpu(agf->agf_fllast) < XFS_AGFL_SIZE(mp) &&
2190 be32_to_cpu(agf->agf_flcount) <= XFS_AGFL_SIZE(mp);
Again w/ the zeros ...
>
> kernel: Filesystem "dm-0": XFS internal error xfs_trans_cancel at line 1164 of file fs/xfs/xfs_trans.c. Caller 0xffffffff811a9411
...
and then the fs tried to back out of a dirty transaction, which it can't
do, but that's secondary.
> kernel: xfs_force_shutdown(dm-0,0x8) called from line 1165 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff811a348e
> kernel: Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0
> kernel: Please umount the filesystem, and rectify the problem(s)
> kernel: nfsd: non-standard errno: -117
117 EFSCORRUPTED IIRC?
> kernel: Filesystem "dm-0": xfs_log_force: error 5 returned.
EIO
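
A small sketch for decoding these two values, in case it helps when reading similar logs. The numbers are hard-coded on the assumption that this is Linux, where 5 is EIO and 117 is EUCLEAN, which XFS uses as EFSCORRUPTED. The function name is hypothetical.

```c
/* Assumed Linux values; the kernel reports them negated. */
#define SKETCH_EIO          5
#define SKETCH_EFSCORRUPTED 117  /* EUCLEAN on Linux */

/* Map a negative kernel-style error code to a short name. */
const char *sketch_kernel_errno_name(int err)
{
    switch (-err) {
    case SKETCH_EIO:
        return "EIO";
    case SKETCH_EFSCORRUPTED:
        return "EFSCORRUPTED";
    default:
        return "unknown";
    }
}
```

With that, the nfsd message's -117 decodes to EFSCORRUPTED, and the "error 5" from xfs_log_force (passed here as -5) decodes to EIO.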
> </snip>
>
> I'm somewhat at a loss with this one - it's been experienced on a customer's installation, so I don't have ready access to the machine. All internal attempts to reproduce it with identical hardware/software configurations have been unfruitful. I'm concerned about the custom kernel, and may attempt to downgrade to the stock CentOS 5.3 kernel (2.6.18, if I remember correctly).
>
> Any insight would be hugely appreciated, and of course tell me how I can help further. Thanks so much.
I'm happy to blame the storage here, given the buffers full of 0s ...
you could modify the messages to print the block nrs in question and go
back directly to the storage, read it, and see what's there.
Were there no iscsi or other assorted messages before all this?
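
A minimal sketch of going back to the storage and looking, under several assumptions: the function names are invented for illustration, sectors are 512 bytes, and each allocation group starts with the superblock, AGF, AGI and AGFL in its first four sectors. The agblocks and blocksize arguments would come from the agsize and bsize values that xfs_info reports, and offsets are relative to the filesystem's block device (dm-0 here), not the underlying iSCSI volumes.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed sector index of each header within an allocation group. */
enum sketch_ag_header { SK_SB = 0, SK_AGF = 1, SK_AGI = 2, SK_AGFL = 3 };

/* Byte offset of an AG header on the filesystem's block device. */
uint64_t sketch_ag_header_offset(uint64_t agno, uint64_t agblocks,
                                 uint64_t blocksize, uint64_t sectsize,
                                 enum sketch_ag_header which)
{
    return agno * agblocks * blocksize + (uint64_t)which * sectsize;
}

/* Return 1 if every byte in the buffer is zero, else 0. */
int sketch_buffer_is_all_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/* Read one 512-byte sector at offset; 1 = all zero, 0 = has data, -1 = error. */
int sketch_sector_is_zero(const char *path, uint64_t offset)
{
    unsigned char buf[512];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, sizeof buf, (off_t)offset);
    close(fd);
    if (n != (ssize_t)sizeof buf)
        return -1;
    return sketch_buffer_is_all_zero(buf, sizeof buf);
}
```

Running the same read against the device and against each iSCSI volume (with the stripe mapping worked out separately) would show whether the zeros are really on disk or only appear somewhere along the IO path.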
-Eric
> John Quigley
> jquigley.com