Re: Corrupt XFS -Filesystems on new Hardware and Kernel

public inbox for linux-xfs@vger.kernel.org

* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
       [not found] <46094344.4090007@j-o-a.de>
@ 2007-03-28 11:31 ` David Chinner
  2007-03-28 12:42   ` Oliver Joa
  2007-04-11  7:36   ` Oliver Joa
  0 siblings, 2 replies; 11+ messages in thread
From: David Chinner @ 2007-03-28 11:31 UTC (permalink / raw)
  To: Oliver Joa; +Cc: linux-kernel, xfs-oss

On Tue, Mar 27, 2007 at 06:16:04PM +0200, Oliver Joa wrote:
> Hi,
> 
> For some weeks I have been trying to get my new hardware running:
> 
> Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
> Intel DP965LT Mainboard
> Seagate SATA-Harddisk in AHCI-Mode
> 
> After some hours of running or after some heavy file-i/o
> (find / | cpio -padm /test) I always get a corrupted
> XFS-filesystem.

What is the corruption message in the log from XFS?
Can you please post that? Without it we really can't help you.

Also, please check to see if there are any I/O errors
in the log around the time the corruption message appears.
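
One way to do that check mechanically is sketched below, assuming the
log has the usual "[seconds.micros]" dmesg prefix. The pattern list
(SUSPECT) and the 60 second window are assumptions made for
illustration, and the first sample line is made up; only the XFS lines
come from this thread.

```python
import re

# Matches the "[seconds.micros]" prefix that dmesg puts on each line.
TS = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Hypothetical patterns for storage trouble; adjust them to your own logs.
SUSPECT = re.compile(r"I/O error|reset", re.IGNORECASE)


def lines_near_xfs_error(log_text, window=60.0):
    """Return suspect non-XFS lines logged at most `window` seconds
    before any "XFS internal error" line."""
    entries = []
    for raw in log_text.splitlines():
        m = TS.match(raw)
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    error_times = [t for t, msg in entries if "XFS internal error" in msg]
    hits = []
    for t, msg in entries:
        if "XFS" in msg or not SUSPECT.search(msg):
            continue
        if any(0.0 <= e - t <= window for e in error_times):
            hits.append((t, msg))
    return hits


# The first line is a made-up placeholder, not real kernel output.
sample = """\
[15400.100000] EXAMPLE ONLY: I/O error on some device
[15442.935941] Filesystem "sda3": XFS internal error xfs_iformat(6) at
[15442.936003]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
"""

for t, msg in lines_near_xfs_error(sample):
    print(t, msg)
```

Feeding it the saved output of dmesg (or the kernel log file) gives a
short list of candidates to post alongside the XFS message.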

> I have already tried the following kernels:
> 2.6.19.2
> 2.6.19.7
> 2.6.20.2
> 2.6.20.4
> 
> After xfs_repair I get damaged files in lost+found.
> 
> I read in newsgroups that the write cache of the hard disk
> should be turned off, but the messages are all very old.

That's really only an issue for crashes, not runtime failures.

> I also often get a sata-bus-reset with the kernels 2.6.19.2
> and 2.6.20.2.

I/O errors. That's what we need to isolate first. The reports in
your logs are the first thing we need to see.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 11:31 ` Corrupt XFS -Filesystems on new Hardware and Kernel David Chinner
@ 2007-03-28 12:42   ` Oliver Joa
  2007-03-28 14:56     ` Eric Sandeen
  2007-03-28 23:46     ` David Chinner
  2007-04-11  7:36   ` Oliver Joa
  1 sibling, 2 replies; 11+ messages in thread
From: Oliver Joa @ 2007-03-28 12:42 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-kernel, xfs-oss

Hi,

David Chinner wrote:

[...]

> What is the corruption message in the log from XFS?
> Can you please post that? Without it we really can't help you.
> 
> Also, please check to see if there are any I/O errors
> in the log around the time the corruption message appears.

Ok, here is a test:

test:/# find / -xdev | cpio -padm /test/
cpio: /usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt: 
Structure needs cleaning
3648371 blocks
test:/#

test:/home/olli# uname -a
Linux test 2.6.20.4-majestix-1 #1 SMP PREEMPT Tue Mar 27 12:15:41 CEST 
2007 i686 GNU/Linux

dmesg gives the following:
[15442.935941] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
[15442.936003]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
[15442.936039]  [<c0211f94>] xfs_iget+0x2e4/0x714
[15442.936071]  [<c0211f94>] xfs_iget+0x2e4/0x714
[15442.936101]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
[15442.936135]  [<c022cc6b>] xfs_lookup+0x52/0x78
[15442.936167]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
[15442.936201]  [<c0153e6d>] do_lookup+0xa3/0x140
[15442.936234]  [<c015578e>] __link_path_walk+0x73d/0xb5e
[15442.936278]  [<c0211655>] xfs_iunlock+0x51/0x6d
[15442.936309]  [<c0155bf3>] link_path_walk+0x44/0xb3
[15442.936342]  [<c0155efe>] do_path_lookup+0x176/0x191
[15442.936373]  [<c0154ef8>] getname+0x59/0x8f
[15442.936402]  [<c01566b8>] __user_walk_fd+0x2f/0x45
[15442.936431]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
[15442.936461]  [<c0150a75>] sys_lstat64+0xf/0x23
[15442.936490]  [<c0102bd8>] syscall_call+0x7/0xb
[15442.936519]  =======================

And after this command:

test:/# rm /usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt
rm: cannot remove 
`/usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt': 
Structure needs cleaning
test:/#

I got:

[18359.750604] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
[18359.750701]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
[18359.750755]  [<c0211f94>] xfs_iget+0x2e4/0x714
[18359.750802]  [<c0211f94>] xfs_iget+0x2e4/0x714
[18359.750849]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
[18359.750897]  [<c022cc6b>] xfs_lookup+0x52/0x78
[18359.750943]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
[18359.750990]  [<c0153e6d>] do_lookup+0xa3/0x140
[18359.751036]  [<c015578e>] __link_path_walk+0x73d/0xb5e
[18359.751086]  [<c0155bf3>] link_path_walk+0x44/0xb3
[18359.751133]  [<c0252afc>] rb_insert_color+0x4c/0xad
[18359.751180]  [<c0142044>] vma_link+0x54/0xcd
[18359.751226]  [<c0155efe>] do_path_lookup+0x176/0x191
[18359.751273]  [<c0154ef8>] getname+0x59/0x8f
[18359.751318]  [<c01566b8>] __user_walk_fd+0x2f/0x45
[18359.751364]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
[18359.751410]  [<c0252afc>] rb_insert_color+0x4c/0xad
[18359.751457]  [<c0142044>] vma_link+0x54/0xcd
[18359.751501]  [<c0150a75>] sys_lstat64+0xf/0x23
[18359.751546]  [<c0110545>] do_page_fault+0x277/0x526
[18359.751595]  [<c01102ce>] do_page_fault+0x0/0x526
[18359.751640]  [<c0102bd8>] syscall_call+0x7/0xb
[18359.751686]  [<c0360033>] rsc_parse+0x6f/0x37f
[18359.751732]  =======================
[18359.751784] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
[18359.751859]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
[18359.751906]  [<c0211f94>] xfs_iget+0x2e4/0x714
[18359.751952]  [<c0211f94>] xfs_iget+0x2e4/0x714
[18359.751998]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
[18359.752047]  [<c022cc6b>] xfs_lookup+0x52/0x78
[18359.752094]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
[18359.752140]  [<c0154bcf>] __lookup_hash+0xb1/0xe1
[18359.752191]  [<c0156241>] do_unlinkat+0x5f/0x126
[18359.752237]  [<c0110545>] do_page_fault+0x277/0x526
[18359.752285]  [<c0102bd8>] syscall_call+0x7/0xb
[18359.752331]  [<c0360033>] rsc_parse+0x6f/0x37f
[18359.752376]  =======================



Thanks a Lot

Oliver


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 12:42   ` Oliver Joa
@ 2007-03-28 14:56     ` Eric Sandeen
  2007-03-28 19:56       ` Oliver Joa
  2007-03-28 23:46     ` David Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2007-03-28 14:56 UTC (permalink / raw)
  To: Oliver Joa; +Cc: David Chinner, linux-kernel, xfs-oss

Oliver Joa wrote:
> Ok, here is a test:
> 
> test:/# find / -xdev | cpio -padm /test/
> cpio: /usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt: 
> Structure needs cleaning
> 3648371 blocks
> test:/#

That, cryptically enough, means that the filesystem has detected a 
problem and has shut down.
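
For scripts that have to react to this condition: "Structure needs
cleaning" is the C library's text for the errno EUCLEAN, which XFS
returns when it finds corruption. A minimal sketch of catching it
follows; the helper names are made up for illustration, and the
fallback value 117 is an assumption that holds for Linux on x86.

```python
import errno
import os

# EUCLEAN is Linux-specific; fall back to its usual x86 value (assumed 117)
# so the sketch still loads on other platforms.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def is_fs_corruption(exc):
    """True if an OSError carries the "Structure needs cleaning" errno."""
    return isinstance(exc, OSError) and exc.errno == EUCLEAN


def try_lstat(path):
    """lstat a path and classify the outcome as 'ok', 'corrupt' or 'other'."""
    try:
        os.lstat(path)
        return "ok"
    except OSError as exc:
        return "corrupt" if is_fs_corruption(exc) else "other"


print(try_lstat("."))
```

A copy job could use try_lstat() to stop at the first 'corrupt' result
instead of continuing to walk a filesystem that has already shut down.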

> test:/home/olli# uname -a
> Linux test 2.6.20.4-majestix-1 #1 SMP PREEMPT Tue Mar 27 12:15:41 CEST 
> 2007 i686 GNU/Linux
> 
> dmesg gives the following:
> [15442.935941] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
> line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
> [15442.936003]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
> [15442.936039]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [15442.936071]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [15442.936101]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
> [15442.936135]  [<c022cc6b>] xfs_lookup+0x52/0x78
> [15442.936167]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
> [15442.936201]  [<c0153e6d>] do_lookup+0xa3/0x140
> [15442.936234]  [<c015578e>] __link_path_walk+0x73d/0xb5e
> [15442.936278]  [<c0211655>] xfs_iunlock+0x51/0x6d
> [15442.936309]  [<c0155bf3>] link_path_walk+0x44/0xb3
> [15442.936342]  [<c0155efe>] do_path_lookup+0x176/0x191
> [15442.936373]  [<c0154ef8>] getname+0x59/0x8f
> [15442.936402]  [<c01566b8>] __user_walk_fd+0x2f/0x45
> [15442.936431]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
> [15442.936461]  [<c0150a75>] sys_lstat64+0xf/0x23
> [15442.936490]  [<c0102bd8>] syscall_call+0x7/0xb
> [15442.936519]  =======================

For one reason or another, xfs has detected a corrupted on-disk inode 
format which it cannot recognize, and shuts down.  It is likely the 
result of something which has gone wrong previously.  xfs_repair should 
fix it.  Are there other non-xfs messages in your logs indicating other 
problems prior to this?

-Eric


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 14:56     ` Eric Sandeen
@ 2007-03-28 19:56       ` Oliver Joa
  2007-03-29  0:21         ` Linda Walsh
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Joa @ 2007-03-28 19:56 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: David Chinner, linux-kernel, xfs-oss

Hi,

Eric Sandeen wrote:

[...]

> For one reason or another, xfs has detected a corrupted on-disk inode 
> format which it cannot recognize, and shuts down.  It is likely the 
> result of something which has gone wrong previously.  xfs_repair should 
> fix it.  Are there other non-xfs messages in your logs indicating other 
> problems prior to this?

I already sent the dmesg output to the list; there is nothing else.

I ran xfs_repair. Now I have some files in lost+found.

So I tried it again with a new cable:

test:/# find / -xdev | cpio -padm /test/
3648526 blocks
test:/# rm -rf test
test:/# find / -xdev | cpio -padm /test/
find: /usr/src/linux-2.6.19.2/arch/sh/kernel/cpufreq.c: Structure needs 
cleaning
find: /usr/src/linux-2.6.19.2/arch/sh/kernel/head.S: Structure needs 
cleaning
find: /usr/src/linux-2.6.19.2/arch/sh/kernel/irq.c: Structure needs cleaning
3653268 blocks
test:/#

Since the reboot I did not get any bus-reset, but the following:

[ 1878.777203] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
[ 1878.777264]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
[ 1878.777298]  [<c0211f94>] xfs_iget+0x2e4/0x714
[ 1878.777451]  [<c0211f94>] xfs_iget+0x2e4/0x714
[ 1878.777513]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
[ 1878.777576]  [<c022cc6b>] xfs_lookup+0x52/0x78
[ 1878.777636]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
[ 1878.777696]  [<c0153e6d>] do_lookup+0xa3/0x140
[ 1878.777757]  [<c015578e>] __link_path_walk+0x73d/0xb5e
[ 1878.777819]  [<c015ff0e>] mntput_no_expire+0x11/0x63
[ 1878.777879]  [<c0155c58>] link_path_walk+0xa9/0xb3
[ 1878.777941]  [<c0155bf3>] link_path_walk+0x44/0xb3
[ 1878.778001]  [<c014ca81>] nameidata_to_filp+0x24/0x33
[ 1878.778074]  [<c014cac2>] do_filp_open+0x32/0x39
[ 1878.778145]  [<c0155efe>] do_path_lookup+0x176/0x191
[ 1878.778209]  [<c0154ef8>] getname+0x59/0x8f
[ 1878.778270]  [<c01566b8>] __user_walk_fd+0x2f/0x45
[ 1878.778334]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
[ 1878.778397]  [<c014ca81>] nameidata_to_filp+0x24/0x33
[ 1878.778461]  [<c014cac2>] do_filp_open+0x32/0x39
[ 1878.778524]  [<c0150a75>] sys_lstat64+0xf/0x23
[ 1878.778585]  [<c014ecb7>] __fput+0x112/0x13c
[ 1878.778647]  [<c015ff0e>] mntput_no_expire+0x11/0x63
[ 1878.778709]  [<c014c793>] filp_close+0x51/0x58
[ 1878.778771]  [<c014d772>] sys_close+0x67/0x9e
[ 1878.778832]  [<c0102bd8>] syscall_call+0x7/0xb
[ 1878.778895]  =======================
[ 1878.974434] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
[ 1878.974493]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
[ 1878.974599]  [<c0211f94>] xfs_iget+0x2e4/0x714
[ 1878.974692]  [<c0211f94>] xfs_iget+0x2e4/0x714
[ 1878.974759]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
[ 1878.974799]  [<c022cc6b>] xfs_lookup+0x52/0x78
[ 1878.974888]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
[ 1878.974950]  [<c0153e6d>] do_lookup+0xa3/0x140
[ 1878.975015]  [<c015578e>] __link_path_walk+0x73d/0xb5e
[ 1878.975080]  [<c03645e7>] _spin_unlock_irqrestore+0xf/0x23
[ 1878.975145]  [<c0285ad3>] n_tty_receive_buf+0xc77/0xd1a
[ 1878.975210]  [<c0155bf3>] link_path_walk+0x44/0xb3
[ 1878.975275]  [<c0155efe>] do_path_lookup+0x176/0x191
[ 1878.975338]  [<c0154ef8>] getname+0x59/0x8f
[ 1878.975399]  [<c01566b8>] __user_walk_fd+0x2f/0x45
[ 1878.975461]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
[ 1878.975525]  [<c0150a75>] sys_lstat64+0xf/0x23
[ 1878.975588]  [<c0102bd8>] syscall_call+0x7/0xb
[ 1878.975650]  [<c0360033>] rsc_parse+0x6f/0x37f
[ 1878.975712]  =======================
[ 1878.975956] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
[ 1878.976012]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
[ 1878.976111]  [<c0211f94>] xfs_iget+0x2e4/0x714
[ 1878.976184]  [<c0211f94>] xfs_iget+0x2e4/0x714
[ 1878.976249]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
[ 1878.976314]  [<c022cc6b>] xfs_lookup+0x52/0x78
[ 1878.976376]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
[ 1878.976438]  [<c0153e6d>] do_lookup+0xa3/0x140
[ 1878.976500]  [<c015578e>] __link_path_walk+0x73d/0xb5e
[ 1878.976564]  [<c03645e7>] _spin_unlock_irqrestore+0xf/0x23
[ 1878.976629]  [<c0285ad3>] n_tty_receive_buf+0xc77/0xd1a
[ 1878.976701]  [<c0155bf3>] link_path_walk+0x44/0xb3
[ 1878.976766]  [<c0155efe>] do_path_lookup+0x176/0x191
[ 1878.976835]  [<c0154ef8>] getname+0x59/0x8f
[ 1878.976898]  [<c01566b8>] __user_walk_fd+0x2f/0x45
[ 1878.976961]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
[ 1878.977024]  [<c0150a75>] sys_lstat64+0xf/0x23
[ 1878.977088]  [<c0102bd8>] syscall_call+0x7/0xb
[ 1878.977150]  [<c0360033>] rsc_parse+0x6f/0x37f
[ 1878.977212]  =======================

Thanks

Oliver


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 19:56       ` Oliver Joa
@ 2007-03-29  0:21         ` Linda Walsh
  2007-03-29  2:34           ` Linda Walsh
  0 siblings, 1 reply; 11+ messages in thread
From: Linda Walsh @ 2007-03-29  0:21 UTC (permalink / raw)
  To: Oliver Joa; +Cc: Eric Sandeen, David Chinner, linux-kernel, xfs-oss

Oliver Joa wrote:
>> For one reason or another, xfs has detected a corrupted on-disk inode format 
>> which it cannot recognize, and shuts down.  It is likely the result 
>> of something which has gone wrong previously.  xfs_repair should fix 
>> it.  Are there other non-xfs messages in your logs indicating other 
>> problems prior to this?
> I already sent the dmesg output to the list; there is nothing else.
> I ran xfs_repair. Now I have some files in lost+found.
> So I tried it again with a new cable:
---
    I doubt it has changed significantly, but xfs was designed for
stable hardware.  That doesn't mean you can't pull the plug, but if
you are getting SATA resets, you may be getting some writes aborted,
with subsequent writes going through (speculation).  I know when
I had a flakey SCSI disk problem (was cable or connector in my
case), I'd get a rare XFS corruption (out of ~10 years of XFS use,
maybe 2-3 corruptions, all caused by loose connections, cables, etc).

    I'd strongly suggest you get to the bottom of the SATA reset
problem.  After that is fixed, then try to clean up your XFS disks (or
restore from backups).  Sometimes, after some intermittent hardware
problems, my xfs file system was too corrupt for me to repair (at
least with default xfs_repair options).  Doesn't mean it was irreparable,
just, I didn't know how to proceed and it was easier to restore from
a daily backup than attempt to manually repair the damage.

    The above is based solely on my own experience.  I use xfs
with the maximum number (8?) of logbufs, and noatime/nodiratime, and find
it to have among the best performance characteristics of any file system
(overall; the lowest-performing aspect was file deletion).
    XFS has a low fragmentation rate, due to how it allocates
space and can delay writes.  Even so, it is also one of the few
file systems (only?) that comes with a "defragmenter"
(xfs_fsr (file system reorganizer)).

SGI used to ship systems with xfs_fsr configured to run
weekly to "watch out for" rare, degenerate cases (important for some
real-time video apps).  My cron runs it nightly,  but often it
will pass through all file systems making no changes.

Fix the flakey hw -- then see if your xfs probs don't "magically"
go away...however, YMMV...

Linda


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-29  0:21         ` Linda Walsh
@ 2007-03-29  2:34           ` Linda Walsh
  2007-03-29  9:34             ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Linda Walsh @ 2007-03-29  2:34 UTC (permalink / raw)
  To: Linda Walsh
  Cc: Oliver Joa, Eric Sandeen, David Chinner, linux-kernel, xfs-oss

Oliver Joa wrote:
>> For one reason or another, xfs has detected a corrupted on-disk inode format 
>> which it cannot recognize, and shuts down.
----
Oh, one other thing that may or may not apply in your case.
Does your SATA disk support write caching?  Does it support
something called a barrier function?  I'm not really clear on all
the ways this can go wrong, but I believe barriers are supposed
to guarantee that previous data has been fixed on disk (not in the
write cache).  If the SATA controller issues a reset, it may very well
purge the write cache.  Theoretically, I can think of a _possibility_
that the reset disk would purge the write cache and the barrier
indicator would tell xfs to resume writing.  From a recent thread
on the xfs list, it would appear this could be a "bad" thing (like
crossing the streams a la "Ghostbusters", but in a data-integrity
context).

Just a "shot in the dark" -- absent knowing anything specific
about your hardware or situation...

If that's the case, you might want to turn off write
caching, since when xfs thinks "barriers" work, it turns
off some "protection", which can enable a significant
speedup in some situations. As an aside, some disks, I gather,
may "claim" to support barriers, but really don't.  Xfs tries
to verify the barrier claim, but I don't know that a reset
issued to the disk will have deterministic behavior across
all manufacturers' disks.  A bunch of "coulds" and "maybes",
but just thinking off the top of my head...

Linda


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-29  2:34           ` Linda Walsh
@ 2007-03-29  9:34             ` Jan Kara
  2007-03-29 11:14               ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2007-03-29  9:34 UTC (permalink / raw)
  To: Linda Walsh
  Cc: Oliver Joa, Eric Sandeen, David Chinner, linux-kernel, xfs-oss

> Oliver Joa wrote:
> >>For one reason or another, xfs has detected a corrupted on-disk inode format 
> >>which it cannot recognize, and shuts down.
> ----
> Oh, one other thing that may not apply in your case, but may.
> Does your SATA disk support write caching?  Does it support
> something called a barrier function?  (not real clear on all
> the ways this can go wrong, but I believe barriers are supposed
> to guarantee previous data has been fixed on disk (not in write
> cache).  If the SATA controller issues a reset, it may very well
> purge the write cache.  Theoretically, I can think of a _possibility_,
> that the reset disk would purge the write cache and the barrier
> indicator would tell xfs to resume writing.  From a recent thread
> on the xfs list, it would appear this could be a "bad" thing (like
> crossing the streams ala "ghostbusters", but in a data-integrity
> context).
  As far as I can remember, barrier does not mean that data is fixed on
disk. It is only a command that forces all the writes before the barrier
to be performed before all the writes after the barrier. So this is more
an ordering restriction than a data integrity thing...

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-29  9:34             ` Jan Kara
@ 2007-03-29 11:14               ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2007-03-29 11:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linda Walsh, Oliver Joa, Eric Sandeen, David Chinner,
	linux-kernel, xfs-oss

On Thu, Mar 29 2007, Jan Kara wrote:
> > Oliver Joa wrote:
> > >>For one reason or another, xfs has detected a corrupted on-disk inode format 
> > >>which it cannot recognize, and shuts down.
> > ----
> > Oh, one other thing that may not apply in your case, but may.
> > Does your SATA disk support write caching?  Does it support
> > something called a barrier function?  (not real clear on all
> > the ways this can go wrong, but I believe barriers are supposed
> > to guarantee previous data has been fixed on disk (not in write
> > cache).  If the SATA controller issues a reset, it may very well
> > purge the write cache.  Theoretically, I can think of a _possibility_,
> > that the reset disk would purge the write cache and the barrier
> > indicator would tell xfs to resume writing.  From a recent thread
> > on the xfs list, it would appear this could be a "bad" thing (like
> > crossing the streams ala "ghostbusters", but in a data-integrity
> > context).
>   As far as I can remember, barrier does not mean that data is fixed on
> disk. It is only a command that forces all the writes before the barrier
> to be performed before all the writes after the barrier. So this is more
> an ordering restriction than a data integrity thing...

A barrier write guarantees that both the data before the barrier and
the barrier itself are on disk when completion is signalled.

-- 
Jens Axboe


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 12:42   ` Oliver Joa
  2007-03-28 14:56     ` Eric Sandeen
@ 2007-03-28 23:46     ` David Chinner
  2007-03-30 13:45       ` Oliver Joa
  1 sibling, 1 reply; 11+ messages in thread
From: David Chinner @ 2007-03-28 23:46 UTC (permalink / raw)
  To: Oliver Joa; +Cc: David Chinner, linux-kernel, xfs-oss

On Wed, Mar 28, 2007 at 02:42:00PM +0200, Oliver Joa wrote:
> Hi,
> 
> David Chinner wrote:
> 
> [...]
> 
> >What is the corruption message in the log from XFS?
> >Can you please post that? Without it we really can't help you.
> >
> >Also, please check to see if there are any I/O errors
> >in the log around the time the corruption message appears.
> 
> Ok, here is a test:
> 
> test:/# find / -xdev | cpio -padm /test/
> cpio: /usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt: 
> Structure needs cleaning
> 3648371 blocks
> test:/#
> 
> test:/home/olli# uname -a
> Linux test 2.6.20.4-majestix-1 #1 SMP PREEMPT Tue Mar 27 12:15:41 CEST 
> 2007 i686 GNU/Linux
> 
> dmesg gives the following:
> [15442.935941] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
> line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
> [15442.936003]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
> [15442.936039]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [15442.936071]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [15442.936101]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4

So we have a corrupt inode. The error tells me that the
corrupted inode is either a regular file, directory or link.
Unfortunately it doesn't tell us the inode number that is
corrupted.
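
The fields the message does carry can be pulled out mechanically when
comparing several reports. A rough sketch, assuming the message layout
seen in this thread (the regular expression and the function name are
illustrative, not part of any XFS tool):

```python
import re

# Layout assumed from the messages quoted in this thread.
XFS_ERR = re.compile(
    r'Filesystem "(?P<dev>[^"]+)": XFS internal error '
    r'(?P<func>\w+)\((?P<tag>\d+)\) at line (?P<line>\d+) '
    r'of file (?P<file>\S+?)\.\s+Caller (?P<caller>0x[0-9a-f]+)'
)


def parse_xfs_error(text):
    """Extract device, function, tag, source line, file and caller
    from one XFS internal error message, or return None."""
    # The mail client wrapped the message, so collapse all whitespace first.
    flat = " ".join(text.split())
    m = XFS_ERR.search(flat)
    if m is None:
        return None
    fields = m.groupdict()
    fields["tag"] = int(fields["tag"])
    fields["line"] = int(fields["line"])
    return fields


sample = """[15442.935941] Filesystem "sda3": XFS internal error xfs_iformat(6) at
line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94"""

print(parse_xfs_error(sample))
```

Every report in this thread parses to the same function, tag and source
line, which suggests the same check is failing each time.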

> test:/# rm /usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt
> rm: cannot remove 
> `/usr/src/linux-2.6.20.2/Documentation/networking/NAPI_HOWTO.txt': 
> Structure needs cleaning
> test:/#

Once the filesystem shuts down this will happen to every operation.

Next time you get a shutdown, can you unmount the filesystems and
run xfs_check and then "xfs_repair -n" on the filesystem. These will
tell you the inode numbers that are bad. Can you post the errors
reported by these tools?

Once you have the bad inode numbers, can you run the following
on the bad inodes:

# xfs_db -r -c "inode <inum>" -c "p" <device>

E.g.:

# xfs_db -r -c "inode 128" -c p /dev/sdb8
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 2 (extents)
......

and post the output for us? That will enable us to see exactly what
the corruption is on the inode.

Cheers,

Dave.


> 
> I got:
> 
> [18359.750604] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
> line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
> [18359.750701]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
> [18359.750755]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [18359.750802]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [18359.750849]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
> [18359.750897]  [<c022cc6b>] xfs_lookup+0x52/0x78
> [18359.750943]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
> [18359.750990]  [<c0153e6d>] do_lookup+0xa3/0x140
> [18359.751036]  [<c015578e>] __link_path_walk+0x73d/0xb5e
> [18359.751086]  [<c0155bf3>] link_path_walk+0x44/0xb3
> [18359.751133]  [<c0252afc>] rb_insert_color+0x4c/0xad
> [18359.751180]  [<c0142044>] vma_link+0x54/0xcd
> [18359.751226]  [<c0155efe>] do_path_lookup+0x176/0x191
> [18359.751273]  [<c0154ef8>] getname+0x59/0x8f
> [18359.751318]  [<c01566b8>] __user_walk_fd+0x2f/0x45
> [18359.751364]  [<c0150a09>] vfs_lstat_fd+0x16/0x3d
> [18359.751410]  [<c0252afc>] rb_insert_color+0x4c/0xad
> [18359.751457]  [<c0142044>] vma_link+0x54/0xcd
> [18359.751501]  [<c0150a75>] sys_lstat64+0xf/0x23
> [18359.751546]  [<c0110545>] do_page_fault+0x277/0x526
> [18359.751595]  [<c01102ce>] do_page_fault+0x0/0x526
> [18359.751640]  [<c0102bd8>] syscall_call+0x7/0xb
> [18359.751686]  [<c0360033>] rsc_parse+0x6f/0x37f
> [18359.751732]  =======================
> [18359.751784] Filesystem "sda3": XFS internal error xfs_iformat(6) at 
> line 492 of file fs/xfs/xfs_inode.c.  Caller 0xc0211f94
> [18359.751859]  [<c0216dba>] xfs_iread+0x4ee/0x6e8
> [18359.751906]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [18359.751952]  [<c0211f94>] xfs_iget+0x2e4/0x714
> [18359.751998]  [<c02293be>] xfs_dir_lookup_int+0x7d/0xd4
> [18359.752047]  [<c022cc6b>] xfs_lookup+0x52/0x78
> [18359.752094]  [<c0238a22>] xfs_vn_lookup+0x3b/0x70
> [18359.752140]  [<c0154bcf>] __lookup_hash+0xb1/0xe1
> [18359.752191]  [<c0156241>] do_unlinkat+0x5f/0x126
> [18359.752237]  [<c0110545>] do_page_fault+0x277/0x526
> [18359.752285]  [<c0102bd8>] syscall_call+0x7/0xb
> [18359.752331]  [<c0360033>] rsc_parse+0x6f/0x37f
> [18359.752376]  =======================
> 
> 
> 
> Thanks a Lot
> 
> Oliver

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 23:46     ` David Chinner
@ 2007-03-30 13:45       ` Oliver Joa
  0 siblings, 0 replies; 11+ messages in thread
From: Oliver Joa @ 2007-03-30 13:45 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-kernel, xfs-oss

Hi,

David Chinner wrote:

[...]

> Next time you get a shutdown, can you unmount the filesystems and
> run xfs_check and then "xfs_repair -n" on the filesystem. These will
> tell you the inode numbers that are bad. Can you post the errors
> reported by these tools?


xfs_check gives this:

bad format 0 for inode 8458341 type 0100000
bad format 0 for inode 8458344 type 0100000
bad format 0 for inode 8458348 type 0100000
block 1/4962 type unknown not expected
block 1/4963 type unknown not expected
block 1/4970 type unknown not expected
block 1/4975 type unknown not expected
block 1/4976 type unknown not expected
link count mismatch for inode 8458341 (name ?), nlink 0, counted 1
link count mismatch for inode 8458344 (name ?), nlink 0, counted 1
link count mismatch for inode 8458348 (name ?), nlink 0, counted 1



xfs_repair -n gives this:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
         - scan filesystem freespace and inode maps...
         - found root inode chunk
Phase 3 - for each AG...
         - scan (but don't clear) agi unlinked lists...
         - process known inodes and perform inode discovery...
         - agno = 0
         - agno = 1
bad inode format in inode 8458341
bad inode format in inode 8458344
bad inode format in inode 8458348
bad inode format in inode 8458341
would have cleared inode 8458341
bad inode format in inode 8458344
would have cleared inode 8458344
bad inode format in inode 8458348
would have cleared inode 8458348
         - agno = 2
         - agno = 3
         - agno = 4
         - agno = 5
         - agno = 6
         - agno = 7
         - agno = 8
         - agno = 9
         - agno = 10
         - agno = 11
         - agno = 12
         - agno = 13
         - agno = 14
         - agno = 15
         - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
         - setting up duplicate extent list...
         - check for inodes claiming duplicate blocks...
         - agno = 0
         - agno = 1
entry "cpufreq.c" at block 0 offset 152 in directory inode 8458336 
references free inode 8458341
         would clear inode number in entry at offset 152...
entry "head.S" at block 0 offset 232 in directory inode 8458336 
references free inode 8458344
         would clear inode number in entry at offset 232...
entry "irq.c" at block 0 offset 320 in directory inode 8458336 
references free inode 8458348
         would clear inode number in entry at offset 320...
bad inode format in inode 8458341
would have cleared inode 8458341
bad inode format in inode 8458344
would have cleared inode 8458344
bad inode format in inode 8458348
would have cleared inode 8458348
         - agno = 2
         - agno = 3
         - agno = 4
         - agno = 5
         - agno = 6
         - agno = 7
         - agno = 8
         - agno = 9
         - agno = 10
         - agno = 11
         - agno = 12
         - agno = 13
         - agno = 14
         - agno = 15
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
         - traversing filesystem starting at / ...
entry "cpufreq.c" in directory inode 8458336 points to free inode 
8458341, would junk entry
entry "head.S" in directory inode 8458336 points to free inode 8458344, 
would junk entry
entry "irq.c" in directory inode 8458336 points to free inode 8458348, 
would junk entry
         - traversal finished ...
         - traversing all unattached subtrees ...
         - traversals finished ...
         - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
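
Since the same inodes are reported several times, here is a small
sketch that reduces output like the above to a unique list of bad
inodes plus the directory entries that point at them. The two regular
expressions are assumptions based only on the wording shown here, and
the function name is made up.

```python
import re

# Patterns assumed from the xfs_repair -n wording shown in this mail.
BAD = re.compile(r"bad inode format in inode (\d+)")
ENTRY = re.compile(
    r'entry "([^"]+)" at block \d+ offset \d+ in directory inode (\d+) '
    r"references free inode (\d+)"
)


def summarize_repair_output(text):
    """Return (unique bad inode numbers in first-seen order,
    {inode: (entry name, parent directory inode)})."""
    flat = " ".join(text.split())  # undo the mail client's line wrapping
    bad = list(dict.fromkeys(int(n) for n in BAD.findall(flat)))
    names = {
        int(ino): (name, int(parent))
        for name, parent, ino in ENTRY.findall(flat)
    }
    return bad, names


sample = """
bad inode format in inode 8458341
bad inode format in inode 8458344
bad inode format in inode 8458348
bad inode format in inode 8458341
would have cleared inode 8458341
entry "cpufreq.c" at block 0 offset 152 in directory inode 8458336
references free inode 8458341
         would clear inode number in entry at offset 152...
entry "head.S" at block 0 offset 232 in directory inode 8458336
references free inode 8458344
"""

bad, names = summarize_repair_output(sample)
for ino in bad:
    print(ino, names.get(ino))
```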


> Once you have the bad inode numbers, can you run the following
> on the bad inodes:
> 
> # xfs_db -r -c "inode <inum>" -c "p" <device>


xfs_db on inode 8458341 gives:

core.magic = 0x494e
core.mode = 0100644
core.version = 1
core.format = 0 (dev)
core.nlinkv1 = 1
core.uid = 0
core.gid = 0
core.flushiter = 6
core.atime.sec = Tue Jan 30 22:42:51 2007
core.atime.nsec = 000000000
core.mtime.sec = Wed Jan 10 19:10:37 2007
core.mtime.nsec = 000000000
core.ctime.sec = Wed Mar 28 18:15:36 2007
core.ctime.nsec = 612718490
core.size = 6209
core.nblocks = 2
core.extsize = 0
core.nextents = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.gen = 0
next_unlinked = null
u.dev = 0



xfs_db on inode 8458344 gives:

core.magic = 0x494e
core.mode = 0100644
core.version = 1
core.format = 0 (dev)
core.nlinkv1 = 1
core.uid = 0
core.gid = 0
core.flushiter = 6
core.atime.sec = Tue Jan 30 22:42:51 2007
core.atime.nsec = 000000000
core.mtime.sec = Wed Jan 10 19:10:37 2007
core.mtime.nsec = 000000000
core.ctime.sec = Wed Mar 28 18:15:36 2007
core.ctime.nsec = 612849562
core.size = 2326
core.nblocks = 1
core.extsize = 0
core.nextents = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.gen = 0
next_unlinked = null
u.dev = 0


xfs_db on inode 8458336 gives:

core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 2 (extents)
core.nlinkv1 = 5
core.uid = 0
core.gid = 0
core.flushiter = 1
core.atime.sec = Tue Jan 30 22:42:51 2007
core.atime.nsec = 906063000
core.mtime.sec = Wed Jan 10 19:10:37 2007
core.mtime.nsec = 000000000
core.ctime.sec = Tue Jan 30 22:44:48 2007
core.ctime.nsec = 428077021
core.size = 4096
core.nblocks = 1
core.extsize = 0
core.nextents = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.gen = 0
next_unlinked = null
u.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,528704,1,0]
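
What stands out in the two file inodes is that core.mode says regular
file while core.format says "dev". The sketch below checks for that
kind of mismatch in xfs_db "p" output. The table of allowed formats per
file type is my simplified assumption of what the kernel check
enforces, not a copy of it, and the function names are made up.

```python
import stat

# Data fork format codes as printed by xfs_db in core.format.
FMT_NAMES = {0: "dev", 1: "local", 2: "extents", 3: "btree", 4: "uuid"}


def parse_xfs_db(text):
    """Turn 'key = value' lines from xfs_db output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_core(fields):
    """Return (ls-style mode string, list of problems found)."""
    mode = int(fields["core.mode"], 8)           # printed in octal
    fmt = int(fields["core.format"].split()[0])  # e.g. "0 (dev)" -> 0
    problems = []
    if stat.S_ISREG(mode) or stat.S_ISDIR(mode) or stat.S_ISLNK(mode):
        # Assumption: regular files need extents or btree; directories
        # and symlinks may also be stored inline ("local").
        allowed = {2, 3} if stat.S_ISREG(mode) else {1, 2, 3}
        if fmt not in allowed:
            problems.append(
                "format %d (%s) not valid for mode %s"
                % (fmt, FMT_NAMES.get(fmt, "?"), stat.filemode(mode))
            )
    return stat.filemode(mode), problems


bad_inode = "core.mode = 0100644\ncore.format = 0 (dev)\ncore.nextents = 1"
good_dir = "core.mode = 040755\ncore.format = 2 (extents)"

print(check_core(parse_xfs_db(bad_inode)))
print(check_core(parse_xfs_db(good_dir)))
```

Run against the dumps above, inodes 8458341 and 8458344 are flagged
and the directory inode 8458336 is not.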

[...]

> and post the output for us? That will enable us to see exactly what
> the corruption is on the inode.

Here it is...

Thanks a lot...

Olli


* Re: Corrupt XFS -Filesystems on new Hardware and Kernel
  2007-03-28 11:31 ` Corrupt XFS -Filesystems on new Hardware and Kernel David Chinner
  2007-03-28 12:42   ` Oliver Joa
@ 2007-04-11  7:36   ` Oliver Joa
  1 sibling, 0 replies; 11+ messages in thread
From: Oliver Joa @ 2007-04-11  7:36 UTC (permalink / raw)
  To: xfs-oss; +Cc: linux-kernel

Hi,

David Chinner wrote:
> On Tue, Mar 27, 2007 at 06:16:04PM +0200, Oliver Joa wrote:
>> Hi,
>>
>> since some weeks i try to get my new hardware running:
>>
>> Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
>> Intel DP965LT Mainboard
>> Seagate SATA-Harddisk in AHCI-Mode
>>
>> After some hours of running or after some heavy file-i/o
>> (find / | cpio -padm /test) I always get a corrupted
>> XFS-filesystem.

I solved the problem: I ran memtest and found a lot of memory errors, 
then I bought another brand of memory and everything is working fine. The 
first memory I used was brand new. I bought it together with the board 
and processor. It was from Kingston. Now I have one from Crucial, which 
seems to work fine.

Thanks to everyone for the help

Olli
