* Repeated XFS Crash on x86_64 feisty
@ 2007-10-12 8:02 Nick Gregory
2007-10-12 19:52 ` Justin Piszcz
0 siblings, 1 reply; 4+ messages in thread
From: Nick Gregory @ 2007-10-12 8:02 UTC (permalink / raw)
To: xfs
Hi,
I run a number of x86_64 Ubuntu Feisty (2.6.20-16-server) systems. Each
has a near-identical hardware spec, i.e. each has a large (>6TB)
XFS storage partition sitting on top of a RAID 6 array (using the Areca
ARC-1160).
Over the last couple of months one system has had its XFS filesystem
crash on a semi-frequent basis (1-2 times a week). Googling the
error, it first seemed to be memory-related, so I swapped in
some new ECC memory - unfortunately the problem persists.
The filesystem is reasonably active, but the issue doesn't seem to be
load-related, as it occurs at random times of the day.
Can anyone give me any insight into the best place to start looking to
track down the issue?
Thanks in advance
Nick
XFS Crash dmesg:
[44537.156249] XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1563
of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8824e188
[44537.156321]
[44537.156321] Call Trace:
[44537.156391] [<ffffffff8824c6c2>] :xfs:xfs_free_ag_extent+0x1b2/0x700
[44537.156416] [<ffffffff8824e188>] :xfs:xfs_free_extent+0xc8/0x110
[44537.156443] [<ffffffff8825c982>] :xfs:xfs_bmap_finish+0x102/0x190
[44537.156482] [<ffffffff8827e28c>] :xfs:xfs_itruncate_finish+0x1ac/0x300
[44537.156513] [<ffffffff88297976>] :xfs:xfs_setattr+0x8a6/0xf30
[44537.156557] [<ffffffff882a3ee3>] :xfs:xfs_vn_setattr+0x143/0x190
[44537.156578] [<ffffffff8022dee4>] notify_change+0x164/0x330
[44537.156589] [<ffffffff802d742e>] do_truncate+0x4e/0x70
[44537.156597] [<ffffffff8020d56a>] permission+0xca/0x140
[44537.156602] [<ffffffff80211e59>] may_open+0x1e9/0x260
[44537.156609] [<ffffffff8021b598>] open_namei+0x2a8/0x680
[44537.156613] [<ffffffff8021b157>] cp_new_stat+0xe7/0x100
[44537.156617] [<ffffffff802a3860>] autoremove_wake_function+0x0/0x30
[44537.156625] [<ffffffff80228a6c>] do_filp_open+0x1c/0x40
[44537.156658] [<ffffffff80219eda>] do_sys_open+0x5a/0x100
[44537.156666] [<ffffffff8026111e>] system_call+0x7e/0x83
[44537.156675]
[44537.156685] xfs_force_shutdown(sda3,0x8) called from line 4272 of
file fs/xfs/xfs_bmap.c. Return address = 0xffffffff8825c9be
[44537.157614] Filesystem "sda3": Corruption of in-memory data detected.
Shutting down filesystem: sda3
[44537.157664] Please umount the filesystem, and rectify the problem(s)
On remount of the filesystem:
[45035.275936] xfs_force_shutdown(sda3,0x1) called from line 424 of file
fs/xfs/xfs_rw.c. Return address = 0xffffffff8829bf3a
[45035.275948] xfs_force_shutdown(sda3,0x1) called from line 424 of file
fs/xfs/xfs_rw.c. Return address = 0xffffffff8829bf3a
[45039.698366] XFS mounting filesystem sda3
[45039.822294] Starting XFS recovery on filesystem: sda3 (logdev: internal)
[45040.330263] XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1563
of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8824e188
[45040.330319]
[45040.330320] Call Trace:
[45040.330358] [<ffffffff8824c6c2>] :xfs:xfs_free_ag_extent+0x1b2/0x700
[45040.330382] [<ffffffff8824e188>] :xfs:xfs_free_extent+0xc8/0x110
[45040.330413] [<ffffffff88289fce>] :xfs:xlog_recover_finish+0x1be/0x2d0
[45040.330440] [<ffffffff8828e087>] :xfs:xfs_mountfs+0xa77/0xca0
[45040.330451] [<ffffffff8025dc60>] generic_unplug_device+0x0/0x30
[45040.330457] [<ffffffff8020c002>] _atomic_dec_and_lock+0x42/0x80
[45040.330481] [<ffffffff88294e87>] :xfs:xfs_mount+0x997/0xa80
[45040.330503] [<ffffffff882a6ca8>] :xfs:xfs_fs_fill_super+0x98/0x230
[45040.330511] [<ffffffff80267692>] __down_write_nested+0x12/0xb0
[45040.330516] [<ffffffff80232a0e>] strlcpy+0x4e/0x80
[45040.330523] [<ffffffff802e1fc2>] get_filesystem+0x12/0x40
[45040.330528] [<ffffffff802d8e4f>] sget+0x3bf/0x3e0
[45040.330533] [<ffffffff802d87a0>] set_bdev_super+0x0/0x10
[45040.330541] [<ffffffff802d9aff>] get_sb_bdev+0x11f/0x190
[45040.330559] [<ffffffff882a6c10>] :xfs:xfs_fs_fill_super+0x0/0x230
[45040.330570] [<ffffffff802d9366>] vfs_kern_mount+0xc6/0x170
[45040.330579] [<ffffffff802d946a>] do_kern_mount+0x4a/0x80
[45040.330586] [<ffffffff802e3f89>] do_mount+0x6f9/0x7a0
[45040.330592] [<ffffffff80208b48>] __handle_mm_fault+0x668/0xab0
[45040.330601] [<ffffffff8020e6e0>] link_path_walk+0xd0/0xf0
[45040.330608] [<ffffffff80222db1>] __up_read+0x21/0xb0
[45040.330614] [<ffffffff8026a299>] do_page_fault+0x4b9/0x890
[45040.330623] [<ffffffff80208923>] __handle_mm_fault+0x443/0xab0
[45040.330629] [<ffffffff802c6074>] zone_statistics+0x34/0x80
[45040.330652] [<ffffffff8023e53b>] __get_free_pages+0x1b/0x40
[45040.330661] [<ffffffff8024e38b>] sys_mount+0x9b/0x100
[45040.330670] [<ffffffff8026111e>] system_call+0x7e/0x83
[45040.330680]
[45040.365771] Ending XFS recovery on filesystem: sda3 (logdev: internal)
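For anyone triaging a similar report, the interesting lines in a capture like the one above can be pulled out mechanically. A minimal sketch, with regexes and an abbreviated sample modelled on the message format shown above (an illustration, not an official tool):

```python
import re

# Abbreviated sample in the shape of the dmesg output above.
DMESG = """\
[44537.156249] XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1563 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8824e188
[44537.156685] xfs_force_shutdown(sda3,0x8) called from line 4272 of file fs/xfs/xfs_bmap.c. Return address = 0xffffffff8825c9be
[44537.157614] Filesystem "sda3": Corruption of in-memory data detected.
"""

# The two event types XFS logs before shutting down: internal errors
# (with the failing source location) and forced shutdowns (with the
# device and a flag value, e.g. 0x8 for in-memory corruption).
ERROR_RE = re.compile(r'XFS internal error (\w+) at line (\d+) of file (\S+)\.')
SHUTDOWN_RE = re.compile(r'xfs_force_shutdown\((\w+),(0x[0-9a-f]+)\)')

def xfs_events(text):
    """Return a list of ('error', macro, file, line) and
    ('shutdown', device, flags) tuples found in a dmesg capture."""
    events = []
    for line in text.splitlines():
        m = ERROR_RE.search(line)
        if m:
            events.append(('error', m.group(1), m.group(3), int(m.group(2))))
        m = SHUTDOWN_RE.search(line)
        if m:
            events.append(('shutdown', m.group(1), m.group(2)))
    return events

print(xfs_events(DMESG))
```

Grouping repeated reports this way makes it easy to see whether the errors always come from the same allocator path, as they do in this thread.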
* Re: Repeated XFS Crash on x86_64 feisty
2007-10-12 8:02 Repeated XFS Crash on x86_64 feisty Nick Gregory
@ 2007-10-12 19:52 ` Justin Piszcz
2007-10-14 12:21 ` Nick Gregory
0 siblings, 1 reply; 4+ messages in thread
From: Justin Piszcz @ 2007-10-12 19:52 UTC (permalink / raw)
To: Nick Gregory; +Cc: xfs
Have you run memtest86? Have you checked the CPU? Is it an AMD64 CPU
where the memory controller is onboard? If the controller is damaged or
overheating, it could cause problems with the memory.
On Fri, 12 Oct 2007, Nick Gregory wrote:
> Hi,
>
> I run a number of x86_64 Ubuntu Feisty (2.6.20-16-server) systems. Each has a
> near-identical hardware spec, i.e. each has a large (>6TB) XFS storage
> partition sitting on top of a RAID 6 array (using the Areca ARC-1160).
>
> Over the last couple of months one system has had its XFS filesystem crash on
> a semi-frequent basis (1-2 times a week). Googling the error, it first
> seemed to be memory-related, so I swapped in some new ECC memory -
> unfortunately the problem persists.
>
> The filesystem is reasonably active, but the issue doesn't seem to be
> load-related, as it occurs at random times of the day.
>
> Can anyone give me any insight into the best place to start looking to track
> down the issue?
>
> Thanks in advance
>
> Nick
>
> [XFS crash and remount dmesg snipped; quoted in full in the original
> message above]
* Re: Repeated XFS Crash on x86_64 feisty
2007-10-12 19:52 ` Justin Piszcz
@ 2007-10-14 12:21 ` Nick Gregory
2007-10-14 13:46 ` Peter Grandi
0 siblings, 1 reply; 4+ messages in thread
From: Nick Gregory @ 2007-10-14 12:21 UTC (permalink / raw)
To: Justin Piszcz; +Cc: xfs
Justin Piszcz wrote:
> Have you run memtest86? Have you checked the CPU? Is it an AMD64 CPU
> where the memory controller is onboard? If the controller is damaged or
> overheating, it could cause problems with the memory.
>
Thanks for the suggestions. Shortly after my posting I had a drive fail,
followed shortly by another three, taking out 6TB of data :-( - luckily
it was part of a replicated pair.
So it's certainly a hardware thing, either a bad RAID card or a bad batch
of drives, rather than anything to do with XFS.
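If it is a bad batch of drives, SMART data usually shows it before total failure. Below is a minimal sketch of scanning `smartctl -A` output for the attributes that most directly indicate failing media; the sample text and field positions assume the usual ATA attribute-table format (and with a RAID controller in the path, smartctl typically needs a controller-specific `-d` option to reach the drives at all):

```python
# SMART attributes that most directly indicate failing media.
SUSPECT = {5: "Reallocated_Sector_Ct",
           197: "Current_Pending_Sector",
           198: "Offline_Uncorrectable"}

# Abbreviated sample in the usual `smartctl -A` attribute-table shape
# (hypothetical values, not taken from the thread).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   034   050   000    Old_age   Always       -       34
"""

def suspect_attrs(text):
    """Return {attribute_name: raw_value} for nonzero suspect attributes.

    Assumes each attribute row starts with a numeric ID and ends with a
    plain integer raw value; real smartctl output can embed extra text
    in the raw field for some attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue
        attr_id, raw = int(fields[0]), int(fields[-1])
        if attr_id in SUSPECT and raw > 0:
            found[SUSPECT[attr_id]] = raw
    return found

# Nonzero reallocated or pending sectors suggest a drive on its way out.
print(suspect_attrs(SAMPLE))
```

Running a scan like this across every drive in the array makes a "bad batch" visible as several drives climbing in the same attributes at once.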
* Re: Repeated XFS Crash on x86_64 feisty
2007-10-14 12:21 ` Nick Gregory
@ 2007-10-14 13:46 ` Peter Grandi
0 siblings, 0 replies; 4+ messages in thread
From: Peter Grandi @ 2007-10-14 13:46 UTC (permalink / raw)
To: Linux XFS
>>> On Sun, 14 Oct 2007 13:21:36 +0100, Nick Gregory
>>> <nick@openenterprise.co.uk> said:
nick> [ ... ] Shortly after my posting I had a drive fail,
nick> followed shortly by another three, taking out 6TB of data :-( ,
nick> luckily part of a replicated pair. So it's certainly a
nick> hardware thing, either a bad raid card or a bad batch of
nick> drives, rather than anything to do with XFS.
Multiple drive failures are somewhat more common than desirable,
and RAID is based on the idea that failures are uncorrelated.
But then I have seen storage systems where all the drives were
of the same brand and model, had nearly consecutive serial
numbers, and of course got delivered at the same time; never
mind that all these drives end up in the same rack, on the
same cooling and power circuit, and get started and stopped at
the same times.
For my home PC I have 4 drives of different brands and models,
ordered at different times from different shops, and the backups
are in external cases that are usually disconnected and unplugged
from the mains. Can't be too careful :-).
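The point about uncorrelated failures can be made concrete with a quick binomial calculation. A sketch with assumed numbers - a 3% annual failure rate per drive and a 16-drive array (the ARC-1160 is a 16-port card); both figures are illustrative, not from the thread:

```python
from math import comb

p = 0.03 / 52          # assumed weekly failure probability per drive (3% AFR)
n = 16                 # assumed array size

def prob_at_least(k, n, p):
    """P(at least k of n independent drives fail): binomial upper tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Under the independence assumption, even one failure in a given week
# is rare, and four in the same week is essentially impossible.
print(f"P(>=1 failure/week) = {prob_at_least(1, n, p):.2e}")
print(f"P(>=4 failures/week) = {prob_at_least(4, n, p):.2e}")
```

So when four drives do go down together, as in this thread, independence is the assumption to discard: the failures almost certainly share a cause (controller, power, cooling, or a common manufacturing batch).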