* Is kernel 3.6.1 or filestreams option toxic ?
@ 2012-10-22 14:14 Yann Dupont
2012-10-23 8:24 ` Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?) Yann Dupont
0 siblings, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-22 14:14 UTC (permalink / raw)
To: xfs
Hello,
Last week, I encountered problems with xfs volumes on several machines.
The kernel hung under heavy load, and I had to hard-reset. After reboot,
the xfs volume could not be mounted, and xfs_repair did not manage to
recover the volume cleanly on 2 different machines.
To put things in perspective: it wasn't production data, so it doesn't
matter whether I recover the data or not. What matters more to me is
understanding why things went wrong...
I have been using XFS for a long time, on lots of data, and this is the
first time I have encountered such a problem. However, I was using an
unusual option, filestreams, and was running kernel 3.6.1, so I wonder
whether either has something to do with the crash.
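For reference, filestreams is enabled at mount time. A minimal sketch of how such a volume would be mounted, with a hypothetical device name (only the /XCEPH-PROD/data mountpoint comes from this report), shown as dry-run echoes rather than real commands:

```shell
# Dry-run sketch: commands are echoed, not executed.
# DEV is a hypothetical device name, not from the original report.
DEV=/dev/mapper/example-vol
MNT=/XCEPH-PROD/data
OPTS="rw,noatime,filestreams"

# The filestreams allocator keeps files created in the same directory
# in the same allocation group, to limit interleaving of concurrent
# write streams (e.g. many ceph-osd workers writing at once).
echo "mount -t xfs -o $OPTS $DEV $MNT"

# Afterwards, /proc/mounts shows whether the option is active:
echo "grep $MNT /proc/mounts"
```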
I have nothing very conclusive in the kernel logs, apart from this:
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.569890]
INFO: task ceph-osd:17856 blocked for more than 120 seconds.
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.569941]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.569987]
ceph-osd D ffff88056416b1a0 0 17856 1 0x00000000
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.569993]
ffff88056416aed0 0000000000000086 ffff880590751fd8 ffff88000c67eb00
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570047]
ffff880590751fd8 ffff880590751fd8 ffff880590751fd8 ffff88056416aed0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570101]
0000000000000001 ffff88056416aed0 ffff880a15240d00 ffff880a15240d60
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570156] Call
Trace:
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570187]
[<ffffffff81041335>] ? exit_mm+0x85/0x120
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570216]
[<ffffffff81042a94>] ? do_exit+0x154/0x8e0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570248]
[<ffffffff8114ec79>] ? file_update_time+0xa9/0x100
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570278]
[<ffffffff81043568>] ? do_group_exit+0x38/0xa0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570309]
[<ffffffff81051bc6>] ? get_signal_to_deliver+0x1a6/0x5e0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570341]
[<ffffffff8100223e>] ? do_signal+0x4e/0x970
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570371]
[<ffffffff81170e2e>] ? fsnotify+0x24e/0x340
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570402]
[<ffffffff8100c995>] ? fpu_finit+0x15/0x30
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570431]
[<ffffffff8100db34>] ? restore_i387_xstate+0x64/0x1c0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570464]
[<ffffffff8108e0d2>] ? sys_futex+0x92/0x1b0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570493]
[<ffffffff81002bf5>] ? do_notify_resume+0x75/0xc0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570525]
[<ffffffff813c60fa>] ? int_signal+0x12/0x17
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570553]
INFO: task ceph-osd:17857 blocked for more than 120 seconds.
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570583]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570628]
ceph-osd D ffff8801161fe720 0 17857 1 0x00000000
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570632]
ffff8801161fe450 0000000000000086 ffffffffffffffe0 ffff880a17c73c30
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570687]
ffff88011347ffd8 ffff88011347ffd8 ffff88011347ffd8 ffff8801161fe450
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570740]
ffff8801161fe450 ffff8801161fe450 ffff880a15240d00 ffff880a15240d60
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570794] Call
Trace:
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570818]
[<ffffffff81041335>] ? exit_mm+0x85/0x120
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570846]
[<ffffffff81042a94>] ? do_exit+0x154/0x8e0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570875]
[<ffffffff81043568>] ? do_group_exit+0x38/0xa0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570905]
[<ffffffff81051bc6>] ? get_signal_to_deliver+0x1a6/0x5e0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570935]
[<ffffffff8100223e>] ? do_signal+0x4e/0x970
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570967]
[<ffffffff81302d24>] ? sys_sendto+0x114/0x150
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.570996]
[<ffffffff8108e0d2>] ? sys_futex+0x92/0x1b0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.571024]
[<ffffffff81002bf5>] ? do_notify_resume+0x75/0xc0
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.571054]
[<ffffffff813c60fa>] ? int_signal+0x12/0x17
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.571082]
INFO: task ceph-osd:17858 blocked for more than 120 seconds.
Oct 14 14:37:21 hanyu.u14.univ-nantes.prive kernel: [532905.571111]
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
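For future reports, the hung-task watchdog behind these messages can be inspected or tuned, and a full blocked-task dump can be requested on demand via SysRq. A sketch follows (dry-run echoes only, since the real commands need root and act on a live kernel):

```shell
# Dry-run sketch: the commands are echoed, not executed (they need root).
TIMEOUT_KNOB=/proc/sys/kernel/hung_task_timeout_secs

# Inspect the current timeout (120 s here, per the messages above):
echo "cat $TIMEOUT_KNOB"

# Writing 0 disables the warning entirely, as the log message itself says:
echo "echo 0 > $TIMEOUT_KNOB"

# SysRq 'w' dumps all tasks in uninterruptible (D) sleep with their
# stacks, which gives a more complete picture than waiting for the
# 120 s watchdog to fire:
echo "echo w > /proc/sysrq-trigger"
```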
I wasn't able to cleanly shut down the servers after that. On 2
machines, the xfs volumes (12 TB each) couldn't be mounted anymore after
the hard reset, and needed xfs_repair -L ...
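For context, the usual escalation when an XFS volume refuses to mount after a crash looks roughly like this (hypothetical device name; dry-run echoes, since xfs_repair -L is destructive):

```shell
# Dry-run sketch with a hypothetical device name.
DEV=/dev/mapper/example-vol

# 1. Try a plain mount first, so normal journal recovery can run:
echo "mount -t xfs $DEV /mnt"

# 2. If log recovery fails, a no-modify check shows the damage without
#    touching the disk:
echo "xfs_repair -n $DEV"

# 3. Last resort: -L zeroes the journal, discarding any transactions
#    still in it, which is why some data loss is expected afterwards:
echo "xfs_repair -L $DEV"
```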
On 1 machine, xfs_repair ran to the end, but with millions of errors,
which gives this in the end :(
344010712 /XCEPH-PROD/data/osd.8
6841649480 /XCEPH-PROD/data/lost+found/
I understand that xfs_repair -L always leads to some data loss, but surely not to that extent?
On the other machine, xfs_repair segfaults, after lots of messages like
these (and I mean really lots):
block (0,1008194-1008194) multiply claimed by cnt space tree, state - 2
block (0,1008200-1008200) multiply claimed by cnt space tree, state - 2
block (0,1012323-1012323) multiply claimed by cnt space tree, state - 2
...
agf_freeblks 87066179, counted 87066033 in ag 0
agi_freecount 489403, counted 488952 in ag 0
agi unlinked bucket 1 is 7681 in ag 0 (inode=7681)
agi unlinked bucket 5 is 67781 in ag 0 (inode=67781)
agi unlinked bucket 6 is 10950 in ag 0 (inode=10950)
...
block (3,30847085-30847085) multiply claimed by cnt space tree, state - 2
block (3,27384823-27384823) multiply claimed by cnt space tree, state - 2
block (3,30115747-30115747) multiply claimed by cnt space tree, state - 2
...
agf_freeblks 90336213, counted 302201427 in ag 3
agf_longest 6144, counted 167772160 in ag 3
inode chunk claims used block, inobt block - agno 3, bno 2380, inopb 16
inode chunk claims used block, inobt block - agno 3, bno 280918, inopb 16
...
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
found inodes not in the inode allocation tree
- process known inodes and perform inode discovery...
- agno = 0
7f1738c17700: Badness in key lookup (length)
bp=(bno 2848, len 16384 bytes) key=(bno 2848, len 8192 bytes)
7f1738c17700: Badness in key lookup (length)
bp=(bno 3840, len 16384 bytes) key=(bno 3840, len 8192 bytes)
7f1738c17700: Badness in key lookup (length)
bp=(bno 5456, len 16384 bytes) key=(bno 5456, len 8192 bytes)
...
and in the end, xfs_repair segfaults.
Those machines are part of a 12-machine ceph cluster (Ceph itself is
pure user space). All nodes are independent (not in the same computer
room), but all had been running 3.6.1 for some days, and all were using
xfs with the filestreams option (I was trying to prevent xfs fragmentation).
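As an aside, whether filestreams is actually needed can be judged by measuring fragmentation directly. A sketch (dry-run echoes; the device name is hypothetical, the mountpoint is the one from this report):

```shell
# Dry-run sketch: hypothetical device, commands echoed only.
DEV=/dev/mapper/example-vol

# xfs_db's frag command reports the file fragmentation factor;
# -r opens the device read-only, so it is safe on an unmounted volume:
echo "xfs_db -r -c frag $DEV"

# xfs_fsr defragments files on a mounted volume, an alternative way to
# keep fragmentation down without the filestreams allocator:
echo "xfs_fsr -v /XCEPH-PROD/data"
```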
Could it be related, as it's the first time I have encountered such
disastrous data loss?
I don't have many more relevant details, which makes this mail a poor
bug report ...
If it matters, I can still provide more details about the way those
kernels hung (ceph node reweights, stressing the hardware, lots of
I/O), details about the servers & fibre channel disks, and so on.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-22 14:14 Is kernel 3.6.1 or filestreams option toxic ? Yann Dupont
@ 2012-10-23 8:24 ` Yann Dupont
  2012-10-25 15:21   ` Yann Dupont
  0 siblings, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-23 8:24 UTC (permalink / raw)
To: xfs; +Cc: linux-kernel
On 22/10/2012 16:14, Yann Dupont wrote:
Hello. This mail is a follow-up to a message on the XFS mailing list. I
had a hang with 3.6.1, and then damage on an XFS filesystem.
3.6.1 is not alone. I tried 3.6.2 and had another hang, with quite a
different trace this time, so I'm not really sure the 2 problems are
related.
Anyway, the problem is maybe not XFS itself; it may just be a
consequence of what look more like kernel problems.
cc: to linux-kernel
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.991908] INFO: task ceph-osd:4409 blocked for more than 120 seconds.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.991954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.991999] ceph-osd D ffff88084c049030 0 4409 1 0x00000000
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992003] ffff88084c048d60 0000000000000086 ffff880a1421de78 ffff880a17caa820
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992054] ffff880a1421dfd8 ffff880a1421dfd8 ffff880a1421dfd8 ffff88084c048d60
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992105] 0000000003373001 ffff88084c048d60 ffff88051775cb20 ffffffffffffffff
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992156] Call Trace:
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992184] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992215] [<ffffffff812094a3>] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992248] [<ffffffff811b83e0>] ? cap_mmap_addr+0x50/0x50
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992275] [<ffffffff813c3cbc>] ? down_write+0x1c/0x1d
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992303] [<ffffffff810fcf74>] ? vm_mmap_pgoff+0x64/0xb0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992331] [<ffffffff8110d4cc>] ? sys_mmap_pgoff+0x5c/0x190
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992360] [<ffffffff811357f1>] ? do_sys_open+0x161/0x1e0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992387] [<ffffffff813c5ffd>] ? system_call_fastpath+0x1a/0x1f
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992423] INFO: task ceph-osd:25297 blocked for more than 120 seconds.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992495] ceph-osd D ffff8801bce7b1a0 0 25297 1 0x00000000
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992497] ffff8801bce7aed0 0000000000000086 ffff88025d903fd8 ffff880a17cab580
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992548] ffff88025d903fd8 ffff88025d903fd8 ffff88025d903fd8 ffff8801bce7aed0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992599] ffff8801bce7aed0 ffff8801bce7aed0 ffff88051775cb20 ffffffffffffffff
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992650] Call Trace:
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992673] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992702] [<ffffffff81209474>] ? call_rwsem_down_read_failed+0x14/0x30
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992732] [<ffffffff813c3c9e>] ? down_read+0xe/0x10
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992759] [<ffffffff8103129c>] ? do_page_fault+0x16c/0x460
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992787] [<ffffffff81305862>] ? release_sock+0xd2/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992815] [<ffffffff8137aceb>] ? inet_stream_connect+0x4b/0x70
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992844] [<ffffffff81302b55>] ? sys_connect+0xa5/0xe0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992871] [<ffffffff811343e3>] ? fd_install+0x33/0x70
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992898] [<ffffffff813c5a75>] ? page_fault+0x25/0x30
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992925] INFO: task ceph-osd:32469 blocked for more than 120 seconds.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992996] ceph-osd D ffff880556237b30 0 32469 1 0x00000000
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.992999] ffff880556237860 0000000000000086 ffff88059fe5dfd8 ffff880a17c742e0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993050] ffff88059fe5dfd8 ffff88059fe5dfd8 ffff88059fe5dfd8 ffff880556237860
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993101] ffff880556237860 ffff880556237860 ffff88051775cb20 ffffffffffffffff
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993153] Call Trace:
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993175] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993204] [<ffffffff81209474>] ? call_rwsem_down_read_failed+0x14/0x30
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993233] [<ffffffff813c3c9e>] ? down_read+0xe/0x10
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993259] [<ffffffff8103129c>] ? do_page_fault+0x16c/0x460
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993286] [<ffffffff81305862>] ? release_sock+0xd2/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993314] [<ffffffff8137aceb>] ? inet_stream_connect+0x4b/0x70
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.993342] [<ffffffff81302b55>] ? sys_connect+0xa5/0xe0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994484] [<ffffffff811343e3>] ? fd_install+0x33/0x70
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994510] [<ffffffff813c5a75>] ? page_fault+0x25/0x30
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994538] INFO: task ceph-osd:9660 blocked for more than 120 seconds.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994609] ceph-osd D ffff8801659f82d0 0 9660 1 0x00000000
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994612] ffff8801659f8000 0000000000000086 ffff88010f6bdfd8 ffff88084f0c9ac0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994662] ffff88010f6bdfd8 ffff88010f6bdfd8 ffff88010f6bdfd8 ffff8801659f8000
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994713] ffff8801659f8000 ffff8801659f8000 ffff88051775cb20 ffffffffffffffff
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994764] Call Trace:
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994786] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994815] [<ffffffff81209474>] ? call_rwsem_down_read_failed+0x14/0x30
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994844] [<ffffffff813c3c9e>] ? down_read+0xe/0x10
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994870] [<ffffffff8103129c>] ? do_page_fault+0x16c/0x460
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994898] [<ffffffff81305862>] ? release_sock+0xd2/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994925] [<ffffffff8137aceb>] ? inet_stream_connect+0x4b/0x70
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994953] [<ffffffff81302b55>] ? sys_connect+0xa5/0xe0
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.994980] [<ffffffff811343e3>] ? fd_install+0x33/0x70
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995006] [<ffffffff813c5a75>] ? page_fault+0x25/0x30
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995037] INFO: task grep:7014 blocked for more than 120 seconds.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995108] grep D ffff8800c3f69030 0 7014 7011 0x00000000
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995110] ffff8800c3f68d60 0000000000000082 0000000000000000 ffff880a17ca9410
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995161] ffff88002dd2ffd8 ffff88002dd2ffd8 ffff88002dd2ffd8 ffff8800c3f68d60
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995212] 0000000000000000 ffff8800c3f68d60 ffff88051775cb20 ffffffffffffffff
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995264] Call Trace:
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995286] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995428] [<ffffffff81191625>] ? proc_pid_cmdline+0xa5/0x130
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995456] [<ffffffff811922e0>] ? proc_info_read+0xb0/0x110
Oct 22 20:54:29 braeval.u14.univ-nantes.prive kernel: [629576.995484] [<ffffffff81136454>] ? vfs_read+0xa4/0x180
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.943923] INFO: task ceph-osd:4409 blocked for more than 120 seconds.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.943954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.943999] ceph-osd D ffff88084c049030 0 4409 1 0x00000000
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944003] ffff88084c048d60 0000000000000086 ffff880a1421de78 ffff880a17caa820
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944055] ffff880a1421dfd8 ffff880a1421dfd8 ffff880a1421dfd8 ffff88084c048d60
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944106] 0000000003373001 ffff88084c048d60 ffff88051775cb20 ffffffffffffffff
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944157] Call Trace:
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944185] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944216] [<ffffffff812094a3>] ? call_rwsem_down_write_failed+0x13/0x20
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944248] [<ffffffff811b83e0>] ? cap_mmap_addr+0x50/0x50
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944275] [<ffffffff813c3cbc>] ? down_write+0x1c/0x1d
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944303] [<ffffffff810fcf74>] ? vm_mmap_pgoff+0x64/0xb0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944330] [<ffffffff8110d4cc>] ? sys_mmap_pgoff+0x5c/0x190
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944358] [<ffffffff811357f1>] ? do_sys_open+0x161/0x1e0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944386] [<ffffffff813c5ffd>] ? system_call_fastpath+0x1a/0x1f
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944423] INFO: task ceph-osd:25297 blocked for more than 120 seconds.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944494] ceph-osd D ffff8801bce7b1a0 0 25297 1 0x00000000
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944496] ffff8801bce7aed0 0000000000000086 ffff88025d903fd8 ffff880a17cab580
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944548] ffff88025d903fd8 ffff88025d903fd8 ffff88025d903fd8 ffff8801bce7aed0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944599] ffff8801bce7aed0 ffff8801bce7aed0 ffff88051775cb20 ffffffffffffffff
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944650] Call Trace:
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944673] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944702] [<ffffffff81209474>] ? call_rwsem_down_read_failed+0x14/0x30
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944731] [<ffffffff813c3c9e>] ? down_read+0xe/0x10
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944758] [<ffffffff8103129c>] ? do_page_fault+0x16c/0x460
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944786] [<ffffffff81305862>] ? release_sock+0xd2/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944814] [<ffffffff8137aceb>] ? inet_stream_connect+0x4b/0x70
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944843] [<ffffffff81302b55>] ? sys_connect+0xa5/0xe0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944870] [<ffffffff811343e3>] ? fd_install+0x33/0x70
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944897] [<ffffffff813c5a75>] ? page_fault+0x25/0x30
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944923] INFO: task ceph-osd:12506 blocked for more than 120 seconds.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944951] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944994] ceph-osd D ffff8800227f7480 0 12506 1 0x00000000
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.944996] ffff8800227f71b0 0000000000000086 0000000000000000 ffff880a17cab580
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945048] ffff880468df1fd8 ffff880468df1fd8 ffff880468df1fd8 ffff8800227f71b0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945099] 0000000000000000 ffff8800227f71b0 ffff88051775cb20 ffffffffffffffff
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945150] Call Trace:
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945172] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945201] [<ffffffff81209474>] ? call_rwsem_down_read_failed+0x14/0x30
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945231] [<ffffffff813c3c9e>] ? down_read+0xe/0x10
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945257] [<ffffffff8103129c>] ? do_page_fault+0x16c/0x460
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945284] [<ffffffff81302fb7>] ? sys_recvfrom+0x107/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945311] [<ffffffff81302b55>] ? sys_connect+0xa5/0xe0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945339] [<ffffffff8100a465>] ? read_tsc+0x5/0x20
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945366] [<ffffffff810828cf>] ? ktime_get_ts+0x3f/0xe0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945394] [<ffffffff811489a4>] ? poll_select_set_timeout+0x64/0x80
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945422] [<ffffffff813c5a75>] ? page_fault+0x25/0x30
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945449] INFO: task ceph-osd:25459 blocked for more than 120 seconds.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945476] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945520] ceph-osd D ffff8803fc809d90 0 25459 1 0x00000000
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945522] ffff8803fc809ac0 0000000000000086 0000000000000000 ffff880a17c74990
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945573] ffff880468e25fd8 ffff880468e25fd8 ffff880468e25fd8 ffff8803fc809ac0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945624] 0000000000000000 ffff8803fc809ac0 ffff88051775cb20 ffffffffffffffff
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945675] Call Trace:
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945697] [<ffffffff813c52fd>] ? rwsem_down_failed_common+0xbd/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945726] [<ffffffff81209474>] ? call_rwsem_down_read_failed+0x14/0x30
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945755] [<ffffffff813c3c9e>] ? down_read+0xe/0x10
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945781] [<ffffffff8103129c>] ? do_page_fault+0x16c/0x460
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945808] [<ffffffff81302fb7>] ? sys_recvfrom+0x107/0x150
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945835] [<ffffffff81082892>] ? ktime_get_ts+0x2/0xe0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945862] [<ffffffff8100a465>] ? read_tsc+0x5/0x20
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945888] [<ffffffff810828cf>] ?
ktime_get_ts+0x3f/0xe0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945914] [<ffffffff811489a4>] ? poll_select_set_timeout+0x64/0x80
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945942] [<ffffffff813c5a75>] ? page_fault+0x25/0x30
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945969] INFO: task ceph-osd:32469 blocked for more than 120 seconds.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.945997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.946041] ceph-osd D ffff880556237b30 0 32469 1 0x00000000
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.946043] ffff880556237860 0000000000000086 ffff88059fe5dfd8 ffff880a17c742e0
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.946096] ffff88059fe5dfd8 ffff88059fe5dfd8 ffff88059fe5dfd8 ffff880556237860
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.946146] ffff880556237860 ffff880556237860 ffff88051775cb20 ffffffffffffffff
Oct 22 20:56:29 braeval.u14.univ-nantes.prive kernel: [629696.946198] Call Trace:
Well, at least after the hard reset the xfs volume was still good this time.
Old mail (sent to the xfs mailing list) for reference:
> Hello,
> Last week, I encountered problems with xfs volumes on several
> machines. The kernel hung under heavy load, and I had to hard-reset.
> After reboot, the xfs volume could not be mounted, and xfs_repair did
> not manage to recover the volume cleanly on 2 different machines.
>
> It wasn't production data, so it doesn't matter whether I recover the
> data or not. What matters more to me is understanding why things went
> wrong...
>
> I have been using XFS for a long time, on lots of data, and this is
> the first time I have encountered such a problem. However, I was using
> an unusual option, filestreams, and was running kernel 3.6.1, so I
> wonder whether it has something to do with the crash.
>
> [quoted kernel hung-task traces snipped]
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-23  8:24 ` Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?) Yann Dupont
@ 2012-10-25 15:21   ` Yann Dupont
  2012-10-25 20:55     ` Yann Dupont
  2012-10-25 21:10     ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Yann Dupont @ 2012-10-25 15:21 UTC (permalink / raw)
To: xfs

On 23/10/2012 10:24, Yann Dupont wrote:
> On 22/10/2012 16:14, Yann Dupont wrote:
>
> Hello. This mail is a follow-up to a message on the XFS mailing list. I
> had a hang with 3.6.1, and then damage on an XFS filesystem.
>
> 3.6.1 is not alone. I tried 3.6.2 and had another hang, with quite a
> different trace this time, so I'm not really sure the two problems are
> related.
> Anyway, the problem is maybe not XFS but just a consequence of what
> seems more like kernel problems.
>
> cc: to linux-kernel

Hello.
There is definitely something wrong in 3.6.xx with XFS, in particular
after an abrupt stop of the machine:

I now have corruption on a 3rd machine (not involved with ceph).
The machine was just rebooting from a 3.6.2 kernel to a 3.6.3 kernel.

This machine isn't under heavy load, but it's a machine we use for
tests & compilations. We often crash it. For 2 years we didn't have
problems. XFS was always reliable, even in hard conditions (hard
reset, loss of power, etc.)

This time, after the 3.6.3 boot, one of my XFS volumes refuses to mount:

mount: /dev/mapper/LocalDisk-debug--git: can't read superblock

[276596.189363] XFS (dm-1): Mounting Filesystem
[276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
[276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
[276596.711329] XFS (dm-1): log mount/recovery failed: error 5
[276596.711516] XFS (dm-1): log mount failed

I'm not even sure the reboot was after a crash or just a clean reboot.
(I'm not the only one to use this machine.) I have nothing suspect in
my remote syslog.
Anyway, it's the 3rd crashed XFS volume in a row with a 3.6 kernel.
Different machines, different contexts. Looks suspicious.

This time the crashed volume was handled by a PERC (mptsas) card. The
two other volumes previously reported were handled by an Emulex
LightPulse fibre channel card (lpfc), and this time the filestreams
option wasn't used.

xfs_repair -n seems to show the volume is quite broken:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
block (1,6197-6197) multiply claimed by bno space tree, state - 2
bad magic # 0x7f454c46 in btbno block 3/2320
expected level 0 got 513 in btbno block 3/2320
bad btree nrecs (256, min=255, max=510) in btbno block 3/2320
invalid start block 16793088 in record 0 of bno btree block 3/2320
invalid start block 0 in record 1 of bno btree block 3/2320
invalid start block 0 in record 2 of bno btree block 3/2320
invalid start block 2282029056 in record 3 of bno btree block 3/2320
invalid start block 0 in record 4 of bno btree block 3/2320
invalid length 218106368 in record 5 of bno btree block 3/2320
invalid start block 1684369509 in record 6 of bno btree block 3/2320
invalid start block 6909556 in record 7 of bno btree block 3/2320
invalid start block 1493202533 in record 8 of bno btree block 3/2320
invalid start block 1768111411 in record 9 of bno btree block 3/2320
invalid start block 761557865 in record 10 of bno btree block 3/2320
invalid start block 842084400 in record 11 of bno btree block 3/2320
...
bad magic # 0x41425442 in btcnt block 2/14832
bad btree nrecs (436, min=255, max=510) in btcnt block 2/14832
out-of-order cnt btree record 2 (188545 1) block 2/14832
out-of-order cnt btree record 3 (188650 1) block 2/14832
out-of-order cnt btree record 4 (188658 1) block 2/14832
out-of-order cnt btree record 8 (189021 1) block 2/14832
out-of-order cnt btree record 9 (189104 1) block 2/14832
out-of-order cnt btree record 10 (189127 2) block 2/14832
out-of-order cnt btree record 11 (189193 2) block 2/14832
out-of-order cnt btree record 12 (189259 2) block 2/14832
out-of-order cnt btree record 13 (189268 1) block 2/14832
out-of-order cnt btree record 14 (189307 1) block 2/14832
out-of-order cnt btree record 15 (189330 1) block 2/14832
out-of-order cnt btree record 16 (189379 1) block 2/14832
out-of-order cnt btree record 18 (189477 1) block 2/14832

I won't try to repair this volume right now.

This time, the volume is small enough to make an image (it's a 100 GB
LVM volume). I'll try to image it before doing anything else.

1st question: I saw there is ext4 corruption reported too with the 3.6
kernel, but as far as I can see that problem seems to be jbd-related,
so it shouldn't affect XFS?

2nd question: Am I the only one to see this? I saw problems reported
with 2.6.37, but here the kernel is 3.6.xx.

3rd question: If you suspect the problem may lie in XFS, what should I
supply to help debug the problem?

Not CC:ing the linux-kernel list right now, as I'm really not sure
where the problem is.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-25 15:21 ` Yann Dupont
@ 2012-10-25 20:55   ` Yann Dupont
  2012-10-25 21:10   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Yann Dupont @ 2012-10-25 20:55 UTC (permalink / raw)
To: xfs

On 25/10/2012 17:21, Yann Dupont wrote:
> Hello.
> There is definitively something wrong in 3.6.xx with XFS, in
> particular after an abrupt stop of the machine :
>
> I now have corruption on a 3rd machine (not involved with ceph).
> The machine was just rebooting from 3.6.2 kernel to 3.6.3 kernel.
>
> This machine isn't under heavy load, but it's a machine we use for
> tests & compilations. We often crash it. For 2 years, we didn't have
> problems. XFS always was reliable, even in hard conditions (hard
> reset, loss of power, etc)
>
> This time, after 3.6.3 boot, one of my xfs volume refuse to mount :
>
> mount: /dev/mapper/LocalDisk-debug--git: can't read superblock
>
> [276596.189363] XFS (dm-1): Mounting Filesystem
> [276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
> [276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
> [276596.711329] XFS (dm-1): log mount/recovery failed: error 5
> [276596.711516] XFS (dm-1): log mount failed
>

Just found something interesting: I was rebooting to 3.4.15 to make a
backup of this volume. As I said in the previous message, I didn't run
xfs_repair on it. Before rebooting, I forgot to edit fstab to prevent
the mount. To my surprise, under 3.4.15 the volume mounts like a
charm!

[   37.958374] XFS (dm-1): Mounting Filesystem
[   38.050374] XFS (dm-1): Starting recovery (logdev: internal)
[   69.596892] XFS (dm-1): Ending recovery (logdev: internal)

As far as I can say, there is no corruption, no problems, all my files
are here!

So far, here is the scenario:

1) You have to hard reset your machine with 3.6 (maybe the kernel
version isn't important here).
As I encountered other 3.6 bugs (exit_mm and rwsem_down_failed_common),
I had to do that. So XFS is not clean.
2) Boot with 3.6.xx. Mounting the volume fails, because log replay
fails for an unknown reason.
3) You think your FS is broken, so you start an xfs_repair, which is
somehow fooled and definitively breaks your filesystem.

I hope it's reproducible. Will try tomorrow morning.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-25 15:21 ` Yann Dupont
  2012-10-25 20:55 ` Yann Dupont
@ 2012-10-25 21:10   ` Dave Chinner
  2012-10-26 10:03     ` Yann Dupont
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-10-25 21:10 UTC (permalink / raw)
To: Yann Dupont; +Cc: xfs

On Thu, Oct 25, 2012 at 05:21:35PM +0200, Yann Dupont wrote:
> On 23/10/2012 10:24, Yann Dupont wrote:
> > On 22/10/2012 16:14, Yann Dupont wrote:
> >
> > Hello. This mail is a follow up of a message on XFS mailing list.
> > I had hang with 3.6.1, and then, damage on XFS filesystem.
> >
> > 3.6.1 is not alone. Tried 3.6.2, and had another hang with quite a
> > different trace this time, so not really sure the 2 problems are
> > related.
> > Anyway the problem is maybe not XFS, but is just a consequence of
> > what seems more like kernel problems.
> >
> > cc: to linux-kernel
> Hello.
> There is definitively something wrong in 3.6.xx with XFS, in
> particular after an abrupt stop of the machine :
>
> I now have corruption on a 3rd machine (not involved with ceph).
> The machine was just rebooting from 3.6.2 kernel to 3.6.3 kernel.
>
> This machine isn't under heavy load, but it's a machine we use for
> tests & compilations. We often crash it. For 2 years, we didn't have
> problems. XFS always was reliable, even in hard conditions (hard
> reset, loss of power, etc)
>
> This time, after 3.6.3 boot, one of my xfs volume refuse to mount :
>
> mount: /dev/mapper/LocalDisk-debug--git: can't read superblock
>
> [276596.189363] XFS (dm-1): Mounting Filesystem
> [276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
> [276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
> [276596.711329] XFS (dm-1): log mount/recovery failed: error 5
> [276596.711516] XFS (dm-1): log mount failed

That's an indication that zeros are being read from the journal
rather than valid transaction data.
It may well be caused by an XFS bug, but from experience it is
equally likely to be a lower layer storage problem. More information
is needed.

Firstly:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Secondly, is the system still in this state? If so, dump the log to a
file using xfs_logprint, zip it up and send it to me so I can have a
look at whether the log is intact (i.e. likely xfs bug) or contains
zeros (likely storage bug).

If the system is not still in this state, then I'm afraid there's
nothing that can be done to understand the problem.

> I'm not even sure the reboot was after a crash or just a clean
> reboot. (I'm not the only one to use this machine). I have nothing
> suspect on my remote syslog.
>
> Anyway, it's the 3rd XFS crashed volume in a row with 3.6 kernel.
> Different machines, different contexts. Looks suspicious.

You've had two machines crash with problems in the mm subsystem, and
one filesystem problem that might be hardware related. Bit early to
be blaming XFS for all your problems, I think....

> xfs_repair -n seems to show volume is quite broken :

Sure, if the log hasn't been replayed then it will be - the
filesystem will only be consistent after log recovery has been run.

> I won't try to repair this volume right now.
>
> This time, volume is small enough to make an image (it's a 100 GB
> lvm volume). I'll try to image it before making anything else.
>
> 1st question : I saw there is ext4 corruption reported too with 3.6
> kernel, but as far as I can see, problem seems to be jbd related, so
> it shouldn't affect xfs ?

No relationship at all.

> 2nd question : Am I the only one to see this ?? I saw problems
> reported with 2.6.37, but here, the kernel is 3.6.xx

Yes, you're the only one to report such problems on 3.6. Anything
reported on 2.6.37 is likely to be completely unrelated.
> 3rd question : If you suspect the problem may be lying in XFS , what
> should I supply to help debugging the problem ?

See above.

> Not CC:ing linux kernel list right now, as I'm really not sure where
> the problem is right now.

You should report the mm problems to linux-mm@kvack.org to make sure
the right people see them and they don't get lost in the noise of
lkml....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
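[Editor's illustration: Dave's intact-log-versus-zeros test can be sketched mechanically. The XFS log is written in 512-byte sectors, and log record headers begin with the magic number 0xfeedbabe (stored big-endian); a raw log dump whose sectors are entirely zeroed points at lost writes in the storage stack rather than an XFS bug. A rough, hypothetical helper to classify the sectors of such a dump — this is not part of xfsprogs:]

```python
# Sketch: classify 512-byte sectors of a raw log dump as "magic"
# (starts with the XFS log record header magic 0xfeedbabe), "zero"
# (entirely zeroed - suggests lost writes), or "data" (anything else).
# Illustrative helper only, not an xfsprogs tool.

XLOG_HEADER_MAGIC = 0xFEEDBABE
SECTOR = 512

def classify_sectors(raw: bytes) -> dict:
    counts = {"magic": 0, "zero": 0, "data": 0}
    for off in range(0, len(raw), SECTOR):
        sec = raw[off:off + SECTOR]
        if int.from_bytes(sec[:4], "big") == XLOG_HEADER_MAGIC:
            counts["magic"] += 1
        elif not any(sec):
            counts["zero"] += 1
        else:
            counts["data"] += 1
    return counts

# Example on synthetic data: one header sector, one zeroed sector.
fake = (0xFEEDBABE).to_bytes(4, "big") + bytes(508) + bytes(512)
print(classify_sectors(fake))
```

A dump dominated by "zero" sectors in the region recovery complains about would support Dave's "storage bug" hypothesis; intact headers and data there would point back at XFS.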
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-25 21:10 ` Dave Chinner
@ 2012-10-26 10:03   ` Yann Dupont
  2012-10-26 22:05     ` Yann Dupont
  0 siblings, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-26 10:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 25/10/2012 23:10, Dave Chinner wrote:
> > This time, after 3.6.3 boot, one of my xfs volume refuse to mount :
> >
> > mount: /dev/mapper/LocalDisk-debug--git: can't read superblock
> >
> > [276596.189363] XFS (dm-1): Mounting Filesystem
> > [276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
> > [276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
> > [276596.711329] XFS (dm-1): log mount/recovery failed: error 5
> > [276596.711516] XFS (dm-1): log mount failed
> That's an indication that zeros are being read from the journal
> rather than valid transaction data. It may well be caused by an XFS
> bug, but from experience it is equally likely to be a lower layer
> storage problem. More information is needed.

Hello Dave, did you see the next mail? The fact is that with 3.4.15,
the journal is OK, and the data is, in fact, intact.

> Firstly:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

OK, sorry I missed it: here is the information. Not sure all of it is
relevant; anyway, here we go. Each time I will distinguish between the
first reported crashes (nodes of ceph) and the last one, as the setup
is quite different.

--------
kernel version (uname -a): 3.6.1 then 3.6.2, vanilla, hand compiled,
no proprietary modules.
Not running it at the moment, so I can't give you the exact uname -a.

------------
xfs_repair version 3.1.7 on the third machine, xfs_repair version
3.1.4 on the two first machines (part of ceph).

-----------
cpu: the same for the 3 machines: Dell PowerEdge M610, 2x Intel(R)
Xeon(R) CPU E5649 @ 2.53GHz, hyper-threading activated (12 physical
cores, 24 virtual cores).

-------------
meminfo, for example on the 3rd machine:

MemTotal:       41198292 kB
MemFree:        28623116 kB
Buffers:            1056 kB
Cached:         10392452 kB
SwapCached:            0 kB
Active:           180528 kB
Inactive:       10227416 kB
Active(anon):      17476 kB
Inactive(anon):      180 kB
Active(file):     163052 kB
Inactive(file): 10227236 kB
Unevictable:        3744 kB
Mlocked:            3744 kB
SwapTotal:        506040 kB
SwapFree:         506040 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         18228 kB
Mapped:            12688 kB
Shmem:               300 kB
Slab:            1408204 kB
SReclaimable:    1281008 kB
SUnreclaim:       127196 kB
KernelStack:        1976 kB
PageTables:         2736 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    21105184 kB
Committed_AS:     136080 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      398608 kB
VmallocChunk:   34337979376 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7652 kB
DirectMap2M:     2076672 kB
DirectMap1G:    39845888 kB

----
/proc/mounts:

root@label5:~# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=20592788k,nr_inodes=5148197,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=4119832k,mode=755 0 0
/dev/mapper/LocalDisk-root / xfs rw,relatime,attr2,noquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
/dev/sda1 /boot ext2 rw,relatime,errors=continue 0 0
** /dev/mapper/LocalDisk-debug--git /mnt/debug-git xfs rw,relatime,attr2,noquota 0 0
   ** this one was the failing volume on 3.6.xx
configfs /sys/kernel/config configfs rw,relatime 0 0
ocfs2_dlmfs /dlm ocfs2_dlmfs rw,relatime 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0

This volume is on a RAID1 local disk.

On one of the first 2 nodes:

root@hanyu:~# cat /proc/mounts
rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=20592652k,nr_inodes=5148163,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/37dd603c-168c-49de-830d-ef1b5c6982f8 / xfs rw,relatime,attr2,noquota 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/sdk1 /boot ext2 rw,relatime,errors=continue 0 0
none /var/local/cgroup cgroup rw,relatime,net_cls,freezer,devices,memory,cpuacct,cpu,debug,cpuset 0 0
** /dev/mapper/xceph--hanyu-data /XCEPH-PROD/data xfs rw,noatime,attr2,filestreams,nobarrier,inode64,logbsize=256k,noquota 0 0
   ** this one was the failed volume
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0

Please note that on this server, nobarrier is used because the volume
is on a battery-backed fibre channel raid array.
--------------
/proc/partitions (quite complicated on the ceph node):

root@hanyu:~# cat /proc/partitions
major minor  #blocks  name

  11        0     1048575 sr0
   8       32  6656000000 sdc
   8       48  5063483392 sdd
   8       64  6656000000 sde
   8       80  5063483392 sdf
   8       96  6656000000 sdg
   8      112  5063483392 sdh
   8      128  6656000000 sdi
   8      144  5063483392 sdj
   8      160   292421632 sdk
   8      161      273073 sdk1
   8      162      530145 sdk2
   8      163     2369587 sdk3
   8      164   289242292 sdk4
 254        0  6656000000 dm-0
 254        1  5063483392 dm-1
 254        2     5242880 dm-2
 254        3 11676106752 dm-3

Please note that we use multipath here. 4 paths for the LUN:

root@hanyu:~# multipath -ll
mpath2 (3600d02310006674500000001414d677d) dm-1 IFT,S16F-R1840-4
size=4.7T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:96 sdf 8:80  active ready running
| `- 6:0:1:96 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
  |- 0:0:0:96 sdd 8:48  active ready running
  `- 6:0:0:96 sdh 8:112 active ready running
mpath1 (3600d02310006674500000000414d677d) dm-0 IFT,S16F-R1840-4
size=6.2T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:32 sde 8:64  active ready running
| `- 6:0:1:32 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
  |- 0:0:0:32 sdc 8:32  active ready running
  `- 6:0:0:32 sdg 8:96  active ready running

On the 3rd machine, the setup is quite simpler:

root@label5:~# cat /proc/partitions
major minor  #blocks  name

   8        0   292421632 sda
   8        1      257008 sda1
   8        2      506047 sda2
   8        3     1261102 sda3
   8        4   140705302 sda4
 254        0     2609152 dm-0
 254        1   104857600 dm-1
 254        2    31457280 dm-2

--------------
raid layout:
On the first 2 machines (part of the ceph cluster), the data is on
RAID5 on a fibre channel raid array, accessed by an Emulex fibre
channel card (LightPulse, lpfc).
On the 3rd, data is on RAID1 accessed by a Dell PERC (LSI Logic /
Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08), driver
mptsas).

--------------
LVM config:

root@hanyu:~# vgs
VG          #PV #LV #SN Attr   VSize   VFree
LocalDisk     1   1   0 wz--n- 275,84g 270,84g
xceph-hanyu   2   1   0 wz--n-  10,91t  41,36g
root@hanyu:~# lvs
LV   VG          Attr   LSize  Origin Snap%  Move Log Copy%  Convert
log  LocalDisk   -wi-a-  5,00g
data xceph-hanyu -wi-ao 10,87t

and

root@label5:~# vgs
VG        #PV #LV #SN Attr   VSize   VFree
LocalDisk   1   3   0 wz--n- 134,18g 1,70g
root@label5:~# lvs
LV        VG        Attr   LSize   Origin Snap%  Move Log Copy%  Convert
1         LocalDisk -wi-a-  30,00g
debug-git LocalDisk -wi-ao 100,00g
root      LocalDisk -wi-ao   2,49g
root@label5:~#

-------------------
type of disks:
On the raid array, I'd say not very important (SEAGATE ST32000444SS
nearline SAS 2 TB).
On the 3rd machine: TOSHIBA MBF2300RC DA06.

---------------------
write cache status:
On the raid array, write cache is activated globally for the raid
array BUT is explicitly disabled on the drives.
On the 3rd machine, it is disabled as far as I know.

-------------------
Size of BBWC: 2 or 4 GB on the raid arrays. None on the 3rd.

------------------
xfs_info:

root@hanyu:~# xfs_info /dev/xceph-hanyu/data
meta-data=/dev/mapper/xceph--hanyu-data isize=256    agcount=11, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2919026688, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

(no sunit or swidth on this one)

root@label5:~# xfs_info /dev/LocalDisk/debug-git
meta-data=/dev/mapper/LocalDisk-debug--git isize=256    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

-----
dmesg: you already have the information. For iostat, etc., I need to
try to reproduce the load.

> Secondly, is the system still in this state? If so, dump the log to
No.
The first 2 nodes have been xfs_repaired. One completed, and it was a
terrible mess. The second had xfs_repair segfaulting. Will try with a
newer xfs_repair on a 3.4 kernel. The 3rd one is now OK, after booting
on the 3.4 kernel.

> a file using xfs_logprint, zip it up and send it to me so I can have
> a look at where the log is intact (i.e. likely xfs bug) or contains
> zero (likely storage bug).
>
> If the system is not still in this state, then I'm afraid there's
> nothing that can be done to understand the problem.

I'll try to reproduce a similar problem.

> You've had two machines crash with problems in the mm subsystem, and
> one filesystem problem that might be hardware related. Bit early to
> be blaming XFS for all your problems, I think....

I don't try to blame XFS. I'm very confident in it, and have been for
a long time. BUT I see a very different behaviour in those 3 cases.
Nothing conclusive yet. I think the problem is related to kernel 3.6,
maybe in the dm layer. I don't think it's hardware related: different
disks, different controllers, different machines. The common points
are:

- XFS
- Kernel 3.6.xx
- Device Mapper + LVM

> > xfs_repair -n seems to show volume is quite broken :
> Sure, if the log hasn't been replayed then it will be - the
> filesystem will only be consistent after log recovery has been run.

Yes, but I had to use xfs_repair -L in the past (power outage,
hardware failures) and never had such disastrous repairs. At least for
the first 2 failures I can understand: there is lots of data, the
journal is BIG, and the I/O transactions in flight are quite high. For
the 3rd failure I'm very sceptical: low I/O load, small volume.

> You should report the mm problems to linux-mm@kvack.org to make sure
> the right people see them and they don't get lost in the noise of
> lkml....

Yes, point taken. I'll try now to reproduce this kind of behaviour on
a very small volume (10 GB for example) so I can confirm or refute the
given scenario.
Thanks for your time,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-26 10:03 ` Yann Dupont
@ 2012-10-26 22:05   ` Yann Dupont
  2012-10-28 23:48     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-26 22:05 UTC (permalink / raw)
To: xfs

On 26/10/2012 12:03, Yann Dupont wrote:
> On 25/10/2012 23:10, Dave Chinner wrote:
>
> I'll try now to reproduce this kind of behaviour on a very small
> volume (10 GB for example) so I can confirm or refute the given
> scenario.
>

This is reproducible. Here is how to do it:

- Start a 3.6.2 kernel.
- Create a fresh 20 GB LVM volume on local disk.
- mkfs.xfs on it, with default options.
- Mount it with default options.
- Launch something that hammers this volume. I launched compilebench
  0.6 on it.
- Wait some time to fill memory and buffers, and be sure your disks
  are really busy. I waited some minutes after the initial 30-kernel
  unpacking in compilebench.
- Hard reset the server (I'm using the iDRAC of the server to
  generate a power cycle).
- After some tries, I finally got the impossibility to mount the XFS
  volume, with the error reported in previous mails. So far this is
  normal.

xfs_logprint doesn't say much:

xfs_logprint:
    data device: 0xfe02
    log device: 0xfe02 daddr: 10485792 length: 20480

Header 0x7c wanted 0xfeedbabe
**********************************************************************
* ERROR: header cycle=124        block=5414                          *
**********************************************************************

I tried xfs_logprint -c; it gave a 22 MB file. You can grab it here:
http://filex.univ-nantes.fr/get?k=QnBXivz2J3LmzJ18uBV

- Rebooted to 3.4.15.
- xfs_logprint gives the exact same result as with 3.6.2 (diff shows
  no differences), but on 3.4.15 I can mount the volume without
  problem; the log is replayed.
For information, here is the xfs_info output of the volume:

root@label5:/mnt/debug# xfs_info /mnt/tempo
meta-data=/dev/mapper/LocalDisk-crashdisk isize=256    agcount=8, agsize=655360 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Does this help you?

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
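[Editor's illustration: the `Header 0x7c wanted 0xfeedbabe` message above means xfs_logprint read the first word of a log sector expecting the log record header magic (0xfeedbabe) and found the value 0x7c instead, which it reports as `cycle=124` in the error banner. A small, hypothetical sketch of that check — not actual xfsprogs code:]

```python
# Sketch of the check behind "Header 0x7c wanted 0xfeedbabe":
# xfs_logprint expects a log record header to start with the magic
# 0xfeedbabe; any other first word is interpreted as a cycle number.
# Illustrative only - not an xfsprogs API.

XLOG_HEADER_MAGIC = 0xFEEDBABE

def describe_first_word(word: int) -> str:
    if word == XLOG_HEADER_MAGIC:
        return "log record header"
    return f"cycle={word}"

print(describe_first_word(0x7C))        # the value from the error above
print(describe_first_word(0xFEEDBABE))
```

Note that 0x7c is 124 decimal, which is exactly the `cycle=124` in the banner.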
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-26 22:05 ` Yann Dupont
@ 2012-10-28 23:48   ` Dave Chinner
  2012-10-29  1:25     ` Dave Chinner
  2012-10-29  8:07     ` Yann Dupont
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2012-10-28 23:48 UTC (permalink / raw)
To: Yann Dupont; +Cc: xfs

On Sat, Oct 27, 2012 at 12:05:34AM +0200, Yann Dupont wrote:
> On 26/10/2012 12:03, Yann Dupont wrote:
> > On 25/10/2012 23:10, Dave Chinner wrote:
> >
> > I'll try now to reproduce this kind of behaviour on a verry little
> > volume (10 GB for exemple) so I can confirm or inform the given
> > scenario .
> >
>
> This is reproductible. Here is how to do it :
>
> - Started a 3.6.2 kernel.
>
> - I created a fresh lvm volume on localdisk of 20 GB.

Can you reproduce the problem without LVM?

> - mkfs.xfs on it, with default options
> - mounted with default options
> - launch something that hammers this volume. I launched compilebench
> 0.6 on it
> - wait some time to fill memory,buffers, and be sure your disks are
> really busy. I waited some minutes after the initial 30 kernel
> unpacking in compilebench
> - hard reset the server (I'm using the Idrac of the server to
> generate a power cycle)
> - After some try, I finally had the impossibility to mount the xfs
> volume, with the error reported in previous mails. So far this is
> normal .

So it doesn't happen every time, and it may be power cycle related.
What is your "local disk"?

> xfs_logprint don't say much :
>
> xfs_logprint:
> data device: 0xfe02
> log device: 0xfe02 daddr: 10485792 length: 20480
>
> Header 0x7c wanted 0xfeedbabe
> **********************************************************************
> * ERROR: header cycle=124        block=5414                          *
> **********************************************************************

You didn't look past the initial error, did you? The file is only
482280 lines long, and 482200 lines of that are decoded log data....
:)

> I tried xfs_logprint -c , it gaves a 22M file. You can grab it here :
> http://filex.univ-nantes.fr/get?k=QnBXivz2J3LmzJ18uBV

I really need the raw log data, not the parsed output. The logprint
command to do that is "-C <file>", not "-c".

> - Rebooted 3.4.15
> - xfs_logprint gives the exact same result that with 3.6.2 (diff
> tells no differences)

Given that it's generated by the logprint application, I'd expect it
to be identical.

> but on 3.4.15, I can mount the volume without problem, log is
> replayed.
> for information here is xfs_info of the volume :
>
> root@label5:/mnt/debug# xfs_info /mnt/tempo
> meta-data=/dev/mapper/LocalDisk-crashdisk isize=256 agcount=8,
> agsize=655360 blks

How did you get a default of 8 AGs? That seems wrong. What version of
mkfs.xfs are you using?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-28 23:48 ` Dave Chinner
@ 2012-10-29  1:25   ` Dave Chinner
  2012-10-29  8:11     ` Yann Dupont
  2012-10-29 12:18     ` Dave Chinner
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2012-10-29 1:25 UTC (permalink / raw)
To: Yann Dupont; +Cc: xfs

On Mon, Oct 29, 2012 at 10:48:02AM +1100, Dave Chinner wrote:
> On Sat, Oct 27, 2012 at 12:05:34AM +0200, Yann Dupont wrote:
> > On 26/10/2012 12:03, Yann Dupont wrote:
> > > On 25/10/2012 23:10, Dave Chinner wrote:
> > >
> > > I'll try now to reproduce this kind of behaviour on a verry little
> > > volume (10 GB for exemple) so I can confirm or inform the given
> > > scenario .
> > >
> >
> > This is reproductible. Here is how to do it :
> >
> > - Started a 3.6.2 kernel.
> >
> > - I created a fresh lvm volume on localdisk of 20 GB.
>
> Can you reproduce the problem without LVM?
>
> > - mkfs.xfs on it, with default options
> > - mounted with default options
> > - launch something that hammers this volume. I launched compilebench
> > 0.6 on it
> > - wait some time to fill memory,buffers, and be sure your disks are
> > really busy. I waited some minutes after the initial 30 kernel
> > unpacking in compilebench
> > - hard reset the server (I'm using the Idrac of the server to
> > generate a power cycle)
> > - After some try, I finally had the impossibility to mount the xfs
> > volume, with the error reported in previous mails. So far this is
> > normal .
>
> So it doesn't happen every time, and it may be power cycle related.
> What is your "local disk"?

I can't reproduce this with a similar setup but using KVM (i.e.
killing the VM instead of power cycling) or forcing a shutdown of the
filesystem without flushing the log. The second case is very much the
same as power cycling, but without the potential "power failure caused
partial IOs to be written" problem.
The only thing I can see in the logprint that I haven't seen so far
in my testing is that your log print indicates a checkpoint that
wraps the end of the log. I haven't yet hit that situation by chance,
so I'll keep trying to see if that's the case that is causing the
problem....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread
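[Editor's illustration: the "checkpoint that wraps the end of the log" can be pictured with a toy model. The XFS log is circular, so a checkpoint written near the physical end of the log continues at block 0. A hypothetical helper, with block counts borrowed from the 2560-block log in the reproducer's xfs_info output earlier in the thread:]

```python
# Toy model of the circular XFS log: a checkpoint of `length` blocks
# starting at block `start` may wrap past the physical end of the log
# and continue at block 0. Hypothetical helper, for illustration only.

def log_write_ranges(start: int, length: int, log_blocks: int):
    """Physical (start, end-exclusive) block ranges touched by a write."""
    assert 0 <= start < log_blocks and 0 < length <= log_blocks
    end = start + length
    if end <= log_blocks:
        return [(start, end)]                                # no wrap
    return [(start, log_blocks), (0, end - log_blocks)]      # wrapped

# A 2560-block log: a 100-block checkpoint starting at block 2500
# spans the physical end and wraps back to block 0.
print(log_write_ranges(2500, 100, 2560))
```

Recovery has to stitch the two physical ranges back together, which is why a wrapping checkpoint is a plausible special case for a replay bug.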
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-29  1:25 ` Dave Chinner
@ 2012-10-29  8:11 ` Yann Dupont
  2012-10-29 12:21   ` Dave Chinner
  2012-10-29 12:18 ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-29 8:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 29/10/2012 02:25, Dave Chinner wrote:
> I can't reproduce this with a similar setup but using KVM (i.e.
> killing the VM instead of power cycling) or forcing a shutdown of the
> filesystem without flushing the log. The second case is very much the
> same as power cycling, but without the potential "power failure caused
> partial IOs to be written" problem. The only thing I can see in the
> logprint that I haven't seen so far in my testing is that your log
> print indicates a checkpoint that wraps the end of the log. I haven't
> yet hit that situation by chance, so I'll keep trying to see if that's
> the case that is causing the problem....
>
> Cheers,
>
> Dave.

OK, was your KVM guest LVM-enabled?

I'll try to re-crash the FS; this time I'll make an image of it on
another machine for further testing. And I'll supply a useful
logprint.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-29  8:11 ` Yann Dupont
@ 2012-10-29 12:21 ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2012-10-29 12:21 UTC (permalink / raw)
To: Yann Dupont; +Cc: xfs

On Mon, Oct 29, 2012 at 09:11:26AM +0100, Yann Dupont wrote:
> On 29/10/2012 02:25, Dave Chinner wrote:
> > I can't reproduce this with a similar setup but using KVM (i.e.
> > killing the VM instead of power cycling) or forcing a shutdown of
> > the filesystem without flushing the log. The second case is very
> > much the same as power cycling, but without the potential "power
> > failure caused partial IOs to be written" problem. The only thing
> > I can see in the logprint that I haven't seen so far in my testing
> > is that your log print indicates a checkpoint that wraps the end
> > of the log. I haven't yet hit that situation by chance, so I'll
> > keep trying to see if that's the case that is causing the
> > problem.... Cheers, Dave.
>
> OK, was your KVM guest LVM-enabled?

No. The idea being that if it is an XFS problem, then it will show up
without needing LVM. And it did.

> I'll try to re-crash the FS; this time I'll make an image of it on
> another machine for further testing. And I'll supply a useful
> logprint.

No need, I have a simple local reproducer now based on your example. I
should be able to find the problem from here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-29  1:25 ` Dave Chinner
  2012-10-29  8:11 ` Yann Dupont
@ 2012-10-29 12:18 ` Dave Chinner
  2012-10-29 12:43   ` Yann Dupont
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-10-29 12:18 UTC (permalink / raw)
To: Yann Dupont; +Cc: xfs

On Mon, Oct 29, 2012 at 12:25:40PM +1100, Dave Chinner wrote:
> On Mon, Oct 29, 2012 at 10:48:02AM +1100, Dave Chinner wrote:
> > On Sat, Oct 27, 2012 at 12:05:34AM +0200, Yann Dupont wrote:
> > > On 26/10/2012 12:03, Yann Dupont wrote:
> > > > On 25/10/2012 23:10, Dave Chinner wrote:
> > > - mkfs.xfs on it, with default options
> > > - mounted with default options
> > > - launched something that hammers this volume. I launched
> > >   compilebench 0.6 on it
> > > - waited some time to fill memory/buffers and to be sure the
> > >   disks were really busy. I waited some minutes after the
> > >   initial 30 kernel unpacks in compilebench
> > > - hard reset the server (I'm using the iDRAC of the server to
> > >   generate a power cycle)
> > > - After a few tries, I finally could not mount the xfs volume,
> > >   with the error reported in previous mails. So far this is
> > >   normal.
> >
> > So it doesn't happen every time, and it may be power cycle related.
> > What is your "local disk"?
>
> I can't reproduce this with a similar setup but using KVM (i.e.
> killing the VM instead of power cycling) or forcing a shutdown of
> the filesystem without flushing the log. The second case is very
> much the same as power cycling, but without the potential "power
> failure caused partial IOs to be written" problem.
>
> The only thing I can see in the logprint that I haven't seen so far
> in my testing is that your log print indicates a checkpoint that
> wraps the end of the log. I haven't yet hit that situation by
> chance, so I'll keep trying to see if that's the case that is
> causing the problem....

Well, it's taken about 12 hours of random variation of parameters
in the loop of:

	mkfs.xfs -f /dev/vdb
	mount /dev/vdb /mnt/scratch
	./compilebench -D /mnt/scratch &
	sleep <some period>
	/home/dave/src/xfstests-dev/src/godown /mnt/scratch
	sleep 5
	umount /mnt/scratch
	xfs_logprint -d /dev/vdb

to get a log with a wrapped checkpoint to occur. That was with <some
period> equal to 36s. In all that time, I hadn't seen a single log
mount failure, and the moment I get a wrapped log:

 1917 HEADER Cycle 10 tail 9:018456 len  32256 ops 468
 1981 HEADER Cycle 10 tail 9:018456 len  32256 ops 427
      ^^^^^^^^^^^^^^^
[00000 - 02045] Cycle 0x0000000a New Cycle 0x00000009

[  368.364232] XFS (vdb): Mounting Filesystem
[  369.096144] XFS (vdb): Starting recovery (logdev: internal)
[  369.126545] XFS (vdb): xlog_recover_process_data: bad clientid 0x2c
[  369.129522] XFS (vdb): log mount/recovery failed: error 5
[  369.131884] XFS (vdb): log mount failed

Ok, so no LVM, no power failure involved, etc. Dig deeper. Let's see
if logprint can dump the transactional record of the log:

# xfs_logprint -f log.img -t
.....
LOG REC AT LSN cycle 9 block 20312 (0x9, 0x4f58)

LOG REC AT LSN cycle 9 block 20376 (0x9, 0x4f98)
xfs_logprint: failed in xfs_do_recovery_pass, error: 12288
#

Ok, xfs_logprint failed to decode the wrapped transaction at the end
of the log. I can't see anything obviously wrong with the contents
of the log off the top of my head (logprint is notoriously buggy),
but the above command can reproduce the problem (3 out of 3 so far),
so I should be able to track down the bug from this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
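[Editor's note: the "wrapped checkpoint" Dave finally hit is a generic circular-log situation: a record whose block range crosses the physical end of the log must be assembled from two reads, first the tail of the device, then the remainder starting at block 0. A minimal sketch of that split read, with a hypothetical helper name rather than the actual XFS code:]

```python
def read_wrapped(log_blocks, start_blk, nblks):
    """Read nblks from a circular log starting at start_blk.

    If the record wraps past the physical end of the log, assemble it
    from two reads: the tail of the log device, then the remainder
    starting at block 0 -- the same split that log recovery performs.
    """
    log_size = len(log_blocks)
    if start_blk + nblks <= log_size:
        # Common case: the whole record fits before the end of the log.
        return log_blocks[start_blk:start_blk + nblks]
    split = log_size - start_blk          # blocks before the wrap point
    return log_blocks[start_blk:] + log_blocks[:nblks - split]

# A 16-block "log" whose block contents equal their block numbers.
log = list(range(16))
# A 6-block record starting at block 13 wraps: blocks 13,14,15 then 0,1,2.
assert read_wrapped(log, 13, 6) == [13, 14, 15, 0, 1, 2]
```

The rarity of the failure in Dave's loop matches this model: only when the tail of the active checkpoint happens to land across the wrap point does the two-read path get exercised at all.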
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-29 12:18 ` Dave Chinner
@ 2012-10-29 12:43 ` Yann Dupont
  2012-10-30  1:33   ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-29 12:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 29/10/2012 13:18, Dave Chinner wrote:
> On Mon, Oct 29, 2012 at 12:25:40PM +1100, Dave Chinner wrote:
> > On Mon, Oct 29, 2012 at 10:48:02AM +1100, Dave Chinner wrote:
> > > On Sat, Oct 27, 2012 at 12:05:34AM +0200, Yann Dupont wrote:
> > > > On 26/10/2012 12:03, Yann Dupont wrote:
> > > > > On 25/10/2012 23:10, Dave Chinner wrote:
> > > > - mkfs.xfs on it, with default options
> > > > - mounted with default options
> > > > - launched something that hammers this volume. I launched
> > > >   compilebench 0.6 on it
> > > > - waited some time to fill memory/buffers and to be sure the
> > > >   disks were really busy. I waited some minutes after the
> > > >   initial 30 kernel unpacks in compilebench
> > > > - hard reset the server (I'm using the iDRAC of the server to
> > > >   generate a power cycle)
> > > > - After a few tries, I finally could not mount the xfs volume,
> > > >   with the error reported in previous mails. So far this is
> > > >   normal.
> > > So it doesn't happen every time, and it may be power cycle
> > > related. What is your "local disk"?
> > I can't reproduce this with a similar setup but using KVM (i.e.
> > killing the VM instead of power cycling) or forcing a shutdown of
> > the filesystem without flushing the log. The second case is very
> > much the same as power cycling, but without the potential "power
> > failure caused partial IOs to be written" problem.
> >
> > The only thing I can see in the logprint that I haven't seen so
> > far in my testing is that your log print indicates a checkpoint
> > that wraps the end of the log. I haven't yet hit that situation
> > by chance, so I'll keep trying to see if that's the case that is
> > causing the problem....
> Well, it's taken about 12 hours of random variation of parameters
> in the loop of:
>
>	mkfs.xfs -f /dev/vdb
>	mount /dev/vdb /mnt/scratch
>	./compilebench -D /mnt/scratch &
>	sleep <some period>
>	/home/dave/src/xfstests-dev/src/godown /mnt/scratch
>	sleep 5
>	umount /mnt/scratch
>	xfs_logprint -d /dev/vdb
>
> to get a log with a wrapped checkpoint to occur. That was with <some
> period> equal to 36s. In all that time, I hadn't seen a single log
> mount failure, and the moment I get a wrapped log:
>
>  1917 HEADER Cycle 10 tail 9:018456 len  32256 ops 468
>  1981 HEADER Cycle 10 tail 9:018456 len  32256 ops 427
>       ^^^^^^^^^^^^^^^
> [00000 - 02045] Cycle 0x0000000a New Cycle 0x00000009
>
> [  368.364232] XFS (vdb): Mounting Filesystem
> [  369.096144] XFS (vdb): Starting recovery (logdev: internal)
> [  369.126545] XFS (vdb): xlog_recover_process_data: bad clientid 0x2c
> [  369.129522] XFS (vdb): log mount/recovery failed: error 5
> [  369.131884] XFS (vdb): log mount failed
>
> Ok, so no LVM, no power failure involved, etc. Dig deeper. Let's see
> if logprint can dump the transactional record of the log:
>
> # xfs_logprint -f log.img -t
> .....
> LOG REC AT LSN cycle 9 block 20312 (0x9, 0x4f58)
>
> LOG REC AT LSN cycle 9 block 20376 (0x9, 0x4f98)
> xfs_logprint: failed in xfs_do_recovery_pass, error: 12288
> #
>
> Ok, xfs_logprint failed to decode the wrapped transaction at the end
> of the log. I can't see anything obviously wrong with the contents
> of the log off the top of my head (logprint is notoriously buggy),
> but the above command can reproduce the problem (3 out of 3 so far),
> so I should be able to track down the bug from this.
>
> Cheers,
>
> Dave.

OK, very glad to hear you were able to reproduce it.
Good luck, and now let the chase begin :)

Cheers,
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-29 12:43 ` Yann Dupont
@ 2012-10-30  1:33 ` Dave Chinner
  2012-10-31 11:45   ` Gaudenz Steinlin
  2012-11-05 13:57   ` Yann Dupont
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2012-10-30 1:33 UTC (permalink / raw)
To: Yann Dupont; +Cc: xfs

On Mon, Oct 29, 2012 at 01:43:00PM +0100, Yann Dupont wrote:
> On 29/10/2012 13:18, Dave Chinner wrote:
> > On Mon, Oct 29, 2012 at 12:25:40PM +1100, Dave Chinner wrote:
> > > On Mon, Oct 29, 2012 at 10:48:02AM +1100, Dave Chinner wrote:
> > > > On Sat, Oct 27, 2012 at 12:05:34AM +0200, Yann Dupont wrote:
> > > > > On 26/10/2012 12:03, Yann Dupont wrote:
> > > > > > On 25/10/2012 23:10, Dave Chinner wrote:
> > > > > - mkfs.xfs on it, with default options
> > > > > - mounted with default options
> > > > > - launched something that hammers this volume. I launched
> > > > >   compilebench 0.6 on it
> > > > > - waited some time to fill memory/buffers and to be sure the
> > > > >   disks were really busy. I waited some minutes after the
> > > > >   initial 30 kernel unpacks in compilebench
> > > > > - hard reset the server (I'm using the iDRAC of the server
> > > > >   to generate a power cycle)
> > > > > - After a few tries, I finally could not mount the xfs
> > > > >   volume, with the error reported in previous mails. So far
> > > > >   this is normal.
> > > > So it doesn't happen every time, and it may be power cycle
> > > > related. What is your "local disk"?
> > > I can't reproduce this with a similar setup but using KVM (i.e.
> > > killing the VM instead of power cycling) or forcing a shutdown of
> > > the filesystem without flushing the log. The second case is very
> > > much the same as power cycling, but without the potential "power
> > > failure caused partial IOs to be written" problem.
> > >
> > > The only thing I can see in the logprint that I haven't seen so
> > > far in my testing is that your log print indicates a checkpoint
> > > that wraps the end of the log. I haven't yet hit that situation
> > > by chance, so I'll keep trying to see if that's the case that is
> > > causing the problem....
> > Well, it's taken about 12 hours of random variation of parameters
> > in the loop of:
> >
> >	mount /dev/vdb /mnt/scratch
> >	./compilebench -D /mnt/scratch &
> >	sleep <some period>
> >	/home/dave/src/xfstests-dev/src/godown /mnt/scratch
> >	sleep 5
> >	umount /mnt/scratch
> >	xfs_logprint -d /dev/vdb
.....
> > Ok, xfs_logprint failed to decode the wrapped transaction at the
> > end of the log. I can't see anything obviously wrong with the
> > contents of the log off the top of my head (logprint is
> > notoriously buggy), but the above command can reproduce the
> > problem (3 out of 3 so far), so I should be able to track down the
> > bug from this.
>
> OK, very glad to hear you were able to reproduce it.
> Good luck, and now let the chase begin :)

Not really a huge chase, just a simple matter of isolation. The patch
below should fix the problem.

However, the fact that recovery succeeded on 3.4 means you may have a
corrupted filesystem. The bug has been present since 3.0-rc1 (it came
in with a fix for vmap memory leaks), and recovery is trying to replay
stale items from the previous log buffer. As such, it is possible for
changes from a previous checkpoint to have overwritten more recent
changes in the current checkpoint. You should therefore probably run
xfs_repair -n over the filesystems that you remounted on 3.4 after
they failed on 3.6, just to make sure they are OK.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: fix reading of wrapped log data

From: Dave Chinner <dchinner@redhat.com>

Commit 4439647 ("xfs: reset buffer pointers before freeing them") in
3.0-rc1 introduced a regression when recovering log buffers that
wrapped around the end of the log. The second part of the log buffer,
at the start of the physical log, was being read into the header
buffer rather than the data buffer, and hence recovery was seeing
garbage in the data buffer when it got to the region of the log
buffer that was incorrectly read.

Cc: <stable@vger.kernel.org> # 3.0.x, 3.2.x, 3.4.x, 3.6.x
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e445550..02ff9a8 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3646,7 +3646,7 @@ xlog_do_recovery_pass(
 			 *   - order is important.
 			 */
 			error = xlog_bread_offset(log, 0,
-					bblks - split_bblks, hbp,
+					bblks - split_bblks, dbp,
 					offset + BBTOB(split_bblks));
 			if (error)
 				goto bread_err2;
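[Editor's note: to see why this one-character change matters, here is a toy model (hypothetical buffers and helper, not the real XFS structures) of recovering a record that wraps the end of the log. The first part is read into the data buffer; the buggy code read the second part into the header buffer instead, leaving stale bytes in the data buffer's tail, which is exactly the garbage that produced the "bad clientid" recovery failures above.]

```python
def recover_record(tail_blocks, head_blocks, split, total, buggy):
    """Model one wrapped-record read in log recovery.

    tail_blocks: record bytes stored at the physical end of the log.
    head_blocks: record bytes continuing at block 0.
    The fixed code reads both parts into the data buffer (dbp); the
    buggy code read the second part into the header buffer (hbp),
    so dbp kept whatever stale bytes it already held.
    """
    hbp = bytearray(b"H" * total)        # header buffer
    dbp = bytearray(b"S" * total)        # data buffer with stale contents
    dbp[:split] = tail_blocks[:split]    # first part: end of physical log
    target = hbp if buggy else dbp       # the one-letter bug: hbp vs dbp
    target[split:total] = head_blocks[:total - split]
    return bytes(dbp)                    # what recovery actually replays

# A record split 3 + 3 blocks across the wrap point.
assert recover_record(b"AAA", b"BBB", 3, 6, buggy=False) == b"AAABBB"
assert recover_record(b"AAA", b"BBB", 3, 6, buggy=True) == b"AAASSS"
```

With the bug, recovery replays "AAASSS": the stale tail is interpreted as log operations, producing errors such as the bogus client ID, or worse, silently replaying old checkpoint data over newer metadata.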
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-30  1:33 ` Dave Chinner
@ 2012-10-31 11:45 ` Gaudenz Steinlin
  2012-11-05 13:57 ` Yann Dupont
  1 sibling, 0 replies; 18+ messages in thread
From: Gaudenz Steinlin @ 2012-10-31 11:45 UTC (permalink / raw)
To: linux-xfs

Hi

Dave Chinner <david <at> fromorbit.com> writes:
> On Mon, Oct 29, 2012 at 01:43:00PM +0100, Yann Dupont wrote:
> > On 29/10/2012 13:18, Dave Chinner wrote:
> > OK, very glad to hear you were able to reproduce it.
> > Good luck, and now let the chase begin :)
>
> Not really a huge chase, just a simple matter of isolation. The
> patch below should fix the problem.

I ran into the same bug this morning with my home partition, after a
crash on suspend to RAM. I can confirm that the patch Dave posted to
the list fixes the problem. I have a backup of the raw log and the
whole filesystem if you need it for further investigation.

BTW: has this fix already been sent upstream? I could not find it
anywhere. But then, it may just not be there yet.

Gaudenz
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-30  1:33 ` Dave Chinner
  2012-10-31 11:45 ` Gaudenz Steinlin
@ 2012-11-05 13:57 ` Yann Dupont
  1 sibling, 0 replies; 18+ messages in thread
From: Yann Dupont @ 2012-11-05 13:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 30/10/2012 02:33, Dave Chinner wrote:
>
> Not really a huge chase, just a simple matter of isolation. The
> patch below should fix the problem.

Yes, it does. Thanks a lot for the fast answer and this one-letter
patch!

> However, the fact that recovery succeeded on 3.4 means you may have
> a corrupted filesystem. The bug has been present since 3.0-rc1
> (which was a fix for vmap memory leaks), and recovery is trying to
> replay stale items from the previous log buffer. As such, it is
> possible for changes from a previous checkpoint to have overwritten
> more recent changes in the current checkpoint. As such, you should

Ouch.

> probably run xfs_repair -n over the filesystems that you remounted
> on 3.4 that failed on 3.6 just to make sure they are OK.

Will do. As someone else pointed out, I think this should go to a
stable release.

Thanks a lot for your time,

Cheers

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-28 23:48 ` Dave Chinner
  2012-10-29  1:25 ` Dave Chinner
@ 2012-10-29  8:07 ` Yann Dupont
  2012-10-29  8:17   ` Yann Dupont
  1 sibling, 1 reply; 18+ messages in thread
From: Yann Dupont @ 2012-10-29 8:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 29/10/2012 00:48, Dave Chinner wrote:
> > This is reproducible. Here is how to do it:
> > - Started a 3.6.2 kernel.
> > - I created a fresh LVM volume of 20 GB on local disk.
> Can you reproduce the problem without LVM?

Hello Dave. That is THE question. My intent was to test with and
without LVM, but right now I can't, because all my disks are consumed
by LVM. In fact my test setup doesn't even have enough space to
locally clone the volume where I have errors. I only have 146 sas
disks on this machine. I have to set up another test platform, and as
I'm currently travelling, it won't be easy before next week.

What I want to try this week is to:
- re-crash a volume, possibly smaller
- download this image to another machine with KVM
- see if I have the mounting problems inside this KVM
- begin to bisect the kernel

...

> > - After a few tries, I finally could not mount the xfs volume,
> >   with the error reported in previous mails. So far this is
> >   normal.
> So it doesn't happen every time, and it may be power cycle related.

Yes, during my tests I had to power cycle 3 or 4 times before hitting
the actual problem.

> What is your "local disk"?

A RAID 1 array (2 disks) on mptsas.

> > xfs_logprint doesn't say much:
> >
> > xfs_logprint:
> >     data device: 0xfe02
> >     log device: 0xfe02 daddr: 10485792 length: 20480
> >
> >     Header 0x7c wanted 0xfeedbabe
> > **********************************************************************
> > * ERROR: header cycle=124 block=5414                                 *
> > **********************************************************************
> You didn't look past the initial error, did you? The file is only
> 482280 lines long, and 482200 lines of that are decoded log
> data.... :)

Well, I'd tried with -c, but sorry, I didn't have any experience with
xfs_logprint so far.

> > I tried xfs_logprint -c; it gave a 22M file. You can grab it here:
> > http://filex.univ-nantes.fr/get?k=QnBXivz2J3LmzJ18uBV
> I really need the raw log data, not the parsed output. The logprint
> command to do that is "-C <file>", not "-c".

OK... I should have read the man page more carefully. Time to restart
a crash session.

> > - Rebooted 3.4.15
> > - xfs_logprint gives the exact same result as with 3.6.2 (diff
> >   tells no differences)
> Given that it's generated by the logprint application, I'd expect it
> to be identical.

Me too, but I'd also expect the log replay to be identical between
the two kernels.

> > but on 3.4.15, I can mount the volume without problem; the log is
> > replayed.
> >
> > For information, here is the xfs_info output for the volume:
> >
> > root@label5:/mnt/debug# xfs_info /mnt/tempo
> > meta-data=/dev/mapper/LocalDisk-crashdisk isize=256 agcount=8,
> >          agsize=655360 blks
> How did you get a default of 8 AGs? That seems wrong. What version
> of mkfs.xfs are you using?

root@label5:~# mkfs.xfs -V
mkfs.xfs version 3.1.7

The volume was freshly formatted, with default options. Absolutely
nothing special on my side.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
  2012-10-29  8:07 ` Yann Dupont
@ 2012-10-29  8:17 ` Yann Dupont
  0 siblings, 0 replies; 18+ messages in thread
From: Yann Dupont @ 2012-10-29 8:17 UTC (permalink / raw)
To: xfs

On 29/10/2012 09:07, Yann Dupont wrote:
>
> Hello Dave. That is THE question. My intent was to test with and
> without LVM, but right now I can't, because all my disks are
> consumed by LVM. In fact my test setup doesn't even have enough
> space to locally clone the volume where I have errors. I only have
> 146 sas disks on this machine.

Whoops, read: 146 GB sas disks. If I really had 146 SAS disks on that
machine, I wouldn't have disk space problems :)

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
end of thread, other threads:[~2012-11-05 13:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-22 14:14 Is kernel 3.6.1 or filestreams option toxic ? Yann Dupont
2012-10-23  8:24 ` Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?) Yann Dupont
2012-10-25 15:21   ` Yann Dupont
2012-10-25 20:55     ` Yann Dupont
2012-10-25 21:10       ` Dave Chinner
2012-10-26 10:03         ` Yann Dupont
2012-10-26 22:05           ` Yann Dupont
2012-10-28 23:48             ` Dave Chinner
2012-10-29  1:25               ` Dave Chinner
2012-10-29  8:11                 ` Yann Dupont
2012-10-29 12:21                   ` Dave Chinner
2012-10-29 12:18                 ` Dave Chinner
2012-10-29 12:43                   ` Yann Dupont
2012-10-30  1:33                     ` Dave Chinner
2012-10-31 11:45                       ` Gaudenz Steinlin
2012-11-05 13:57                       ` Yann Dupont
2012-10-29  8:07               ` Yann Dupont
2012-10-29  8:17                 ` Yann Dupont
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox