* help with deciphering kernel dmesg
@ 2012-01-25 5:24 Keith Keller
2012-01-25 6:42 ` Dave Chinner
0 siblings, 1 reply; 3+ messages in thread
From: Keith Keller @ 2012-01-25 5:24 UTC (permalink / raw)
To: linux-xfs
Hello all,
I had a strange issue with my largish XFS filesystem, and was hoping to
get some help figuring out what it could mean.
Background: this is a ~14TB XFS filesystem on an LVM volume on top of a
Linux md RAID6 of nine 2TB disks. In the past I've had no problems with
this setup, but I recently added a disk to reshape to ten disks. During an
rsnapshot run I got the messages below in dmesg, and when I went to umount
the fs to do an xfs_repair -n, umount claimed that the filesystem was
busy, even though lsof didn't show any files open on it.
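As a sanity check, the RAID6 geometry is consistent with the ~14TB figure: RAID6 spends two members on parity, so nine 2TB disks leave roughly 14TB usable. A quick sketch of the arithmetic (using decimal TB, as disk vendors do):

```shell
# RAID6 usable capacity: (members - 2 parity) * member size.
# Nine 2TB (decimal TB) members:
disks=9
disk_tb=2
echo "usable: $(( (disks - 2) * disk_tb )) TB"   # prints: usable: 14 TB
```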
The last time I did an mdraid reshape, the kernel hard-crashed after a
small amount of disk activity, and at that time the kernel messages
seemed to point to an issue in the mdraid module. But I wasn't able to
save them, so I'm not sure whether the trace came from the mdraid
module or the xfs module.
The kernel is the latest from ELrepo (2.6.39-4.el5.elrepo). I am
currently running xfs_repair -n, which hasn't yet found any errors.
(Since the fs isn't mounted I can't yet give xfs_info output, and
because the RAID6 is still reshaping the xfs_repair is taking a long
time. I can interrupt it if needed, but I'm hoping to let it finish.)
In the past the RAID6 has passed a check, but I have not done one this
month. I hope I have munged any identifying information.
Thanks for any suggestions; my dmesg output snippet is below. A
cursory web search on the keywords in the first line of the dmesg output
didn't turn up anything terribly interesting. One result suggested
ACL or SELinux issues, but my system isn't using either AFAIK (at least
it shouldn't be).
--keith
Jan 23 23:09:37 XXXXX kernel: XFS (dm-5): I/O error occurred: meta-data dev dm-5 block 0x57786050 ("xfs_trans_read_buf") error 5 buf count 4096
Jan 23 23:10:01 XXXXX kernel: XFS (dm-5): I/O error occurred: meta-data dev dm-5 block 0x5778dff8 ("xfs_trans_read_buf") error 5 buf count 4096
Jan 23 23:10:05 XXXXX kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Jan 23 23:10:05 XXXXX kernel: IP: [<ffffffffa0435ed2>] xfs_da_do_buf+0x43e/0x592 [xfs]
Jan 23 23:10:05 XXXXX kernel: PGD f3b7d067 PUD eaf19067 PMD 0
Jan 23 23:10:05 XXXXX kernel: Oops: 0000 [#1] PREEMPT SMP
Jan 23 23:10:05 XXXXX kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0b.0/0000:01:03.0/local_cpus
Jan 23 23:10:05 XXXXX kernel: CPU 1
Jan 23 23:10:05 XXXXX kernel: Modules linked in: xfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx netconsole autofs4 nfs fscache auth_rpcgss nfs_acl lockd sunrpc binfmt_misc dm_multipath scsi_dh video sbs sbshc power_meter pci_slot hed battery acpi_memhotplug acpi_ipmi ipmi_msghandler ac lp st osst ch sg sr_mod cdrom e100 mii tg3 serio_raw button i2c_amd756 parport_pc parport tpm_tis tpm i2c_amd8111 tpm_bios rtc_cmos i2c_core k8temp shpchp floppy amd64_edac_mod pcspkr pci_hotplug hwmon edac_core amd_rng dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod 3w_9xxx aic79xx scsi_transport_spi pata_amd pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: netconsole]
Jan 23 23:10:05 XXXXX kernel:
Jan 23 23:10:05 XXXXX kernel: Pid: 9308, comm: rsync Not tainted 2.6.39-4.el5.elrepo #1 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
Jan 23 23:10:05 XXXXX kernel: RIP: 0010:[<ffffffffa0435ed2>] [<ffffffffa0435ed2>] xfs_da_do_buf+0x43e/0x592 [xfs]
Jan 23 23:10:05 XXXXX kernel: RSP: 0018:ffff8800ead4fcb8 EFLAGS: 00010246
Jan 23 23:10:05 XXXXX kernel: RAX: 0000000000000000 RBX: ffff8800ead4fd38 RCX: 0000000000000002
Jan 23 23:10:05 XXXXX kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 23 23:10:05 XXXXX kernel: RBP: ffff8800ead4fd88 R08: 0000000000000001 R09: 0000000000000202
Jan 23 23:10:05 XXXXX kernel: R10: ffff8800ead4fc18 R11: ffff8800f39b80f8 R12: 0000000000000000
Jan 23 23:10:05 XXXXX kernel: R13: 0000000000000000 R14: ffff8800f3a8a800 R15: 0000000000000000
Jan 23 23:10:05 XXXXX rsnapshot[9316]: /usr/bin/rsnapshot sync: ERROR: /usr/bin/rsync returned 12 while processing /mnt/YYYYYYYY
Jan 23 23:10:05 XXXXX kernel: FS: 00007f4f1be006e0(0000) GS:ffff8800f7c00000(0000) knlGS:0000000000000000
Jan 23 23:10:05 XXXXX kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 23 23:10:05 XXXXX kernel: CR2: 0000000000000008 CR3: 00000000f2f77000 CR4: 00000000000006e0
Jan 23 23:10:05 XXXXX kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 23 23:10:05 XXXXX kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 23 23:10:05 XXXXX kernel: Process rsync (pid: 9308, threadinfo ffff8800ead4e000, task ffff8800f39b80c0)
Jan 23 23:10:05 XXXXX kernel: Stack:
Jan 23 23:10:05 XXXXX kernel: ffff8800ead4fd48 ffffffffa0464fd7 ffff8800ead4fcd8 ffffffff8107ec2b
Jan 23 23:10:05 XXXXX kernel: ffff8800ead4fce8 00000000813c5ae6 ffff8800ead4fe50 ffff8800ead4fdb0
Jan 23 23:10:05 XXXXX kernel: ffff88002b926400 0000000000000000 ffff8800ead4fd18 0000000100000000
Jan 23 23:10:05 XXXXX kernel: Call Trace:
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa0464fd7>] ? spin_unlock+0xe/0x10 [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffff8107ec2b>] ? __raw_spin_unlock+0x29/0x34
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa045d64d>] ? xfs_buf_relse+0x1e/0x22 [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa0436098>] xfs_da_read_buf+0x2a/0x2c [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa043cbf3>] ? xfs_dir2_leaf_getdents+0x372/0x637 [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa043cbf3>] xfs_dir2_leaf_getdents+0x372/0x637 [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffff8113a83c>] ? sys_old_readdir+0x63/0x63
Jan 23 23:10:05 XXXXX kernel: [<ffffffff8113a83c>] ? sys_old_readdir+0x63/0x63
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa0437f0e>] xfs_readdir+0xb5/0xc8 [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffff811b0fbf>] ? fsnotify_perm+0x70/0x7c
Jan 23 23:10:05 XXXXX kernel: [<ffffffff8113a83c>] ? sys_old_readdir+0x63/0x63
Jan 23 23:10:05 XXXXX kernel: [<ffffffffa0468856>] xfs_file_readdir+0x3f/0x4e [xfs]
Jan 23 23:10:05 XXXXX kernel: [<ffffffff8113a6e9>] vfs_readdir+0x78/0x9f
Jan 23 23:10:05 XXXXX kernel: [<ffffffff8113a78f>] sys_getdents+0x7f/0xc9
Jan 23 23:10:05 XXXXX kernel: [<ffffffff813cb802>] system_call_fastpath+0x16/0x1b
Jan 23 23:10:05 XXXXX kernel: Code: eb 17 45 31 ed 48 83 7d c0 00 74 15 48 8b 55 18 48 8d 75 c0 bf 01 00 00 00 e8 d1 ef ff ff 49 89 c5 83 7d 10 01 0f 85 a3 00 00 00
Jan 23 23:10:05 XXXXX kernel: <49> 8b 5d 08 0f b7 7b 08 e8 15 ea ff ff 0f b7 c0 8b 13 0f ca 3d
Jan 23 23:10:05 XXXXX kernel: RIP [<ffffffffa0435ed2>] xfs_da_do_buf+0x43e/0x592 [xfs]
Jan 23 23:10:05 XXXXX kernel: RSP <ffff8800ead4fcb8>
Jan 23 23:10:05 XXXXX kernel: CR2: 0000000000000008
Jan 23 23:10:05 XXXXX kernel: ---[ end trace b5fbfad2fe7e2837 ]---
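For reference, the block numbers in the two earlier I/O errors are XFS daddrs; assuming the conventional 512-byte daddr unit (an assumption, since the log line doesn't state its units), the first failing block converts to a device byte offset like this:

```shell
# Convert the failing metadata block from the log (0x57786050) to a byte
# offset on dm-5, assuming XFS daddrs are 512-byte basic blocks.
daddr=$(( 0x57786050 ))
echo "daddr (decimal): $daddr"              # 1467506768
echo "byte offset:     $(( daddr * 512 ))"  # 751363465216
```

That offset could then be cross-checked against the region the md reshape was working through when the reads failed.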
--
kkeller@wombat.san-francisco.ca.us
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: help with deciphering kernel dmesg
2012-01-25 5:24 help with deciphering kernel dmesg Keith Keller
@ 2012-01-25 6:42 ` Dave Chinner
2012-01-26 4:57 ` Keith Keller
0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2012-01-25 6:42 UTC (permalink / raw)
To: Keith Keller; +Cc: linux-xfs
On Tue, Jan 24, 2012 at 09:24:26PM -0800, Keith Keller wrote:
> Hello all,
>
> I had a strange issue with my largish XFS filesystem, and was hoping to
> get some help figuring out what it could mean.
>
> Background: this is a ~14TB XFS filesystem on an LVM volume on top of a
> Linux md RAID6 of nine 2TB disks. In the past I've had no problems with
> this setup, but I recently added a disk to reshape to ten disks. During an
> rsnapshot run I got the messages below in dmesg, and when I went to umount
> the fs to do an xfs_repair -n, umount claimed that the filesystem was
> busy, even though lsof didn't show any files open on it.
>
> The last time I did an mdraid reshape, the kernel hard-crashed after a
> small amount of disk activity, and at that time the kernel messages
> seemed to point to an issue in the mdraid module. But I wasn't able to
> save them, so I'm not sure whether the trace came from the mdraid
> module or the xfs module.
I suspect from the IO errors that there is a problem in the md
reshape code, and the XFS code has failed to handle it properly.
> The kernel is the latest from ELrepo (2.6.39-4.el5.elrepo). I am
> currently running xfs_repair -n, which hasn't yet found any errors.
> (Since the fs isn't mounted I can't yet give xfs_info output, and
> because the RAID6 is still reshaping the xfs_repair is taking a long
> time. I can interrupt it if needed, but I'm hoping to let it finish.)
> In the past the RAID6 has passed a check, but I have not done one this
> month. I hope I have munged any identifying information.
the xfs_info output would be really handy for determining what path
through the directory code XFS was taking when the crash occurred.
> Thanks for any suggestions; my dmesg output snippet is below. A
> cursory web search on the keywords in the first line of the dmesg output
> didn't turn up anything terribly interesting. One result suggested
> ACL or SELinux issues, but my system isn't using either AFAIK (at least
> it shouldn't be).
>
> --keith
>
> Jan 23 23:09:37 XXXXX kernel: XFS (dm-5): I/O error occurred: meta-data dev dm-5 block 0x57786050 ("xfs_trans_read_buf") error 5 buf count 4096
> Jan 23 23:10:01 XXXXX kernel: XFS (dm-5): I/O error occurred: meta-data dev dm-5 block 0x5778dff8 ("xfs_trans_read_buf") error 5 buf count 4096
> Jan 23 23:10:05 XXXXX kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> Jan 23 23:10:05 XXXXX kernel: IP: [<ffffffffa0435ed2>] xfs_da_do_buf+0x43e/0x592 [xfs]
I'd be worried about those IO errors - I don't think that they were
the cause of the oops, but they imply that the underlying device is
bad in some way. That may have something to do with the reshape in
progress, which makes me worry whether the reshape is actually keeping
your data safe....
As it is, the kernel crashed reading a directory buffer. It's hard
to say what went wrong - can you take the kernel image and run:
$ gdb <path/to/kernel>
(gdb) l *(xfs_da_do_buf+0x43e)
And post the output so we can see what line number in the code the
crash occurred at? That might provide a bit more of a clue to what
the problem is.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: help with deciphering kernel dmesg
2012-01-25 6:42 ` Dave Chinner
@ 2012-01-26 4:57 ` Keith Keller
0 siblings, 0 replies; 3+ messages in thread
From: Keith Keller @ 2012-01-26 4:57 UTC (permalink / raw)
To: linux-xfs
On 2012-01-25, Dave Chinner <david@fromorbit.com> wrote:
>
> the xfs_info output would be really handy for determining what path
> through the directory code XFS was taking when the crash occurred.
No problem, here it is. The device is an LVM volume. Unfortunately
I've mounted and umounted the drive a few times since the reboot, so I
don't know how helpful this will actually be. I can attempt to repeat
the symptoms and then try an xfs_info before attempting anything else.
(I ended up killing the xfs_repair -n to get this sooner, so I still
don't have any information from that. So far it's in phase 4, which is
taking a very long time; I think the reshape is stealing IO cycles,
which it's not really supposed to do. It hasn't reported any errors so
far.)
meta-data=/dev/XXXXXXXX          isize=256    agcount=57, agsize=61034784 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417949184, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
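Those geometry numbers line up with the ~14TB figure mentioned earlier; multiplying the data block count by the block size gives the filesystem size (a quick check, nothing more):

```shell
# Filesystem size from the xfs_info output above: blocks * bsize.
bsize=4096
blocks=3417949184
bytes=$(( blocks * bsize ))
tb100=$(( bytes / 10000000000 ))   # hundredths of a decimal TB
echo "$bytes bytes"                                  # 13999919857664 bytes
printf '%d.%02d decimal TB\n' $(( tb100 / 100 )) $(( tb100 % 100 ))  # 13.99
```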
> I'd be worried about those IO errors - I don't think that they were
> the cause of the oops, but they imply that the underlying device is
> bad in some way. That may have something to do with the reshape in
> progress, which makes me worry whether the reshape is actually keeping
> your data safe....
Yes, that was my worry as well. Fortunately this is a backup that can
be recreated, but I'd hate to lose my primary store then find out the
backup is hosed.
> As it is, the kernel crashed reading a directory buffer. It's hard
> to say what went wrong - can you take the kernel image and run:
>
> $ gdb <path/to/kernel>
> (gdb) l *(xfs_da_do_buf+0x43e)
>
> And post the output so we can see what line number in the code the
> crash occurred at? That might provide a bit more of a clue to what
> the problem is.
Does my kernel need debugging symbols compiled in? My kernel
doesn't seem to want to cooperate with gdb:
# gdb /boot/vmlinuz-2.6.39-4.el5.elrepo
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
[snip]
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/boot/vmlinuz-2.6.39-4.el5.elrepo": not in executable format: File format not recognized
(gdb) l *(xfs_da_do_buf+0x43e)
No symbol table is loaded. Use the "file" command.
My compiling skills are generally confined to ./configure; make; make
install, so I'm not sure where to go next. If debug symbols need to be
compiled into the kernel, that may be problematic--it looks like ELrepo
doesn't provide a debug build of the same kernel, so I'd have to build
one myself to get that. (Wow, I haven't built a kernel in over five
years!)
--keith
--
kkeller@wombat.san-francisco.ca.us