* nfsd pointer in __d_lookup
@ 2006-06-01 20:12 Ed L. Cashin
2006-06-02 11:36 ` Neil Brown
0 siblings, 1 reply; 4+ messages in thread
From: Ed L. Cashin @ 2006-06-01 20:12 UTC (permalink / raw)
To: nfs
Hi. On an x86_64 machine, the trace below appears in the system logs
just before NFS service becomes unavailable (until a reboot).
I'm working on getting more specifics, but this machine is exporting
an XFS on an LVM logical volume on one or more AoE device. The aoe
driver in use is not the one in 2.6.16.18 (it's aoe6-23), but I'm
asking the end user to verify that this still happens with the aoe
driver in 2.6.16.18.
Meanwhile, I'm hoping that the trace below will look familiar to
someone. It looks to me like a 32-bit all-ones value might have been
put into a 64-bit variable by mistake.
Unable to handle kernel paging request at 00000000ffffffff RIP:
<ffffffff80179456>{__d_lookup+216}
PGD 3d29d067 PUD 0
CPU 0
Modules linked in: ipv6 nfsd lockd nfs_acl sunrpc xfs exportfs dm_mod aoe i2c_i801 i2c_core piix md_mod rtc psmouse unix
Pid: 2535, comm: nfsd Not tainted 2.6.16.18-c1 #8
RIP: 0010:[__d_lookup+216/253] <ffffffff80179456>{__d_lookup+216}
RSP: 0018:ffff81003e6dd958 EFLAGS: 00010206
RAX: 00000000ffffffff RBX: ffff81003ca443c0 RCX: 0000000000000011
RDX: 0000000000014faf RSI: 00c3df84ef894faf RDI: ffff810037552080
RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000c R11: 0000000000000130 R12: ffff810037552080
R13: ffff81003e6dd9c8 R14: 00000000d8788a14 R15: 000000000000000c
FS: 00002b9fd29cbae0(0000) GS:ffffffff803e6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000ffffffff CR3: 000000003d29c000 CR4: 00000000000006e0
Process nfsd (pid: 2535, threadinfo ffff81003e6dc000, task ffff81003f15f790)
Stack: ffff81003c1e2be2 0000000000000000 ffff810037552080 ffff81003e6dd9c8
0000000000000000 ffff8100375547f8 ffff81003c53d140 ffffffff80170839
000000000000000c ffff810037552080
Call Trace: <ffffffff80170839>{__lookup_hash+96} <ffffffff80170935>{lookup_one_len+95}
<ffffffff88116998>{:nfsd:compose_entry_fh+203} <ffffffff88119121>{:nfsd:encode_entry+453}
<ffffffff88119431>{:nfsd:nfs3svc_encode_entry_plus+11}
<ffffffff880ae26d>{:xfs:linvfs_readdir+325} <ffffffff88119426>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffff88119426>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffff801747de>{vfs_readdir+90} <ffffffff80163cdf>{default_llseek+0}
<ffffffff88119426>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffff8810ebc4>{:nfsd:nfsd_readdir+108} <ffffffff88116391>{:nfsd:nfsd3_proc_readdirplus+244}
<ffffffff8810b3c2>{:nfsd:nfsd_dispatch+221} <ffffffff880d6360>{:sunrpc:svc_process+975}
<ffffffff802c652f>{__down_read+18} <ffffffff8810b648>{:nfsd:nfsd+451}
<ffffffff8012c278>{do_exit+2054} <ffffffff8010b702>{child_rip+8}
<ffffffff8810b485>{:nfsd:nfsd+0} <ffffffff8010b6fa>{child_rip+0}
--
Ed L Cashin <ecashin@coraid.com>
-------------------------------------------------------
All the advantages of Linux Managed Hosting--Without the Cost and Risk!
Fully trained technicians. The highest number of Red Hat certifications in
the hosting industry. Fanatical Support. Click to learn more
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=107521&bid=248729&dat=121642
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: nfsd pointer in __d_lookup 2006-06-01 20:12 nfsd pointer in __d_lookup Ed L. Cashin @ 2006-06-02 11:36 ` Neil Brown 2006-06-02 15:37 ` Ed L. Cashin 0 siblings, 1 reply; 4+ messages in thread From: Neil Brown @ 2006-06-02 11:36 UTC (permalink / raw) To: Ed L. Cashin; +Cc: nfs On Thursday June 1, ecashin@coraid.com wrote: > Hi. On an x86_64 machine, the trace below appears in the system logs > just before NFS service becomes unavailable (until a reboot). > > I'm working on getting more specifics, but this machine is exporting > an XFS on an LVM logical volume on one or more AoE device. The aoe > driver in use is not the one in 2.6.16.18 (it's aoe6-23), but I'm > asking the end user to verify that this still happens with the aoe > driver in 2.6.16.18. > > Meanwhile, I'm hoping that the trace below will look familiar to > someone. It looks to me like a 32-bit all-ones value might have been > put into a 64-bit variable by mistake. Not enough detail. If you could get a disassembly of __d_lookup so we could see where +216 was, that might help. > > Unable to handle kernel paging request at 00000000ffffffff RIP: > <ffffffff80179456>{__d_lookup+216} > PGD 3d29d067 PUD 0 > CPU 0 > Modules linked in: ipv6 nfsd lockd nfs_acl sunrpc xfs exportfs dm_mod aoe i2c_i801 i2c_core piix md_mod rtc psmouse unix > Pid: 2535, comm: nfsd Not tainted 2.6.16.18-c1 #8 > RIP: 0010:[__d_lookup+216/253] <ffffffff80179456>{__d_lookup+216} > RSP: 0018:ffff81003e6dd958 EFLAGS: 00010206 > RAX: 00000000ffffffff RBX: ffff81003ca443c0 RCX: 0000000000000011 It appears to be dereferencing RAX at an offset of zero. Most of the structure references in that code aren't at offset zero. My guess is that it is in hlist_for_each_entry_rcu, the pos = pos->next I cannot think what result in all those '1's being in pos though. NeilBrown _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: nfsd pointer in __d_lookup 2006-06-02 11:36 ` Neil Brown @ 2006-06-02 15:37 ` Ed L. Cashin 2006-06-04 23:31 ` Neil Brown 0 siblings, 1 reply; 4+ messages in thread From: Ed L. Cashin @ 2006-06-02 15:37 UTC (permalink / raw) To: Neil Brown; +Cc: nfs On Fri, Jun 02, 2006 at 09:36:40PM +1000, Neil Brown wrote: > On Thursday June 1, ecashin@coraid.com wrote: > > Hi. On an x86_64 machine, the trace below appears in the system logs > > just before NFS service becomes unavailable (until a reboot). > > > > I'm working on getting more specifics, but this machine is exporting > > an XFS on an LVM logical volume on one or more AoE device. The aoe > > driver in use is not the one in 2.6.16.18 (it's aoe6-23), but I'm > > asking the end user to verify that this still happens with the aoe > > driver in 2.6.16.18. > > > > Meanwhile, I'm hoping that the trace below will look familiar to > > someone. It looks to me like a 32-bit all-ones value might have been > > put into a 64-bit variable by mistake. > > Not enough detail. > If you could get a disassembly of __d_lookup so we could see where > +216 was, that might help. Sure, I can do that. I used this command: objdump --disassemble --section=.text ~/kernel/linux-2.6.16.18/vmlinux \ | sed -n '/<__d_lookup>:/,/^$/p' ffffffff8017937e <__d_lookup>: ffffffff8017937e: 41 57 push %r15 ffffffff80179380: 48 ba 01 00 fc ff ff mov $0x9e37fffffffc0001,%rdx ffffffff80179387: ff 37 9e ffffffff8017938a: 41 56 push %r14 ffffffff8017938c: 41 55 push %r13 ffffffff8017938e: 49 89 f5 mov %rsi,%r13 ffffffff80179391: 41 54 push %r12 ffffffff80179393: 49 89 fc mov %rdi,%r12 ffffffff80179396: 55 push %rbp ffffffff80179397: 53 push %rbx ffffffff80179398: 53 push %rbx ffffffff80179399: 48 8b 46 08 mov 0x8(%rsi),%rax ffffffff8017939d: 44 8b 36 mov (%rsi),%r14d ffffffff801793a0: 44 8b 7e 04 mov 0x4(%rsi),%r15d ffffffff801793a4: 8b 0d d6 4c 24 00 mov 2378966(%rip),%ecx # ffffffff803be080 <d_hash_shift> ffffffff801793aa: 48 89 04 24 mov %rax,(%rsp) ffffffff801793ae: 48 89 f8 mov %rdi,%rax ffffffff801793b1: 48 31 d0 xor %rdx,%rax ffffffff801793b4: 44 89 f6 mov %r14d,%esi ffffffff801793b7: 48 c1 e8 07 shr $0x7,%rax ffffffff801793bb: 48 01 c6 add %rax,%rsi ffffffff801793be: 48 8b 05 b3 4c 24 00 mov 2378931(%rip),%rax # ffffffff803be078 <dentry_hashtable> ffffffff801793c5: 48 31 f2 xor %rsi,%rdx ffffffff801793c8: 48 d3 ea shr %cl,%rdx ffffffff801793cb: 48 31 d6 xor %rdx,%rsi ffffffff801793ce: 8b 15 b0 4c 24 00 mov 2378928(%rip),%edx # ffffffff803be084 <d_hash_mask> ffffffff801793d4: 21 f2 and %esi,%edx ffffffff801793d6: 48 8b 2c d0 mov (%rax,%rdx,8),%rbp ffffffff801793da: eb 75 jmp ffffffff80179451 <__d_lookup+0xd3> ffffffff801793dc: 4c 39 63 28 cmp %r12,0x28(%rbx) ffffffff801793e0: 75 6b jne ffffffff8017944d <__d_lookup+0xcf> ffffffff801793e2: 48 8d 7b 08 lea 0x8(%rbx),%rdi ffffffff801793e6: e8 7d d6 14 00 callq ffffffff802c6a68 <_spin_lock> ffffffff801793eb: 4c 39 63 28 cmp %r12,0x28(%rbx) ffffffff801793ef: 75 55 jne ffffffff80179446 <__d_lookup+0xc8> ffffffff801793f1: 49 8b 84 24 88 00 00 mov 0x88(%r12),%rax ffffffff801793f8: 00 ffffffff801793f9: 48 8d 73 30 lea 0x30(%rbx),%rsi ffffffff801793fd: 48 85 c0 test %rax,%rax ffffffff80179400: 74 13 je ffffffff80179415 <__d_lookup+0x97> ffffffff80179402: 48 8b 40 10 mov 0x10(%rax),%rax ffffffff80179406: 48 85 c0 test %rax,%rax ffffffff80179409: 74 0a je ffffffff80179415 <__d_lookup+0x97> ffffffff8017940b: 4c 89 ea mov %r13,%rdx ffffffff8017940e: 4c 89 e7 mov %r12,%rdi ffffffff80179411: ff d0 callq *%rax ffffffff80179413: eb 16 jmp ffffffff8017942b <__d_lookup+0xad> ffffffff80179415: 44 39 7e 04 cmp %r15d,0x4(%rsi) ffffffff80179419: 75 2b jne ffffffff80179446 <__d_lookup+0xc8> ffffffff8017941b: 48 8b 7e 08 mov 0x8(%rsi),%rdi ffffffff8017941f: 48 8b 34 24 mov (%rsp),%rsi ffffffff80179423: 44 89 fa mov %r15d,%edx ffffffff80179426: e8 b1 65 04 00 callq ffffffff801bf9dc <memcmp> ffffffff8017942b: 85 c0 test %eax,%eax ffffffff8017942d: 75 17 jne ffffffff80179446 <__d_lookup+0xc8> ffffffff8017942f: 31 c0 xor %eax,%eax ffffffff80179431: f6 43 04 10 testb $0x10,0x4(%rbx) ffffffff80179435: 75 06 jne ffffffff8017943d <__d_lookup+0xbf> ffffffff80179437: f0 ff 03 lock incl (%rbx) ffffffff8017943a: 48 89 d8 mov %rbx,%rax ffffffff8017943d: c7 43 08 01 00 00 00 movl $0x1,0x8(%rbx) ffffffff80179444: eb 28 jmp ffffffff8017946e <__d_lookup+0xf0> ffffffff80179446: c7 43 08 01 00 00 00 movl $0x1,0x8(%rbx) ffffffff8017944d: 48 8b 6d 00 mov 0x0(%rbp),%rbp ffffffff80179451: 48 85 ed test %rbp,%rbp ffffffff80179454: 74 16 je ffffffff8017946c <__d_lookup+0xee> ffffffff80179456: 48 8b 45 00 mov 0x0(%rbp),%rax ffffffff8017945a: 0f 18 08 prefetcht0 (%rax) ffffffff8017945d: 48 8d 5d e8 lea 0xffffffffffffffe8(%rbp),%rbx ffffffff80179461: 44 39 73 30 cmp %r14d,0x30(%rbx) ffffffff80179465: 75 e6 jne ffffffff8017944d <__d_lookup+0xcf> ffffffff80179467: e9 70 ff ff ff jmpq ffffffff801793dc <__d_lookup+0x5e> ffffffff8017946c: 31 c0 xor %eax,%eax ffffffff8017946e: 41 5b pop %r11 ffffffff80179470: 5b pop %rbx ffffffff80179471: 5d pop %rbp ffffffff80179472: 41 5c pop %r12 ffffffff80179474: 41 5d pop %r13 ffffffff80179476: 41 5e pop %r14 ffffffff80179478: 41 5f pop %r15 ffffffff8017947a: c3 retq ... > > Unable to handle kernel paging request at 00000000ffffffff RIP: > > <ffffffff80179456>{__d_lookup+216} > > PGD 3d29d067 PUD 0 > > CPU 0 > > Modules linked in: ipv6 nfsd lockd nfs_acl sunrpc xfs exportfs dm_mod aoe i2c_i801 i2c_core piix md_mod rtc psmouse unix > > Pid: 2535, comm: nfsd Not tainted 2.6.16.18-c1 #8 > > RIP: 0010:[__d_lookup+216/253] <ffffffff80179456>{__d_lookup+216} > > RSP: 0018:ffff81003e6dd958 EFLAGS: 00010206 > > RAX: 00000000ffffffff RBX: ffff81003ca443c0 RCX: 0000000000000011 > > It appears to be dereferencing RAX at an offset of zero. > Most of the structure references in that code aren't at offset zero. > My guess is that it is in hlist_for_each_entry_rcu, the > pos = pos->next > I cannot think what result in all those '1's being in pos though. > > NeilBrown -- Ed L Cashin <ecashin@coraid.com> _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: nfsd pointer in __d_lookup 2006-06-02 15:37 ` Ed L. Cashin @ 2006-06-04 23:31 ` Neil Brown 0 siblings, 0 replies; 4+ messages in thread From: Neil Brown @ 2006-06-04 23:31 UTC (permalink / raw) To: Ed L. Cashin; +Cc: nfs On Friday June 2, ecashin@coraid.com wrote: > On Fri, Jun 02, 2006 at 09:36:40PM +1000, Neil Brown wrote: > > On Thursday June 1, ecashin@coraid.com wrote: > > > Hi. On an x86_64 machine, the trace below appears in the system logs > > > just before NFS service becomes unavailable (until a reboot). > > > > > > I'm working on getting more specifics, but this machine is exporting > > > an XFS on an LVM logical volume on one or more AoE device. The aoe > > > driver in use is not the one in 2.6.16.18 (it's aoe6-23), but I'm > > > asking the end user to verify that this still happens with the aoe > > > driver in 2.6.16.18. > > > > > > Meanwhile, I'm hoping that the trace below will look familiar to > > > someone. It looks to me like a 32-bit all-ones value might have been > > > put into a 64-bit variable by mistake. > > > > Not enough detail. > > If you could get a disassembly of __d_lookup so we could see where > > +216 was, that might help. > > Sure, I can do that. I used this command: > > objdump --disassemble --section=.text ~/kernel/linux-2.6.16.18/vmlinux \ > | sed -n '/<__d_lookup>:/,/^$/p' > > ffffffff80179435: 75 06 jne ffffffff8017943d <__d_lookup+0xbf> > ffffffff80179437: f0 ff 03 lock incl (%rbx) > ffffffff8017943a: 48 89 d8 mov %rbx,%rax > ffffffff8017943d: c7 43 08 01 00 00 00 movl $0x1,0x8(%rbx) > ffffffff80179444: eb 28 jmp ffffffff8017946e <__d_lookup+0xf0> > ffffffff80179446: c7 43 08 01 00 00 00 movl $0x1,0x8(%rbx) > ffffffff8017944d: 48 8b 6d 00 mov 0x0(%rbp),%rbp > ffffffff80179451: 48 85 ed test %rbp,%rbp > ffffffff80179454: 74 16 je ffffffff8017946c <__d_lookup+0xee> > ffffffff80179456: 48 8b 45 00 mov 0x0(%rbp),%rax ^^^^^^HERE^^^^^ > ffffffff8017945a: 0f 18 08 prefetcht0 (%rax) > ffffffff8017945d: 48 8d 5d e8 lea 0xffffffffffffffe8(%rbp),%rbx > ffffffff80179461: 44 39 73 30 cmp %r14d,0x30(%rbx) > ffffffff80179465: 75 e6 jne ffffffff8017944d <__d_lookup+0xcf> > ffffffff80179467: e9 70 ff ff ff jmpq ffffffff801793dc <__d_lookup+0x5e> > ffffffff8017946c: 31 c0 xor %eax,%eax > ffffffff8017946e: 41 5b pop %r11 > ffffffff80179470: 5b pop %rbx > ffffffff80179471: 5d pop %rbp > ffffffff80179472: 41 5c pop %r12 > ffffffff80179474: 41 5d pop %r13 > ffffffff80179476: 41 5e pop %r14 > ffffffff80179478: 41 5f pop %r15 > ffffffff8017947a: c3 retq That corresponds to the prefetch(pos->next) in hlist_for_each_entry_rcu. 'pos' has the values 0x000ffffffff. Looks to me like an RCU related memory corruption. It's hard to know what would cause this. It would help if it could crash a few times and we could see if the corruption was always 0xffffffff or if it was something else other times. Anyway, I suggest you report it on linux-kernel as I doubt very much if it is an NFS specific problem. NeilBrown _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2006-06-04 23:32 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-06-01 20:12 nfsd pointer in __d_lookup Ed L. Cashin 2006-06-02 11:36 ` Neil Brown 2006-06-02 15:37 ` Ed L. Cashin 2006-06-04 23:31 ` Neil Brown
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.