public inbox for netdev@vger.kernel.org
* Re: kernel 2.6.37 : oops in cleanup_once
       [not found] <4D491B8D.1000107@univ-nantes.fr>
@ 2011-02-02 10:52 ` Eric Dumazet
  2011-02-02 11:24   ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2011-02-02 10:52 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev

On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> Hello.
> We recently upgraded one machine with vanilla 2.6.37, and experienced 2 
> kernel oops since. Each oops is after ~1 week of uptime.
> The last oops was last night but we didn't have any trace.
> 
> Here is the previous oops :
> 
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316042] 
> BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316096] 
> IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316135] PGD 0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316157] 
> Oops: 0002 [#1] SMP
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316188] 
> last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316234] CPU 1
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316240] 
> Modules linked in: xt_physdev ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 
> ipt_LOG xt_multiport xt_limit nf_conntrack_tftp nf_conntrack_ftp tun 
> ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT 
> xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm ipv6 8021q 
> bridge stp ext2 mbcache fuse snd_pcm snd_timer snd soundcore 
> snd_page_alloc i5000_edac edac_core psmouse evdev i5k_amb tpm_tis tpm 
> joydev dcdbas tpm_bios pcspkr rng_core ghes shpchp serio_raw pci_hotplug 
> processor hed button thermal_sys xfs exportfs dm_mod sg sr_mod sd_mod 
> cdrom usbhid hid usb_storage qla2xxx scsi_transport_fc scsi_tgt uhci_hcd 
> mptsas mptscsih ehci_hcd mptbase bnx2 scsi_transport_sas scsi_mod [last 
> unloaded: scsi_wait_scan]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316694]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316715] 
> Pid: 0, comm: kworker/0:0 Not tainted 2.6.37-dsiun-110105 #17 
> 0MY736/PowerEdge M600
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316761] 
> RIP: 0010:[<ffffffff8130e6bf>]  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316808] 
> RSP: 0018:ffff8800cfc43e20  EFLAGS: 00010202
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316834] 
> RAX: ffff8803d3158018 RBX: ffff8803d3158000 RCX: 0000000000000005
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316878] 
> RDX: 0b000209f1beadde RSI: 00000000000000ac RDI: ffffffff8152a970
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318512] 
> RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318560] 
> R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfc43ea0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318604] 
> R13: 0000000000000100 R14: ffff88040fc99fd8 R15: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318652] 
> FS:  0000000000000000(0000) GS:ffff8800cfc40000(0000) knlGS:0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318698] 
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318725] 
> CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318768] 
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318812] 
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318855] 
> Process kworker/0:0 (pid: 0, threadinfo ffff88040fc98000, task 
> ffff88040fc6c2e0)
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318901] 
> Stack:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318921]  
> 0000000000000082 00000001029221c1 00000000000248f6 ffffffff8130e988
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318971]  
> ffff88040fc90000 ffff88040fc90000 ffffffff8152a9a0 ffffffff8105e95f
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319021]  
> ffff8800cfc43e58 ffff88040fc91020 ffffffff8130e950 ffff88040fc99fd8
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319072] 
> Call Trace:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319093] <IRQ>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319116]  
> [<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319146]  
> [<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319175]  
> [<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319204]  
> [<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319232]  
> [<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319260]  
> [<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319288]  
> [<ffffffff81005f75>] ? do_softirq+0x65/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319315]  
> [<ffffffff81056745>] ? irq_exit+0x85/0x90
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319343]  
> [<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319373]  
> [<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319401] <EOI>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319427]  
> [<ffffffffa032218c>] ? acpi_idle_enter_bm+0x243/0x27b [processor]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319473]  
> [<ffffffffa0322185>] ? acpi_idle_enter_bm+0x23c/0x27b [processor]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319519]  
> [<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319547]  
> [<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319573] 
> Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b 
> 15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89 
> 51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319768] 
> RIP  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319797]  
> RSP <ffff8800cfc43e20>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319820] 
> CR2: 000000000000000d
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320187] 
> ---[ end trace eaf3ed2d46c78768 ]---
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320257] 
> Kernel panic - not syncing: Fatal exception in interrupt
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320329] 
> Pid: 0, comm: kworker/0:0 Tainted: G      D     2.6.37-dsiun-110105 #17
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320418] 
> Call Trace:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320481] 
> <IRQ>  [<ffffffff8137c75e>] ? panic+0x92/0x1a2
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320601]  
> [<ffffffff81007357>] ? oops_end+0xe7/0xf0
> 
> 
> Any ideas ??


Hi Yann

Yes this is a known problem.

Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
(inetpeer: Use correct AVL tree base pointer in inet_getpeer())

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492

I believe David will send it to stable team shortly, if not already
done :)

Thanks




* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 10:52 ` kernel 2.6.37 : oops in cleanup_once Eric Dumazet
@ 2011-02-02 11:24   ` Eric Dumazet
  2011-02-02 13:08     ` Yann Dupont
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2011-02-02 11:24 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev

On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> > Hello.
> > We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> > kernel oops since. Each oops is after ~1 week of uptime.
> > The last oops was last night but we didn't have any trace.

oops, 2.6.37 "only"

> Yes this is a known problem.
> 
> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> 
> I believe David will send it to stable team shortly, if not already
> done :)

Please ignore, this patch was for the linux-2.6 tree; 2.6.37 was not
affected by that problem.

So it's another problem... Is there anything in particular you do on this
machine?


* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 11:24   ` Eric Dumazet
@ 2011-02-02 13:08     ` Yann Dupont
  2011-02-02 14:53       ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Yann Dupont @ 2011-02-02 13:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev

On 02/02/2011 12:24, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>> Hello.
>>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
>>> kernel oops since. Each oops is after ~1 week of uptime.
>>> The last oops was last night but we didn't have any trace.
> oops, 2.6.37 "only"
>
>> Yes this is a known problem.
>>
>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>
>> I believe David will send it to stable team shortly, if not already
>> done :)
> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> affected by the problem.
>
> So its another problem... Is there anything particular you do on this
> machine ?
>
>
>
>
Nothing really special there: we run a lot (20) of KVM guests (mainly
Linux firewalls for lots of different VLANs), so we have a lot of
bridges, VLANs and tun/tap devices.
Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
bug already sent to netdev - more to come in the next mail)

Hard to say if this BUG is new in 2.6.37. This host was running fine 
with 2.6.34.2 since August 2010.
Bisecting will be hard due to the time to trigger the bug (and the fact 
that this machine is a production machine)

Anyway, I can test with a specific kernel version if you suspect something.

Regards,


-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 13:08     ` Yann Dupont
@ 2011-02-02 14:53       ` Eric Dumazet
  2011-02-02 15:04         ` Yann Dupont
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2011-02-02 14:53 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev

On Wednesday, 02 February 2011 at 14:08 +0100, Yann Dupont wrote:
> On 02/02/2011 12:24, Eric Dumazet wrote:
> > On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> >> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> >>> Hello.
> >>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> >>> kernel oops since. Each oops is after ~1 week of uptime.
> >>> The last oops was last night but we didn't have any trace.
> > oops, 2.6.37 "only"
> >
> >> Yes this is a known problem.
> >>
> >> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> >> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> >>
> >> I believe David will send it to stable team shortly, if not already
> >> done :)
> > Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> > affected by the problem.
> >
> > So its another problem... Is there anything particular you do on this
> > machine ?
> >
> >
> >
> >
> Nothing really special there, we run a lot (20) of KVM guest (mainly 
> linux firewalls for lots of differents vlan), so we have a lot of 
> bridges vlan & tun/tap.
> Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of  the other 
> bug already sent to netdev - more to come on next mail)
> 
> Hard to say if this BUG is new in 2.6.37. This host was running fine 
> with 2.6.34.2 since August 2010.
> Bisecting will be hard due to the time to trigger the bug (and the fact 
> that this machine is a production machine)
> 
> Anyway, I can test with a specific kernel version if you suspect something.
> 

I suspect a memory corruption from another layer (not inetpeer).

Unfortunately, many kmem caches share the "64 bytes" cache.

Could you please add "slub_nomerge" to your boot command line?

This way, we can separate corruptions on each cache.

In your crash, one inetpeer contains garbage in its unused_lists next/prev
pointers:

RCX: 0000000000000005
RDX: 0b000209f1beadde

Definitely something overwrote these values with non-pointer values.
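
For anyone reading along, here is a small sketch of why the two register
values above cannot be list pointers, and why the fault lands at 0xd. It
assumes the usual x86-64 layout of struct list_head (next at offset 0, prev
at offset 8) and uses a rough "upper half of the address space" test for
kernel pointers. The helper names are made up for illustration and are not
kernel APIs.

```python
# Illustrative sketch only. Assumptions: x86-64, 8-byte pointers, and
# struct list_head { next; prev; } so that prev sits at offset 8.
LIST_HEAD_PREV_OFFSET = 8

# Rough heuristic (an assumption, not a kernel rule): kernel virtual
# addresses on x86-64 live in the upper canonical half.
KERNEL_SPACE_START = 0xFFFF800000000000


def looks_like_kernel_pointer(value: int) -> bool:
    """Return True if value falls in the upper canonical half."""
    return value >= KERNEL_SPACE_START


def list_del_fault_address(next_ptr: int) -> int:
    """Unlinking an entry writes next->prev, i.e. stores at next + 8."""
    return next_ptr + LIST_HEAD_PREV_OFFSET


# Register values copied from the first oops.
rcx = 0x0000000000000005  # loaded as the entry's "next" pointer
rdx = 0x0B000209F1BEADDE  # loaded as the entry's "prev" pointer
rax = 0xFFFF8803D3158018  # address of the list entry itself, for contrast

for name, value in (("RCX", rcx), ("RDX", rdx), ("RAX", rax)):
    verdict = looks_like_kernel_pointer(value)
    print(f"{name}={value:#018x} plausible kernel pointer: {verdict}")

print(f"predicted fault address: {list_del_fault_address(rcx):#x}")
```

The predicted address, 0xd, matches the CR2 value reported in the oops,
which is consistent with a store to next->prev through a bogus next
pointer of 5.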


* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 14:53       ` Eric Dumazet
@ 2011-02-02 15:04         ` Yann Dupont
  2011-02-02 15:08           ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Yann Dupont @ 2011-02-02 15:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev

On 02/02/2011 15:53, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 14:08 +0100, Yann Dupont wrote:
>> On 02/02/2011 12:24, Eric Dumazet wrote:
>>> On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>>>> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>>>> Hello.
>>>>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
>>>>> kernel oops since. Each oops is after ~1 week of uptime.
>>>>> The last oops was last night but we didn't have any trace.
>>> oops, 2.6.37 "only"
>>>
>>>> Yes this is a known problem.
>>>>
>>>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>>>
>>>> I believe David will send it to stable team shortly, if not already
>>>> done :)
>>> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
>>> affected by the problem.
>>>
>>> So its another problem... Is there anything particular you do on this
>>> machine ?
>>>
>>>
>>>
>>>
>> Nothing really special there, we run a lot (20) of KVM guest (mainly
>> linux firewalls for lots of differents vlan), so we have a lot of
>> bridges vlan&  tun/tap.
>> Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of  the other
>> bug already sent to netdev - more to come on next mail)
>>
>> Hard to say if this BUG is new in 2.6.37. This host was running fine
>> with 2.6.34.2 since August 2010.
>> Bisecting will be hard due to the time to trigger the bug (and the fact
>> that this machine is a production machine)
>>
>> Anyway, I can test with a specific kernel version if you suspect something.
>>
> I suspect a mem corruption from another layer (not inetpeer)
>
> Unfortunately many kmem caches share the "64 bytes" cache.
>
> Could you please add "slub_nomerge" on your boot command ?
>
Ok, will do it at 18:30 CET (to minimize impact).
Is the suspected bug SLUB related?

The 2.6.34.2 kernel previously used on that server used SLAB.


2 questions :
-How can I be sure slub_nomerge is active ? Boot message ?
-Is there a very severe impact on performance ?

Regards,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 15:04         ` Yann Dupont
@ 2011-02-02 15:08           ` Eric Dumazet
  2011-02-02 17:59             ` Yann Dupont
  2011-03-14 10:44             ` Yann Dupont
  0 siblings, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2011-02-02 15:08 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev

On Wednesday, 02 February 2011 at 16:04 +0100, Yann Dupont wrote:
> >
> Ok, will do it at 18:30 CET (to minimize impact)
> Is the suspected bug SLUB related ?
> 

No: it can be a corruption from another part of the kernel.

> The 2.6.34.2 kernel previously used on that server used SLAB.
> 
> 
> 2 questions :
> -How can I be sure slub_nomerge is active ? Boot message ?


# ls -l /sys/kernel/slab/

If you have symlinks: merging is on (the default)

If you don't have symlinks: nomerge is in effect
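
The same check can be scripted. Below is a minimal sketch, assuming the
SLUB sysfs directory /sys/kernel/slab mentioned above and a readable
/proc/cmdline. The function names are hypothetical. Grouping the symlinks
by target also shows which cache names currently share one merged cache.

```python
import os
from collections import defaultdict


def merged_cache_aliases(slab_dir="/sys/kernel/slab"):
    """Map each symlink target in slab_dir to the cache names pointing at it.

    An empty result means no symlinks were found, which is what the
    directory should look like when slub_nomerge is in effect.
    """
    aliases = defaultdict(list)
    with os.scandir(slab_dir) as entries:
        for entry in entries:
            if entry.is_symlink():
                aliases[os.readlink(entry.path)].append(entry.name)
    return {target: sorted(names) for target, names in aliases.items()}


def nomerge_on_cmdline(cmdline):
    """Return True if slub_nomerge appears as a token on a command line."""
    return "slub_nomerge" in cmdline.split()


if __name__ == "__main__" and os.path.isdir("/sys/kernel/slab"):
    aliases = merged_cache_aliases()
    if aliases:
        print("symlinks found: merging is on")
        for target, names in sorted(aliases.items()):
            print(f"  {target}: {', '.join(names)}")
    else:
        print("no symlinks: nomerge appears to be in effect")
    with open("/proc/cmdline") as cmdline_file:
        print("slub_nomerge on command line:",
              nomerge_on_cmdline(cmdline_file.read()))
```

Run it once before and once after the reboot. The list of aliases should
be empty once the option is active.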

> -Is there a very severe impact on performance ?
> 

not at all

> Regards,
> 


* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 15:08           ` Eric Dumazet
@ 2011-02-02 17:59             ` Yann Dupont
  2011-03-14 10:44             ` Yann Dupont
  1 sibling, 0 replies; 9+ messages in thread
From: Yann Dupont @ 2011-02-02 17:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev

On 02/02/2011 16:08, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 16:04 +0100, Yann Dupont wrote:
>> Ok, will do it at 18:30 CET (to minimize impact)
>> Is the suspected bug SLUB related ?
>>
> no : It can be a corruption from another part of kernel.
>
>> The 2.6.34.2 kernel previously used on that server used SLAB.
>>
>>
>> 2 questions :
>> -How can I be sure slub_nomerge is active ? Boot message ?
>
> # ls -l /sys/kernel/slab/
>
> If you have symlinks : merge is on (default)
>
> If you dont have symlinks : nomerge is in action
>
>> -Is there a very severe impact on performance ?
>>
> not at all
>
>> Regards,
>>
>
Well, the server had the good taste to oops at 18:05, 25 minutes before
the planned reboot :)

Here is the oops (I think it's much the same):


Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128042] 
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128097] 
IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128146] PGD 0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128173] 
Oops: 0002 [#1] SMP
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128200] 
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128250] CPU 7
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128260] 
Modules linked in: dell_rbu acpi_cpufreq freq_table mperf nls_utf8
nls_cp437 btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix
ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 ext3 jbd tun
ipt_MASQUERADE iptable_nat nf_nat ipt_REJECT kvm_intel kvm xt_physdev
ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter
ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 ipv6 8021q bridge stp ext2 mbcache fuse
snd_pcm snd_timer ghes hed button snd soundcore i5000_edac edac_core
processor shpchp tpm_tis pci_hotplug tpm rng_core snd_page_alloc i5k_amb
dcdbas tpm_bios joydev evdev psmouse pcspkr serio_raw thermal_sys xfs
exportfs dm_mod sg sr_mod cdrom sd_mod usbhid hid usb_storage qla2xxx
scsi_transport_fc scsi_tgt uhci_hcd mptsas mptscsih mptbase bnx2
scsi_transport_sas scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128834]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128855] 
Pid: 0, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 
0MY736/PowerEdge M600
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128901] 
RIP: 0010:[<ffffffff8130e6bf>]  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128948] 
RSP: 0018:ffff8800cfdc3e20  EFLAGS: 00010206
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128974] 
RAX: ffff8803a7e0ea18 RBX: ffff8803a7e0ea00 RCX: 0000000000000005
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129003] 
RDX: adde806c0d860b00 RSI: 0000000000000096 RDI: ffffffff8152a970
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129032] 
RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129062] 
R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfdc3ea0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129091] 
R13: 0000000000000100 R14: ffff88040fd29fd8 R15: 0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129121] 
FS:  0000000000000000(0000) GS:ffff8800cfdc0000(0000) knlGS:0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129166] 
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129193] 
CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129223] 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129252] 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129282] 
Process kworker/0:1 (pid: 0, threadinfo ffff88040fd28000, task 
ffff88040fce6450)
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129327] Stack:
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129347]  
0000000000000082 00000001008d3b66 00000000000248f6 ffffffff8130e988
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129397]  
ffff88040fd24000 ffff88040fd24000 ffffffff8152a9a0 ffffffff8105e95f
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129446]  
ffff8800cfdc3e58 ffff88040fd25020 ffffffff8130e950 ffff88040fd29fd8
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129496] 
Call Trace:
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129523] <IRQ>
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129551]  
[<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129581]  
[<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129609]  
[<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129638]  
[<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129666]  
[<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129694]  
[<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129722]  
[<ffffffff81005f75>] ? do_softirq+0x65/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129748]  
[<ffffffff81056745>] ? irq_exit+0x85/0x90
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129776]  
[<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129806]  
[<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129833] <EOI>
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129857]  
[<ffffffff8123f5ce>] ? acpi_hw_register_read+0x54/0xe2
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129890]  
[<ffffffffa01c52b8>] ? acpi_idle_enter_simple+0xf4/0x126 [processor]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129936]  
[<ffffffffa01c52b1>] ? acpi_idle_enter_simple+0xed/0x126 [processor]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131555]  
[<ffffffffa01c5034>] ? acpi_idle_enter_bm+0xeb/0x27b [processor]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131591]  
[<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131619]  
[<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131645] 
Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b 
15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89 
51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131847] 
RIP  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131876]  
RSP <ffff8800cfdc3e20>
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131898] 
CR2: 000000000000000d
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132280] 
---[ end trace a9f45436c3b7c143 ]---
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132350] 
Kernel panic - not syncing: Fatal exception in interrupt
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132422] 
Pid: 0, comm: kworker/0:1 Tainted: G      D     2.6.37-dsiun-110105 #17
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132510] 
Call Trace:
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132574] 
<IRQ>  [<ffffffff8137c75e>] ? panic+0x92/0x1a2

I also have a screenshot with more details. I'll send it in a
private message.

Since 18:30, the server has been running with slub_nomerge.

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: kernel 2.6.37 : oops in cleanup_once
  2011-02-02 15:08           ` Eric Dumazet
  2011-02-02 17:59             ` Yann Dupont
@ 2011-03-14 10:44             ` Yann Dupont
  2011-03-14 13:14               ` Eric Dumazet
  1 sibling, 1 reply; 9+ messages in thread
From: Yann Dupont @ 2011-03-14 10:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev

On 02/02/2011 16:08, Eric Dumazet wrote:


> I suspect a mem corruption from another layer (not inetpeer)
>
> Unfortunately many kmem caches share the "64 bytes" cache.
>
> Could you please add "slub_nomerge" on your boot command ?
>
...

>
>> -Is there a very severe impact on performance ?
>>
> not at all
>
Maybe there is an impact after all: since then, we haven't had any
problems!

linkwood:~# uptime
  11:42:03 up 39 days, 17:08,  3 users,  load average: 0.01, 0.03, 0.05

So... could slub_nomerge hide or simply avoid the problem?
Or are we just lucky this time?


-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



* Re: kernel 2.6.37 : oops in cleanup_once
  2011-03-14 10:44             ` Yann Dupont
@ 2011-03-14 13:14               ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2011-03-14 13:14 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev

On Monday, 14 March 2011 at 11:44 +0100, Yann Dupont wrote:
> On 02/02/2011 16:08, Eric Dumazet wrote:
> 
> 
> > I suspect a mem corruption from another layer (not inetpeer)
> >
> > Unfortunately many kmem caches share the "64 bytes" cache.
> >
> > Could you please add "slub_nomerge" on your boot command ?
> >
> ...
> 
> >
> >> -Is there a very severe impact on performance ?
> >>
> > not at all
> >
> Maybe there is an impact after all : since then, we don't have problems 
> anymore !
> 
> linkwood:~# uptime
>   11:42:03 up 39 days, 17:08,  3 users,  load average: 0.01, 0.03, 0.05
> 
> So... could slub_nomerge hide or simply avoid the problem ?
> Or are we just lucky this time ?
> 
> 

I would say you are lucky ;)

Not all memory corruptions are noticed. Sometimes they touch unused parts
of memory, or parts with no critical content.




