* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
@ 2004-02-24 6:38 Manfred Spraul
2004-02-24 6:56 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2004-02-24 6:38 UTC (permalink / raw)
To: Darren Williams; +Cc: linux-kernel
From your logs:
>Feb 23 14:54:24 calypso kernel: Slab corruption: start=e00000017e84ea00, expend=e00000017e84f1ff, problemat=e00000017e84f020
>Feb 23 14:54:24 calypso kernel: Last user: [<a0000001003c9f30>](kfree_skbmem+0x30/0x80)
>Feb 23 14:54:24 calypso kernel: Data: *****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
>Feb 23 14:54:28 calypso kernel: **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************6A ******************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
>Feb 23 14:54:28 calypso kernel: ************************************************************A5
>
>
"6a" instead of 0x6b. One bit is wrong, this is often an indication of a
hardware problem. Do you use ECC memory and is ECC enabled in the BIOS?
--
Manfred
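
Editorial note: the "one bit is wrong" observation can be checked mechanically. The following is a minimal user-space sketch, not kernel code. The helper names `bits_differing` and `is_single_bit_flip` are hypothetical; only the byte values 0x6b (expected poison) and 0x6a (observed) come from the log above.

```c
#include <assert.h>
#include <stdint.h>

/* Count the bits that differ between the expected and the observed byte. */
static int bits_differing(uint8_t expected, uint8_t observed)
{
	uint8_t diff = expected ^ observed;
	int count = 0;

	while (diff) {
		count += diff & 1;
		diff >>= 1;
	}
	return count;
}

/* Non-zero if exactly one bit differs between the two bytes. */
static int is_single_bit_flip(uint8_t expected, uint8_t observed)
{
	return bits_differing(expected, observed) == 1;
}
```

For the values in the log, 0x6b ^ 0x6a is 0x01, so exactly one bit (bit 0) differs. Note that 0x6a is also what 0x6b becomes after being decremented by one, so this check alone cannot distinguish a hardware bit flip from a stray software decrement.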
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) 2004-02-24 6:38 [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Manfred Spraul @ 2004-02-24 6:56 ` Andrew Morton 2004-02-24 8:45 ` Darren Williams ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Andrew Morton @ 2004-02-24 6:56 UTC (permalink / raw) To: Manfred Spraul; +Cc: dsw, linux-kernel Manfred Spraul <manfred@colorfullife.com> wrote: > > From your logs: > > >Feb 23 14:54:24 calypso kernel: Slab corruption: start=e00000017e84ea00, expend=e00000017e84f1ff, problemat=e00000017e84f020 > >Feb 23 14:54:24 calypso kernel: Last user: [<a0000001003c9f30>](kfree_skbmem+0x30/0x80) > >Feb 23 14:54:24 calypso kernel: Data: ************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************! **! 
> *************************************** > >Feb 23 14:54:28 calypso kernel: **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************6A *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************! **! > *************************************** > >Feb 23 14:54:28 calypso kernel: ************************************************************A5 > > > > > "6a" instead of 0x6b. One bit is wrong, this is often an indication of a > hardware problem. Do you use ECC memory and is ECC enabled in the BIOS? Actually, it's often caused by someone doing atomic_dec_and_test() against something which was already freed. Or spin_lock(). One would need to work out what field is at that offset. If it is an atomic_t or a spinlock_t, there you are. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-24 6:56 ` Andrew Morton
@ 2004-02-24 8:45 ` Darren Williams
2004-02-24 17:40 ` Manfred Spraul
2004-02-25 6:17 ` Darren Williams
2 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-24 8:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, manfred

Yes, the machine is using ECC, though I will need to confirm tomorrow
that it is enabled; from memory it is. I am currently running memtest
over the memory, however it is limited to testing 2GB because it uses
malloc, so this is not a reliable test. I will continue to swap the
memory modules to see if I can find a failed module. If this fails to
help I will then look for the offending atomic_t or spinlock_t.

Thanks for the replies
Darren

Mon, 23 Feb 2004, Andrew Morton wrote:

> Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> > From your logs:
> >
> > >Feb 23 14:54:24 calypso kernel: Slab corruption: start=e00000017e84ea00, expend=e00000017e84f1ff, problemat=e00000017e84f020
> > >Feb 23 14:54:24 calypso kernel: Last user: [<a0000001003c9f30>](kfree_skbmem+0x30/0x80)
> > >Feb 23 14:54:24 calypso kernel: Data:
************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************! > **! > > *************************************** > > >Feb 23 14:54:28 calypso kernel: **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************6A *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************! > **! 
> > *************************************** > > >Feb 23 14:54:28 calypso kernel: ************************************************************A5 > > > > > > > > "6a" instead of 0x6b. One bit is wrong, this is often an indication of a > > hardware problem. Do you use ECC memory and is ECC enabled in the BIOS? > > Actually, it's often caused by someone doing atomic_dec_and_test() against > something which was already freed. Or spin_lock(). One would need to work > out what field is at that offset. If it is an atomic_t or a spinlock_t, > there you are. > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -------------------------------------------------- Darren Williams <dsw AT gelato.unsw.edu.au> Gelato@UNSW <www.gelato.unsw.edu.au> -------------------------------------------------- ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-24 6:56 ` Andrew Morton
2004-02-24 8:45 ` Darren Williams
@ 2004-02-24 17:40 ` Manfred Spraul
2004-02-25 0:58 ` Darren Williams
2004-02-25 6:17 ` Darren Williams
2 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2004-02-24 17:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: dsw, linux-kernel

Andrew Morton wrote:

>Actually, it's often caused by someone doing atomic_dec_and_test() against
>something which was already freed.
>
The previous user is always kfree_skbmem - I would be surprised if there
are atomic operations against the skb data area.

Darren, could you try the latest bk snapshot? Linus yesterday merged a
patch that hexdumps the affected bytes. We must try to find a pattern -
always same offset into a page, always same physical address, always
same offset into the object, etc.

--
Manfred
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-24 17:40 ` Manfred Spraul
@ 2004-02-25 0:58 ` Darren Williams
2004-02-25 1:05 ` Anton Blanchard
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25 0:58 UTC (permalink / raw)
To: Manfred Spraul; +Cc: LKML, akpm

Hi Manfred

I have updated to the latest bk and new output can be found at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
kern-log-bk

Also I am quite confident that it is not a hardware problem.

I took a look at alloc_skb(..) and there is a reference to
an atomic_t token with this being the most suspect

150> atomic_set(&(skb_shinfo(skb)->dataref), 1);

Not sure though.
Darren

On Tue, 24 Feb 2004, Manfred Spraul wrote:

> Andrew Morton wrote:
>
> >Actually, it's often caused by someone doing atomic_dec_and_test() against
> >something which was already freed.
> >
> The previous user is always kfree_skbmem - I would be surprised if there
> are atomic operations against the skb data area.
>
> Darren, could you try the latest bk snapshot? Linus yesterday merged a
> patch that hexdumps the affected bytes. We must try to find a pattern -
> always same offset into a page, always same physical address, always
> same offset into the object, etc.
>
> --
> Manfred
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 0:58 ` Darren Williams
@ 2004-02-25 1:05 ` Anton Blanchard
2004-02-25 6:21 ` Manfred Spraul
2004-02-25 10:18 ` Peter Chubb
2 siblings, 0 replies; 17+ messages in thread
From: Anton Blanchard @ 2004-02-25 1:05 UTC (permalink / raw)
To: Darren Williams; +Cc: Manfred Spraul, LKML, akpm

> I have updated to the latest bk and new output can be found at:
> http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
> kern-log-bk
>
> Also I am quite confident that it is not a hardware problem.

Didn't Itanium 1 use that dodgy software IOMMU code? From memory you
started having problems at around 2GB; perhaps that's near the cutoff
point.

Anton
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 0:58 ` Darren Williams
2004-02-25 1:05 ` Anton Blanchard
@ 2004-02-25 6:21 ` Manfred Spraul
2004-02-25 7:03 ` David S. Miller
2004-02-25 8:55 ` Darren Williams
2004-02-25 10:18 ` Peter Chubb
2 siblings, 2 replies; 17+ messages in thread
From: Manfred Spraul @ 2004-02-25 6:21 UTC (permalink / raw)
To: Darren Williams; +Cc: LKML, akpm

Darren Williams wrote:

>Hi Manfred
>
>I have updated to the latest bk and new output can be found at:
>http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
>kern-log-bk
>
>Also I am quite confident that it is not a hardware problem.
>
>I took a look at alloc_skb(..) and there is a reference to
>an atomic_t token with this being the most suspect
>
>150> atomic_set(&(skb_shinfo(skb)->dataref), 1);
>
>
I don't think so:
The allocation that generates the error is skb->head: The cache name is
"size-2048", thus the allocation is a kmalloc(1000-2000, probably 1536
for one eth frame). The skb itself is allocated from the skbuff_head_cache.

I don't see a pattern in the virtual addresses:
start=e000000101ee09a0, len=2048
start=e000000101ee09a0, len=2048
start=e000000101ee11b8, len=2048
start=e000000101ee19d0, len=2048
start=e000000101ee3218, len=2048
start=e00000017eed1b90, len=2048
start=e00000017eed23a8, len=2048
start=e00000017eed2bc0, len=2048
start=e00000017eed4308, len=2048
start=e00000017eed5338, len=2048
start=e00000017eed5338, len=2048
start=e00000017eed5b50, len=2048
start=e00000017eed5b50, len=2048
start=e00000017eed6b80, len=2048
start=e00000017eed82c8, len=2048
start=e00000017eedc288, len=2048
start=e00000017eedcaa0, len=2048
start=e00000017eeddad0, len=2048
start=e00000017eede2e8, len=2048
start=e00000017eedeb00, len=2048
start=e00000017ef60a60, len=2048
start=e00000017ef61a90, len=2048
start=e00000017ef622a8, len=2048
start=e00000017ef62ac0, len=2048
start=e00000017ef632d8, len=2048
start=e00000017ef65a50, len=2048
start=e00000017ef65a50, len=2048
start=e00000017ef65a50, len=2048
start=e00000017ef66a80, len=2048

That virtually rules out a bad memory chip.

But the corrupted byte is always at offset 0x620 into the allocation:
Slab corruption: start=e00000017ef65a50, len=2048
620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
Slab corruption: start=e000000101ee19d0, len=2048
620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
Slab corruption: start=e000000101ee3218, len=2048
620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
Slab corruption: start=e00000017ef66a80, len=2048
620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
Slab corruption: start=e000000101ee11b8, len=2048
620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

0x620 (1568) is behind the end of the actual eth frame. Who could modify
that?

Darren, which nic do you use? Could you try what happens if you reduce
the MTU?

--
Manfred
^ permalink raw reply [flat|nested] 17+ messages in thread
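
Editorial note: the 0x620 offset can be cross-checked against the very first report in this thread, which printed start=e00000017e84ea00, expend=e00000017e84f1ff and problemat=e00000017e84f020. The following is a minimal user-space sketch; the function names `corruption_offset` and `object_length` are hypothetical (not kernel helpers), and treating `expend` as the inclusive last byte of the object is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the first corrupted byte from the start of the slab object. */
static uint64_t corruption_offset(uint64_t start, uint64_t problem_at)
{
	return problem_at - start;
}

/* Object length, assuming "expend" is the address of the last byte. */
static uint64_t object_length(uint64_t start, uint64_t end_inclusive)
{
	return end_inclusive - start + 1;
}
```

With the values from the first report, the offset comes out as 0x620 (1568), the same offset as in the hexdumps, and the length as 0x800 (2048), which is consistent with the size-2048 cache.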
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 6:21 ` Manfred Spraul
@ 2004-02-25 7:03 ` David S. Miller
2004-02-25 7:22 ` Andrew Morton
2004-02-25 8:55 ` Darren Williams
1 sibling, 1 reply; 17+ messages in thread
From: David S. Miller @ 2004-02-25 7:03 UTC (permalink / raw)
To: Manfred Spraul; +Cc: dsw, linux-kernel, akpm

On Wed, 25 Feb 2004 07:21:56 +0100
Manfred Spraul <manfred@colorfullife.com> wrote:

> 0x620 (1568) is behind the end of the actual eth frame. Who could modify
> that?

At the end of the SKB data area is where we keep struct skb_shared_info, something
is messing with the SKB state after a free it appears.

And since it's turning the debugging value 0x6b to 0x6a it must be the
"atomic_t dataref;" that is being mucked with.
^ permalink raw reply [flat|nested] 17+ messages in thread
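
Editorial note: the effect described above can be reproduced in user space. This is only a sketch under stated assumptions: the counter is taken to be 32 bits wide and stored little-endian (as on ia64 Linux), `POISON_FREE` is defined locally with the 0x6b value seen in the logs, and `decrement_poisoned_counter` is a hypothetical stand-in for a decrement of dataref on already freed memory. It does not use any kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define POISON_FREE 0x6b  /* byte value seen in the freed objects in the logs */

/*
 * Build a 32-bit word out of poison bytes, decrement it once, and write
 * the resulting bytes to out[] in little-endian order (lowest byte first).
 */
static void decrement_poisoned_counter(uint8_t out[4])
{
	uint32_t counter = 0;
	int i;

	for (i = 0; i < 4; i++)
		counter |= (uint32_t)POISON_FREE << (8 * i);  /* 0x6b6b6b6b */

	counter -= 1;  /* stray decrement after the free: 0x6b6b6b6a */

	for (i = 0; i < 4; i++)
		out[i] = (uint8_t)((counter >> (8 * i)) & 0xff);
}
```

The resulting byte sequence is 6a 6b 6b 6b, which matches the start of the `620:` hexdump lines earlier in the thread: only the first byte of the field changes, and it changes by exactly one.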
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 7:03 ` David S. Miller
@ 2004-02-25 7:22 ` Andrew Morton
2004-02-25 8:24 ` Darren Williams
2004-02-25 17:18 ` Manfred Spraul
0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2004-02-25 7:22 UTC (permalink / raw)
To: David S. Miller; +Cc: manfred, dsw, linux-kernel

"David S. Miller" <davem@redhat.com> wrote:
>
> On Wed, 25 Feb 2004 07:21:56 +0100
> Manfred Spraul <manfred@colorfullife.com> wrote:
>
> > 0x620 (1568) is behind the end of the actual eth frame. Who could modify
> > that?
>
> At the end of the SKB data area is where we keep struct skb_shared_info, something
> is messing with the SKB state after a free it appears.
>
> And since it's turning the debugging value 0x6b to 0x6a it must be the
> "atomic_t dataref;" that is being mucked with.

Ah-hah.

This should find it:


 25-akpm/include/linux/skbuff.h | 1 +
 25-akpm/net/core/dev.c | 1 +
 25-akpm/net/core/skbuff.c | 6 ++++++
 3 files changed, 8 insertions(+)

diff -puN include/linux/skbuff.h~dataref-debug include/linux/skbuff.h
--- 25/include/linux/skbuff.h~dataref-debug	Tue Feb 24 23:18:56 2004
+++ 25-akpm/include/linux/skbuff.h	Tue Feb 24 23:19:20 2004
@@ -140,6 +140,7 @@ struct skb_frag_struct {
  */
 struct skb_shared_info {
 	atomic_t dataref;
+	int debug;
 	unsigned int nr_frags;
 	unsigned short tso_size;
 	unsigned short tso_segs;
diff -puN net/core/dev.c~dataref-debug net/core/dev.c
--- 25/net/core/dev.c~dataref-debug	Tue Feb 24 23:18:56 2004
+++ 25-akpm/net/core/dev.c	Tue Feb 24 23:19:34 2004
@@ -1272,6 +1272,7 @@ int __skb_linearize(struct sk_buff *skb,
 	/* Set up shinfo */
 	ninfo = (struct skb_shared_info*)(data + size);
 	atomic_set(&ninfo->dataref, 1);
+	ninfo->debug = 0;
 	ninfo->tso_size = skb_shinfo(skb)->tso_size;
 	ninfo->tso_segs = skb_shinfo(skb)->tso_segs;
 	ninfo->nr_frags = 0;
diff -puN net/core/skbuff.c~dataref-debug net/core/skbuff.c
--- 25/net/core/skbuff.c~dataref-debug	Tue Feb 24 23:18:56 2004
+++ 25-akpm/net/core/skbuff.c	Tue Feb 24 23:21:36 2004
@@ -148,6 +148,7 @@ struct sk_buff *alloc_skb(unsigned int s
 	skb->end = data + size;

 	atomic_set(&(skb_shinfo(skb)->dataref), 1);
+	skb_shinfo(skb)->debug = 0;
 	skb_shinfo(skb)->nr_frags = 0;
 	skb_shinfo(skb)->tso_size = 0;
 	skb_shinfo(skb)->tso_segs = 0;
@@ -184,6 +185,9 @@ static void skb_clone_fraglist(struct sk

 void skb_release_data(struct sk_buff *skb)
 {
+	if (!skb->cloned)
+		WARN_ON(skb_shinfo(skb)->debug != 0);
+
 	if (!skb->cloned ||
 	    atomic_dec_and_test(&(skb_shinfo(skb)->dataref))) {
 		if (skb_shinfo(skb)->nr_frags) {
@@ -320,6 +324,7 @@ struct sk_buff *skb_clone(struct sk_buff
 	C(tail);
 	C(end);

+	WARN_ON(skb_shinfo(skb)->debug != 0);
 	atomic_inc(&(skb_shinfo(skb)->dataref));
 	skb->cloned = 1;

@@ -526,6 +531,7 @@ int pskb_expand_head(struct sk_buff *skb
 	skb->h.raw += off;
 	skb->nh.raw += off;
 	skb->cloned = 0;
+	skb_shinfo(skb)->debug = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;

_
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 7:22 ` Andrew Morton
@ 2004-02-25 8:24 ` Darren Williams
2004-02-25 17:18 ` Manfred Spraul
1 sibling, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25 8:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: davem, manfred, LKML

OK the patch is applied and the results are at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
*-latest-skb

Darren

On Tue, 24 Feb 2004, Andrew Morton wrote:

> "David S. Miller" <davem@redhat.com> wrote:
> >
> > On Wed, 25 Feb 2004 07:21:56 +0100
> > Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> > > 0x620 (1568) is behind the end of the actual eth frame. Who could modify
> > > that?
> >
> > At the end of the SKB data area is where we keep struct skb_shared_info, something
> > is messing with the SKB state after a free it appears.
> >
> > And since it's turning the debugging value 0x6b to 0x6a it must be the
> > "atomic_t dataref;" that is being mucked with.
>
> Ah-hah.
>
> This should find it:
>
>
>  25-akpm/include/linux/skbuff.h | 1 +
>  25-akpm/net/core/dev.c | 1 +
>  25-akpm/net/core/skbuff.c | 6 ++++++
>  3 files changed, 8 insertions(+)
>
> diff -puN include/linux/skbuff.h~dataref-debug include/linux/skbuff.h
> --- 25/include/linux/skbuff.h~dataref-debug	Tue Feb 24 23:18:56 2004
> +++ 25-akpm/include/linux/skbuff.h	Tue Feb 24 23:19:20 2004
> @@ -140,6 +140,7 @@ struct skb_frag_struct {
>   */
>  struct skb_shared_info {
>  	atomic_t dataref;
> +	int debug;
>  	unsigned int nr_frags;
>  	unsigned short tso_size;
>  	unsigned short tso_segs;
> diff -puN net/core/dev.c~dataref-debug net/core/dev.c
> --- 25/net/core/dev.c~dataref-debug	Tue Feb 24 23:18:56 2004
> +++ 25-akpm/net/core/dev.c	Tue Feb 24 23:19:34 2004
> @@ -1272,6 +1272,7 @@ int __skb_linearize(struct sk_buff *skb,
>  	/* Set up shinfo */
>  	ninfo = (struct skb_shared_info*)(data + size);
>  	atomic_set(&ninfo->dataref, 1);
> +	ninfo->debug = 0;
>  	ninfo->tso_size = skb_shinfo(skb)->tso_size;
>  	ninfo->tso_segs = skb_shinfo(skb)->tso_segs;
>  	ninfo->nr_frags = 0;
> diff -puN net/core/skbuff.c~dataref-debug net/core/skbuff.c
> --- 25/net/core/skbuff.c~dataref-debug	Tue Feb 24 23:18:56 2004
> +++ 25-akpm/net/core/skbuff.c	Tue Feb 24 23:21:36 2004
> @@ -148,6 +148,7 @@ struct sk_buff *alloc_skb(unsigned int s
>  	skb->end = data + size;
>
>  	atomic_set(&(skb_shinfo(skb)->dataref), 1);
> +	skb_shinfo(skb)->debug = 0;
>  	skb_shinfo(skb)->nr_frags = 0;
>  	skb_shinfo(skb)->tso_size = 0;
>  	skb_shinfo(skb)->tso_segs = 0;
> @@ -184,6 +185,9 @@ static void skb_clone_fraglist(struct sk
>
>  void skb_release_data(struct sk_buff *skb)
>  {
> +	if (!skb->cloned)
> +		WARN_ON(skb_shinfo(skb)->debug != 0);
> +
>  	if (!skb->cloned ||
>  	    atomic_dec_and_test(&(skb_shinfo(skb)->dataref))) {
>  		if (skb_shinfo(skb)->nr_frags) {
> @@ -320,6 +324,7 @@ struct sk_buff *skb_clone(struct sk_buff
>  	C(tail);
>  	C(end);
>
> +	WARN_ON(skb_shinfo(skb)->debug != 0);
>  	atomic_inc(&(skb_shinfo(skb)->dataref));
>  	skb->cloned = 1;
>
> @@ -526,6 +531,7 @@ int pskb_expand_head(struct sk_buff *skb
>  	skb->h.raw += off;
>  	skb->nh.raw += off;
>  	skb->cloned = 0;
> +	skb_shinfo(skb)->debug = 0;
>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>  	return 0;
>
>
> _
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 7:22 ` Andrew Morton
2004-02-25 8:24 ` Darren Williams
@ 2004-02-25 17:18 ` Manfred Spraul
2004-02-26 0:30 ` Darren Williams
1 sibling, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2004-02-25 17:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: David S. Miller, dsw, linux-kernel

Andrew Morton wrote:

>Ah-hah.
>
>This should find it:
>
>
I think we should first check that skb->dataref is really the problem:
what about adding an unused field before the dataref? Something like

 struct skb_shared_info {
+	int unused;
 	atomic_t dataref;
 	int debug;

If the dataref decrease causes the problem, then the affected offset should
change to 0x628.

--
Manfred
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 17:18 ` Manfred Spraul
@ 2004-02-26 0:30 ` Darren Williams
0 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-26 0:30 UTC (permalink / raw)
To: Manfred Spraul; +Cc: davem, akpm, LKML, anton

Hi Manfred

With the introduction of the unused int the slab corruption
errors are not present.

Darren

On Wed, 25 Feb 2004, Manfred Spraul wrote:

> Andrew Morton wrote:
>
> >Ah-hah.
> >
> >This should find it:
> >
> >
> I think we should first check that skb->dataref is really the problem:
> what about adding an unused field before the dataref? Something like
>
>  struct skb_shared_info {
> +	int unused;
>  	atomic_t dataref;
>  	int debug;
>
> If the dataref decrease causes the problem, then the affected offset should
> change to 0x628.
>
> --
> Manfred
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 6:21 ` Manfred Spraul
2004-02-25 7:03 ` David S. Miller
@ 2004-02-25 8:55 ` Darren Williams
1 sibling, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25 8:55 UTC (permalink / raw)
To: Manfred Spraul; +Cc: LKML

Hi Manfred

The NIC is an onboard Intel e100pro card. Interestingly, the
allocation that has caught my attention is this:

static inline struct RxFD *speedo_rx_alloc(struct net_device *dev, int entry)
{
	struct speedo_private *sp = (struct speedo_private *)dev->priv;
	struct RxFD *rxf;
	struct sk_buff *skb;
	/* Get a fresh skbuff to replace the consumed one. */
	skb = dev_alloc_skb(PKT_BUF_SZ + sizeof(struct RxFD));

This allocation on ia64 is 1500 bytes, and as I have explained in later
e-mails, when I use the Intel e100 driver the slab corruption goes away.
So I am guessing that it may be in the eepro100 driver.

Darren

On Wed, 25 Feb 2004, Manfred Spraul wrote:

> Darren Williams wrote:
>
> >Hi Manfred
> >
> >I have updated to the latest bk and new output can be found at:
> >http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
> >kern-log-bk
> >
> >Also I am quite confident that it is not a hardware problem.
> >
> >I took a look at alloc_skb(..) and there is a reference to
> >an atomic_t token with this being the most suspect
> >
> >150> atomic_set(&(skb_shinfo(skb)->dataref), 1);
> >
> >
> I don't think so:
> The allocation that generates the error is skb->head: The cache name is
> "size-2048", thus the allocation is a kmalloc(1000-2000, probably 1536
> for one eth frame). The skb itself is allocated from the skbuff_head_cache.
>
> I don't see a pattern in the virtual addresses:
> start=e000000101ee09a0, len=2048
> start=e000000101ee09a0, len=2048
> start=e000000101ee11b8, len=2048
> start=e000000101ee19d0, len=2048
> start=e000000101ee3218, len=2048
> start=e00000017eed1b90, len=2048
> start=e00000017eed23a8, len=2048
> start=e00000017eed2bc0, len=2048
> start=e00000017eed4308, len=2048
> start=e00000017eed5338, len=2048
> start=e00000017eed5338, len=2048
> start=e00000017eed5b50, len=2048
> start=e00000017eed5b50, len=2048
> start=e00000017eed6b80, len=2048
> start=e00000017eed82c8, len=2048
> start=e00000017eedc288, len=2048
> start=e00000017eedcaa0, len=2048
> start=e00000017eeddad0, len=2048
> start=e00000017eede2e8, len=2048
> start=e00000017eedeb00, len=2048
> start=e00000017ef60a60, len=2048
> start=e00000017ef61a90, len=2048
> start=e00000017ef622a8, len=2048
> start=e00000017ef62ac0, len=2048
> start=e00000017ef632d8, len=2048
> start=e00000017ef65a50, len=2048
> start=e00000017ef65a50, len=2048
> start=e00000017ef65a50, len=2048
> start=e00000017ef66a80, len=2048
>
> That virtually rules out a bad memory chip.
>
> But the corrupted byte is always at offset 0x620 into the allocation:
> Slab corruption: start=e00000017ef65a50, len=2048
> 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> --
> Slab corruption: start=e000000101ee19d0, len=2048
> 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> --
> Slab corruption: start=e000000101ee3218, len=2048
> 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> --
> Slab corruption: start=e00000017ef66a80, len=2048
> 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> --
> Slab corruption: start=e000000101ee11b8, len=2048
> 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>
> 0x620 (1568) is behind the end of the actual eth frame. Who could modify
> that?
>
> Darren, which nic do you use? Could you try what happens if you reduce
> the MTU?
>
> --
> Manfred
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-25 0:58 ` Darren Williams
2004-02-25 1:05 ` Anton Blanchard
2004-02-25 6:21 ` Manfred Spraul
@ 2004-02-25 10:18 ` Peter Chubb
2 siblings, 0 replies; 17+ messages in thread
From: Peter Chubb @ 2004-02-25 10:18 UTC (permalink / raw)
To: Darren Williams; +Cc: Manfred Spraul, LKML, akpm

>>>>> "Darren" == Darren Williams <dsw@gelato.unsw.edu.au> writes:

Darren> Hi Manfred I have updated to the latest bk and new output can
Darren> be found at:
Darren> http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
Darren> kern-log-bk

Interesting. Offset 0x620 is well off the end of the struct skb,
which is only 256 bytes big (I think), yet the object that's having
problems is a 2k object.

Darren> I took a look at alloc_skb(..) and there is a reference to an
Darren> atomic_t token with this being the most suspect

150> atomic_set(&(skb_shinfo(skb)->dataref), 1);

No, the skb_shinfo is off in kmalloced space, not part of the slab.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-24 6:56 ` Andrew Morton
2004-02-24 8:45 ` Darren Williams
2004-02-24 17:40 ` Manfred Spraul
@ 2004-02-25 6:17 ` Darren Williams
2 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25 6:17 UTC (permalink / raw)
To: Andrew Morton, manfred; +Cc: saw, LKML

Looking at the stack trace when transferring data to the Itanium box,
the eepro100 driver seems to be producing the slab errors. To check
this I swapped over to the Intel e100 driver and the slab errors have
ceased.

A quick look at eepro100.c shows that it takes a lock in
speedo_interrupt(..). The call graph then looks something like this:

speedo_interrupt(..)
 |->speedo_rx(..)
     |->speedo_refill_rx_buffers(..)
         |->speedo_rx_alloc(..)
             |->dev_alloc_skb(..)
                 |->alloc_skb(..)

Though I do not think the lock is held when alloc_skb(..) is called?

Andrey, you would know more about what is going on in the eepro100
driver; any comments?

I have posted the latest logs at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
The files *-xffUL are the equivalent debug files. Included here is also
some disassembled code that may help. I found that the back trace of
the new slab error code limited debugging, since earlier we were able
to see the trace from the network code as well.
Darren

On Mon, 23 Feb 2004, Andrew Morton wrote:
> Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> > From your logs:
> >
> > >Feb 23 14:54:24 calypso kernel: Slab corruption: start=e00000017e84ea00, expend=e00000017e84f1ff, problemat=e00000017e84f020
> > >Feb 23 14:54:24 calypso kernel: Last user: [<a0000001003c9f30>](kfree_skbmem+0x30/0x80)
> > >Feb 23 14:54:24 calypso kernel: Data: ********[...]********
> > >Feb 23 14:54:28 calypso kernel: ********[...]********6A ********[...]********
> > >Feb 23 14:54:28 calypso kernel: ************************************************************A5
> >
> > "6a" instead of 0x6b. One bit is wrong, this is often an indication of a
> > hardware problem. Do you use ECC memory and is ECC enabled in the BIOS?
>
> Actually, it's often caused by someone doing atomic_dec_and_test() against
> something which was already freed. Or spin_lock(). One would need to work
> out what field is at that offset. If it is an atomic_t or a spinlock_t,
> there you are.
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
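Both readings of the "6a" byte in the quoted exchange above can be reproduced on a plain byte buffer. The sketch below is illustrative only: `flipped_bits` and `dec_le32` are hypothetical helpers, and modelling the decremented field as a 32-bit little-endian integer is an assumption (ia64 Linux runs little-endian, and atomic_t wraps an int). It is not kernel code.

```python
POISON = 0x6B    # value the dumps show for untouched freed memory
OBSERVED = 0x6A  # the single odd byte in each report


def flipped_bits(expected, observed):
    """Bit positions (0 = least significant) where the two bytes differ."""
    diff = expected ^ observed
    return [bit for bit in range(8) if (diff >> bit) & 1]


def dec_le32(buf, offset):
    """Decrement the 32-bit little-endian integer stored at buf[offset:offset+4]."""
    value = int.from_bytes(buf[offset:offset + 4], "little")
    value = (value - 1) & 0xFFFFFFFF
    buf[offset:offset + 4] = value.to_bytes(4, "little")


if __name__ == "__main__":
    # Reading 1: a single flipped bit (the hardware-fault theory).
    print("differing bits:", flipped_bits(POISON, OBSERVED))

    # Reading 2: a decrement applied to already-freed, poisoned memory.
    freed = bytearray([POISON] * 2048)
    dec_le32(freed, 0x620)
    print("620:", " ".join(f"{b:02x}" for b in freed[0x620:0x630]))
```

Both readings produce the same single byte, so the dump alone cannot separate them; finding out which field sits at that offset is what tells them apart.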
* [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2GB
@ 2004-02-24 0:22 Darren Williams
2004-02-24 1:14 ` [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Darren Williams
0 siblings, 1 reply; 17+ messages in thread
From: Darren Williams @ 2004-02-24 0:22 UTC (permalink / raw)
To: LKML; +Cc: Ia64 Linux

On Ia64 Itanium 1 machines with more than 2.5GB of RAM the following error is triggered.

slab error in check_poison_obj(): cache `size-16384': object was modified after freeing

The machine that triggered the error above is an

i2000 HP workstation
4gb RAM
1gb SWAP

An identical machine with 2.5GB ram produces:
slab error in check_poison_obj(): cache `size-2048': object was modified after freeing

If the amount of RAM is reduced to 2GB or less then the errors do not appear.

Kernel logs and configs can be found at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-24 0:22 [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2GB Darren Williams
@ 2004-02-24 1:14 ` Darren Williams
2004-02-26 1:09 ` Darren Williams
0 siblings, 1 reply; 17+ messages in thread
From: Darren Williams @ 2004-02-24 1:14 UTC (permalink / raw)
To: LKML; +Cc: Ia64 Linux

Hi Darren

On Tue, 24 Feb 2004, Darren Williams wrote:

>
> On Ia64 Itanium 1 machines with more than 2.5GB of RAM the following error is triggered.
>
> slab error in check_poison_obj(): cache `size-16384': object was modified after freeing
>
> The machine that triggered the error above is an
>
> i2000 HP workstation
> 4gb RAM
> 1gb SWAP
>
> An identical machine with 3GB ram produces:
                            ^^^
> slab error in check_poison_obj(): cache `size-2048': object was modified after freeing
>
> If the amount of RAM is reduced to 2.5GB or less then the errors do not appear.
                                     ^^^^^
> Kernel logs and configs can be found at:
> http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
2004-02-24 1:14 ` [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Darren Williams
@ 2004-02-26 1:09 ` Darren Williams
0 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-26 1:09 UTC (permalink / raw)
To: LKML

Where we are up to:

The machine below starts producing slab corruption errors when the
amount of RAM is 3GB or more.

The hardware has been checked and this does not appear to be a hardware
error; additional hardware that was producing different errors was
removed and the slab corruption persisted.

The driver that seems to be triggering the error is the eepro100, and
only when receiving data; transmitting data produces no errors.
The test was:
send kern-image A -> B, no errors.
send kern-image B -> A, errors.
A being the Itanium.

Using the alternative Intel e100 driver removes the slab corruption
errors.

A small modification to 'struct skb_shared_info' that places an int
before the 'atomic_t dataref' field removes the slab corruption errors.

Darren

On Tue, 24 Feb 2004, Darren Williams wrote:

> Hi Darren
>
> On Tue, 24 Feb 2004, Darren Williams wrote:
>
> >
> > On Ia64 Itanium 1 machines with more than 2.5GB of RAM the following error is triggered.
> >
> > slab error in check_poison_obj(): cache `size-16384': object was modified after freeing
> >
> > The machine that triggered the error above is an
> >
> > i2000 HP workstation
> > 4gb RAM
> > 1gb SWAP
> >
> > An identical machine with 3GB ram produces:
>                             ^^^
> > slab error in check_poison_obj(): cache `size-2048': object was modified after freeing
> >
> > If the amount of RAM is reduced to 2.5GB or less then the errors do not appear.
>                                      ^^^^^
> > Kernel logs and configs can be found at:
> > http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------
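The effect of the struct change described in the summary above (an int placed before 'atomic_t dataref') on the field layout can be illustrated with ctypes. The two classes below are hypothetical, cut-down stand-ins rather than the real 'struct skb_shared_info': only 'dataref' comes from the thread, while `pad`, `nr_frags` and the 32-bit field types are assumptions made for the sketch.

```python
import ctypes


class SharedInfoBefore(ctypes.Structure):
    # Hypothetical stand-in: dataref is the first member.
    _fields_ = [
        ("dataref", ctypes.c_int32),    # models 'atomic_t dataref'
        ("nr_frags", ctypes.c_uint32),  # placeholder for the remaining fields
    ]


class SharedInfoAfter(ctypes.Structure):
    # Same stand-in with an extra int placed in front of dataref.
    _fields_ = [
        ("pad", ctypes.c_int32),        # the added int (name is made up)
        ("dataref", ctypes.c_int32),
        ("nr_frags", ctypes.c_uint32),
    ]


if __name__ == "__main__":
    for cls in (SharedInfoBefore, SharedInfoAfter):
        print(cls.__name__,
              "dataref offset:", cls.dataref.offset,
              "size:", ctypes.sizeof(cls))
```

The sketch only shows that the counter moves four bytes further into the structure; it does not explain why the corruption reports stop.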