Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)

public inbox for linux-kernel@vger.kernel.org

* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
@ 2004-02-24  6:38 Manfred Spraul
  2004-02-24  6:56 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2004-02-24  6:38 UTC (permalink / raw)
  To: Darren Williams; +Cc: linux-kernel

 From your logs:

>Feb 23 14:54:24 calypso kernel: Slab corruption: start=e00000017e84ea00, expend=e00000017e84f1ff, problemat=e00000017e84f020
>Feb 23 14:54:24 calypso kernel: Last user: [<a0000001003c9f30>](kfree_skbmem+0x30/0x80)
>Feb 23 14:54:24 calypso kernel: Data: *****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
>Feb 23 14:54:28 calypso kernel: **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************6A ******************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
>Feb 23 14:54:28 calypso kernel: ************************************************************A5 
>  
>
"6a" instead of 0x6b. One bit is wrong, this is often an indication of a 
hardware problem. Do you use ECC memory and is ECC enabled in the BIOS?

--
    Manfred
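
As an illustration of the single-bit observation above, here is a minimal C sketch that counts how many bits differ between the expected poison byte and the byte found in the dump. The function name bits_differing is made up for this example; only the values 0x6b and 0x6a come from the log.

```c
#include <stdint.h>

/* Number of bit positions in which two bytes differ (Hamming distance). */
int bits_differing(uint8_t expected, uint8_t found)
{
    uint8_t diff = expected ^ found;    /* set bits mark the mismatches */
    int count = 0;

    while (diff) {
        diff &= (uint8_t)(diff - 1);    /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

For the values in the log, bits_differing(0x6b, 0x6a) evaluates to 1: only the lowest bit differs. That is what a flipped memory bit would look like, but a decrement by one leaves the same trace.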


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24  6:38 [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Manfred Spraul
@ 2004-02-24  6:56 ` Andrew Morton
  2004-02-24  8:45   ` Darren Williams
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Andrew Morton @ 2004-02-24  6:56 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: dsw, linux-kernel

Manfred Spraul <manfred@colorfullife.com> wrote:
>
>  From your logs:
> 
> "6a" instead of 0x6b. One bit is wrong, this is often an indication of a 
> hardware problem. Do you use ECC memory and is ECC enabled in the BIOS?

Actually, it's often caused by someone doing atomic_dec_and_test() against
something which was already freed.  Or spin_lock().  One would need to work
out what field is at that offset.  If it is an atomic_t or a spinlock_t,
there you are.
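
To illustrate the atomic_dec_and_test() case, here is a sketch under two assumptions: the freed object was filled with the 0x6b poison byte seen in the log, and the stale field is a 4-byte little-endian counter. The helper names are hypothetical, and the decrement is done on plain bytes rather than with real kernel atomics.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_BYTE 0x6b    /* fill value seen in the slab dump */

/* Fill a freed object with the poison byte. */
void poison_object(uint8_t *obj, size_t len)
{
    memset(obj, POISON_BYTE, len);
}

/* Model a stale decrement of a 4-byte little-endian counter at 'field'. */
void stale_decrement_le32(uint8_t *field)
{
    uint32_t v = (uint32_t)field[0]
               | (uint32_t)field[1] << 8
               | (uint32_t)field[2] << 16
               | (uint32_t)field[3] << 24;

    v--;                                /* 0x6b6b6b6b becomes 0x6b6b6b6a */

    field[0] = (uint8_t)(v & 0xff);
    field[1] = (uint8_t)((v >> 8) & 0xff);
    field[2] = (uint8_t)((v >> 16) & 0xff);
    field[3] = (uint8_t)((v >> 24) & 0xff);
}
```

After poisoning a buffer and applying stale_decrement_le32() to one field, the first byte of that field reads 0x6a while its neighbours still read 0x6b, which is the same signature as in the dump.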




* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24  6:56 ` Andrew Morton
@ 2004-02-24  8:45   ` Darren Williams
  2004-02-24 17:40   ` Manfred Spraul
  2004-02-25  6:17   ` Darren Williams
  2 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-24  8:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, manfred


Yes, the machine is using ECC, though I will need to confirm tomorrow that
it is enabled; from memory it is. I am currently running memtest over the
memory, however it is limited to testing 2GB because it uses malloc, so this
is not a reliable test.
I will continue to swap the memory modules to see if I can find a failed
module.
                                                                                                                             
If this fails to help I will then look for the offending atomic_t or
spinlock_t.
                                                                                                                             
Thanks for the replies
Darren 

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------


* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24  6:56 ` Andrew Morton
  2004-02-24  8:45   ` Darren Williams
@ 2004-02-24 17:40   ` Manfred Spraul
  2004-02-25  0:58     ` Darren Williams
  2004-02-25  6:17   ` Darren Williams
  2 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2004-02-24 17:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dsw, linux-kernel

Andrew Morton wrote:

>Actually, it's often caused by someone doing atomic_dec_and_test() against
>something which was already freed.
>
The previous user is always kfree_skbmem - I would be surprised if there 
are atomic operations against the skb data area.

Darren, could you try the latest bk snapshot? Linus yesterday merged a 
patch that hexdumps the affected bytes. We must try to find a pattern - 
always same offset into a page, always same physical address, always 
same offset into the object, etc.

--
    Manfred
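
The checks suggested above come down to simple address arithmetic. The following is only an illustration: the struct and function names are invented, and the 16 KB page size is an assumption (ia64 kernels can be built with other page sizes).

```c
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 16384ULL  /* assumption: 16 KB pages */

/* The two addresses printed in a "Slab corruption" report. */
struct corruption_report {
    uint64_t start;         /* first byte of the slab object */
    uint64_t problem_at;    /* address of the first bad byte */
};

/* Offset of the bad byte from the start of the object. */
uint64_t offset_in_object(const struct corruption_report *r)
{
    return r->problem_at - r->start;
}

/* Offset of the bad byte within its (assumed 16 KB) page. */
uint64_t offset_in_page(const struct corruption_report *r)
{
    return r->problem_at % ASSUMED_PAGE_SIZE;
}
```

With start=e00000017e84ea00 and problemat=e00000017e84f020 from the first report in this thread, offset_in_object() gives 0x620. Comparing these values across many reports shows whether the object offset, the page offset, or neither stays constant.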



* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24 17:40   ` Manfred Spraul
@ 2004-02-25  0:58     ` Darren Williams
  2004-02-25  1:05       ` Anton Blanchard
                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25  0:58 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: LKML, akpm

Hi Manfred

I have updated to the latest bk and new output can be found at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
kern-log-bk

Also I am quite confident that it is not a hardware problem.

I took a look at alloc_skb(..) and there is a reference to
an atomic_t token with this being the most suspect

150> atomic_set(&(skb_shinfo(skb)->dataref), 1);

Not sure though.

Darren


--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------


* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  0:58     ` Darren Williams
@ 2004-02-25  1:05       ` Anton Blanchard
  2004-02-25  6:21       ` Manfred Spraul
  2004-02-25 10:18       ` Peter Chubb
  2 siblings, 0 replies; 17+ messages in thread
From: Anton Blanchard @ 2004-02-25  1:05 UTC (permalink / raw)
  To: Darren Williams; +Cc: Manfred Spraul, LKML, akpm

 
> I have updated to the latest bk and new output can be found at:
> http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
> kern-log-bk
> 
> Also I am quite confident that it is not a hardware problem.

Didn't Itanium 1 use that dodgy software IOMMU code? From memory you
started having problems at around 2GB, perhaps that's near the cutoff point.

Anton


* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  0:58     ` Darren Williams
  2004-02-25  1:05       ` Anton Blanchard
@ 2004-02-25  6:21       ` Manfred Spraul
  2004-02-25  7:03         ` David S. Miller
  2004-02-25  8:55         ` Darren Williams
  2004-02-25 10:18       ` Peter Chubb
  2 siblings, 2 replies; 17+ messages in thread
From: Manfred Spraul @ 2004-02-25  6:21 UTC (permalink / raw)
  To: Darren Williams; +Cc: LKML, akpm

Darren Williams wrote:

>Hi Manfred
>
>I have updated to the latest bk and new output can be found at:
>http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
>kern-log-bk
>
>Also I am quite confident that it is not a hardware problem.
>
>I took a look at alloc_skb(..) and there is a reference to
>an atomic_t token with this being the most suspect
>
>150> atomic_set(&(skb_shinfo(skb)->dataref), 1);
>  
>
I don't think so:
The allocation that generates the error is skb->head: The cache name is 
"size-2048", thus the allocation is a kmalloc(1000-2000, probably 1536 
for one eth frame). The skb itself is allocated from the skbuff_head_cache.

I don't see a pattern in the virtual addresses:
 start=e000000101ee09a0, len=2048
 start=e000000101ee09a0, len=2048
 start=e000000101ee11b8, len=2048
 start=e000000101ee19d0, len=2048
 start=e000000101ee3218, len=2048
 start=e00000017eed1b90, len=2048
 start=e00000017eed23a8, len=2048
 start=e00000017eed2bc0, len=2048
 start=e00000017eed4308, len=2048
 start=e00000017eed5338, len=2048
 start=e00000017eed5338, len=2048
 start=e00000017eed5b50, len=2048
 start=e00000017eed5b50, len=2048
 start=e00000017eed6b80, len=2048
 start=e00000017eed82c8, len=2048
 start=e00000017eedc288, len=2048
 start=e00000017eedcaa0, len=2048
 start=e00000017eeddad0, len=2048
 start=e00000017eede2e8, len=2048
 start=e00000017eedeb00, len=2048
 start=e00000017ef60a60, len=2048
 start=e00000017ef61a90, len=2048
 start=e00000017ef622a8, len=2048
 start=e00000017ef62ac0, len=2048
 start=e00000017ef632d8, len=2048
 start=e00000017ef65a50, len=2048
 start=e00000017ef65a50, len=2048
 start=e00000017ef65a50, len=2048
 start=e00000017ef66a80, len=2048

That virtually rules out a bad memory chip.

But the corrupted byte is always at offset 0x620 into the allocation:
 Slab corruption: start=e00000017ef65a50, len=2048
 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
 Slab corruption: start=e000000101ee19d0, len=2048
 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
 Slab corruption: start=e000000101ee3218, len=2048
 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
 Slab corruption: start=e00000017ef66a80, len=2048
 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
--
 Slab corruption: start=e000000101ee11b8, len=2048
 620: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

0x620 (1568) is behind the end of the actual eth frame. Who could modify 
that?

Darren, which nic do you use? Could you try what happens if you reduce 
the MTU?

--
    Manfred
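
The pattern described above (one wrong byte at a fixed offset in an otherwise poisoned object) can be looked for with a scan like the following. This is a user-space sketch, not the kernel's own checker; the names are hypothetical, and the fill value 0x6b and end marker 0xa5 are taken from the dumps quoted in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE 0x6b  /* value expected in every freed byte */
#define END_BYTE  0xa5  /* value expected in the last byte of the object */

/* Put a buffer into the state a freed, poisoned object should have. */
void fill_poison(uint8_t *obj, size_t len)
{
    if (len == 0)
        return;
    memset(obj, FILL_BYTE, len - 1);
    obj[len - 1] = END_BYTE;
}

/* Offset of the first byte that breaks the pattern, or -1 if intact. */
long first_corrupt_offset(const uint8_t *obj, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t expected = (i == len - 1) ? END_BYTE : FILL_BYTE;

        if (obj[i] != expected)
            return (long)i;
    }
    return -1;
}
```

Running the scan over a 2048-byte buffer in which only byte 0x620 has been changed to 0x6a reports 0x620, matching the hexdump lines above.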



* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  6:21       ` Manfred Spraul
@ 2004-02-25  7:03         ` David S. Miller
  2004-02-25  7:22           ` Andrew Morton
  2004-02-25  8:55         ` Darren Williams
  1 sibling, 1 reply; 17+ messages in thread
From: David S. Miller @ 2004-02-25  7:03 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: dsw, linux-kernel, akpm

On Wed, 25 Feb 2004 07:21:56 +0100
Manfred Spraul <manfred@colorfullife.com> wrote:

> 0x620 (1568) is behind the end of the actual eth frame. Who could modify 
> that?

At the end of the SKB data area is where we keep struct skb_shared_info, something
is messing with the SKB state after a free it appears.

And since it's turning the debugging value 0x6b to 0x6a it must be the
"atomic_t dataref;" that is being mucked with.


* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  7:03         ` David S. Miller
@ 2004-02-25  7:22           ` Andrew Morton
  2004-02-25  8:24             ` Darren Williams
  2004-02-25 17:18             ` Manfred Spraul
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2004-02-25  7:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: manfred, dsw, linux-kernel

"David S. Miller" <davem@redhat.com> wrote:
>
> On Wed, 25 Feb 2004 07:21:56 +0100
> Manfred Spraul <manfred@colorfullife.com> wrote:
> 
> > 0x620 (1568) is behind the end of the actual eth frame. Who could modify 
> > that?
> 
> At the end of the SKB data area is where we keep struct skb_shared_info, something
> is messing with the SKB state after a free it appears.
> 
> And since it's turning the debugging value 0x6b to 0x6a it must be the
> "atomic_t dataref;" that is being mucked with.

Ah-hah.

This should find it:


 25-akpm/include/linux/skbuff.h |    1 +
 25-akpm/net/core/dev.c         |    1 +
 25-akpm/net/core/skbuff.c      |    6 ++++++
 3 files changed, 8 insertions(+)

diff -puN include/linux/skbuff.h~dataref-debug include/linux/skbuff.h
--- 25/include/linux/skbuff.h~dataref-debug	Tue Feb 24 23:18:56 2004
+++ 25-akpm/include/linux/skbuff.h	Tue Feb 24 23:19:20 2004
@@ -140,6 +140,7 @@ struct skb_frag_struct {
  */
 struct skb_shared_info {
 	atomic_t	dataref;
+	int		debug;
 	unsigned int	nr_frags;
 	unsigned short	tso_size;
 	unsigned short	tso_segs;
diff -puN net/core/dev.c~dataref-debug net/core/dev.c
--- 25/net/core/dev.c~dataref-debug	Tue Feb 24 23:18:56 2004
+++ 25-akpm/net/core/dev.c	Tue Feb 24 23:19:34 2004
@@ -1272,6 +1272,7 @@ int __skb_linearize(struct sk_buff *skb,
 	/* Set up shinfo */
 	ninfo = (struct skb_shared_info*)(data + size);
 	atomic_set(&ninfo->dataref, 1);
+	ninfo->debug = 0;
 	ninfo->tso_size = skb_shinfo(skb)->tso_size;
 	ninfo->tso_segs = skb_shinfo(skb)->tso_segs;
 	ninfo->nr_frags = 0;
diff -puN net/core/skbuff.c~dataref-debug net/core/skbuff.c
--- 25/net/core/skbuff.c~dataref-debug	Tue Feb 24 23:18:56 2004
+++ 25-akpm/net/core/skbuff.c	Tue Feb 24 23:21:36 2004
@@ -148,6 +148,7 @@ struct sk_buff *alloc_skb(unsigned int s
 	skb->end  = data + size;
 
 	atomic_set(&(skb_shinfo(skb)->dataref), 1);
+	skb_shinfo(skb)->debug  = 0;
 	skb_shinfo(skb)->nr_frags  = 0;
 	skb_shinfo(skb)->tso_size = 0;
 	skb_shinfo(skb)->tso_segs = 0;
@@ -184,6 +185,9 @@ static void skb_clone_fraglist(struct sk
 
 void skb_release_data(struct sk_buff *skb)
 {
+	if (!skb->cloned)
+		WARN_ON(skb_shinfo(skb)->debug != 0);
+
 	if (!skb->cloned ||
 	    atomic_dec_and_test(&(skb_shinfo(skb)->dataref))) {
 		if (skb_shinfo(skb)->nr_frags) {
@@ -320,6 +324,7 @@ struct sk_buff *skb_clone(struct sk_buff
 	C(tail);
 	C(end);
 
+	WARN_ON(skb_shinfo(skb)->debug != 0);
 	atomic_inc(&(skb_shinfo(skb)->dataref));
 	skb->cloned = 1;
 
@@ -526,6 +531,7 @@ int pskb_expand_head(struct sk_buff *skb
 	skb->h.raw   += off;
 	skb->nh.raw  += off;
 	skb->cloned   = 0;
+	skb_shinfo(skb)->debug = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;
 

_
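
The idea behind the patch, as far as it can be read from the diff, is that a field zeroed whenever the shared info is set up should still be zero at release or clone time, whereas memory that has been freed and poisoned holds 0x6b bytes there. Below is a user-space sketch of that check, with hypothetical names and a plain int standing in for atomic_t.

```c
#include <string.h>

/* Hypothetical stand-in for the patched structure, not the kernel type. */
struct shinfo_sketch {
    int dataref;    /* reference count (atomic_t in the kernel) */
    int debug;      /* canary: zero while the structure is live */
};

/* Mirrors the initialisation sites touched by the patch. */
void shinfo_sketch_init(struct shinfo_sketch *s)
{
    s->dataref = 1;
    s->debug = 0;
}

/* Mirrors what slab debugging does to a freed object: fill with 0x6b. */
void shinfo_sketch_free(struct shinfo_sketch *s)
{
    memset(s, 0x6b, sizeof(*s));
}

/* Non-zero when the canary indicates a freed (poisoned) structure. */
int shinfo_sketch_is_stale(const struct shinfo_sketch *s)
{
    return s->debug != 0;
}
```

A caller that still holds a pointer after shinfo_sketch_free() sees shinfo_sketch_is_stale() return non-zero, which corresponds to the point where WARN_ON() would fire in the patched kernel and print a backtrace identifying the late user.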



* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  7:22           ` Andrew Morton
@ 2004-02-25  8:24             ` Darren Williams
  2004-02-25 17:18             ` Manfred Spraul
  1 sibling, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25  8:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davem, manfred, LKML

OK the patch is applied and the results are at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
*-latest-skb

Darren

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------


* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  7:22           ` Andrew Morton
  2004-02-25  8:24             ` Darren Williams
@ 2004-02-25 17:18             ` Manfred Spraul
  2004-02-26  0:30               ` Darren Williams
  1 sibling, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2004-02-25 17:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David S. Miller, dsw, linux-kernel

Andrew Morton wrote:

>Ah-hah.
>
>This should find it:
>  
>
I think we should first check that skb->dataref is really the problem: 
what about adding an unused field before the dataref? Something like

struct skb_shared_info {
+	int		unused;
 	atomic_t	dataref;
	int		debug;

If the dataref decrease causes the problem, then the affected offset should change to 0x628.

--
	Manfred
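
How far the counter moves when a field is placed in front of it can be checked with offsetof. This is a sketch that assumes atomic_t is a structure wrapping a single int; the type and struct names are hypothetical, and only the leading members are shown.

```c
#include <stddef.h>

/* Assumed shape of atomic_t: a struct wrapping one int. */
typedef struct { int counter; } atomic_sketch_t;

/* Leading members with the debug patch applied. */
struct shinfo_before {
    atomic_sketch_t dataref;
    int             debug;
};

/* The same, with the proposed extra field in front. */
struct shinfo_after {
    int             unused;
    atomic_sketch_t dataref;
    int             debug;
};

/* Number of bytes by which dataref moves within the structure. */
size_t dataref_shift(void)
{
    return offsetof(struct shinfo_after, dataref)
         - offsetof(struct shinfo_before, dataref);
}
```

Under these assumptions dataref_shift() equals sizeof(int). If a stale decrement of dataref is the cause, the bad byte in the slab report should move by the same amount; if the offset stays put, something else is writing there.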




* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25 17:18             ` Manfred Spraul
@ 2004-02-26  0:30               ` Darren Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-26  0:30 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: davem, akpm, LKML, anton

Hi Manfred

With the introduction of the unused int the slab corruption
errors are not present.

Darren

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------


* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  6:21       ` Manfred Spraul
  2004-02-25  7:03         ` David S. Miller
@ 2004-02-25  8:55         ` Darren Williams
  1 sibling, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25  8:55 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: LKML

Hi Manfred

The NIC is an onboard Intel e100pro card.
Interestingly, the allocation that has caught
my attention is this:

static inline struct RxFD *speedo_rx_alloc(struct net_device *dev, int entry)
{
	struct speedo_private *sp = (struct speedo_private *)dev->priv;
	struct RxFD *rxf;
	struct sk_buff *skb;
	/* Get a fresh skbuff to replace the consumed one. */
	skb = dev_alloc_skb(PKT_BUF_SZ + sizeof(struct RxFD));

This allocation on ia64 is 1500 bytes, and as I have explained in
later e-mails, when I use the Intel e100 driver the slab corruption
goes away.

So I am guessing that it may be in the eepro100 driver.

Darren



--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------

* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-25  0:58     ` Darren Williams
  2004-02-25  1:05       ` Anton Blanchard
  2004-02-25  6:21       ` Manfred Spraul
@ 2004-02-25 10:18       ` Peter Chubb
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Chubb @ 2004-02-25 10:18 UTC (permalink / raw)
  To: Darren Williams; +Cc: Manfred Spraul, LKML, akpm

>>>>> "Darren" == Darren Williams <dsw@gelato.unsw.edu.au> writes:

Darren> Hi Manfred I have updated to the latest bk and new output can
Darren> be found at:
Darren> http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
Darren> kern-log-bk

Interesting.  Offset 0x620 is well off the end of the struct skb, which
is only 256 bytes big (I think), yet the object that's having problems
is a 2k object.

Darren> I took a look at alloc_skb(..) and there is a reference to an
Darren> atomic_t token with this being the most suspect

150> atomic_set(&(skb_shinfo(skb)->dataref), 1);

No, the skb_shinfo is off in kmalloced space, not part of the slab.
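
For reference, the 0x620 offset can be recomputed from any of the
"Slab corruption" log lines by subtracting start from problemat. A
minimal sketch in C; the helper names corruption_offset and object_len
are made up for illustration and are not kernel functions:

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of the first modified byte inside
 * a slab object, from the two addresses printed in the log line. */
static inline uint64_t corruption_offset(uint64_t start, uint64_t problemat)
{
	return problemat - start;
}

/* Object length implied by the start/expend pair. In the log lines
 * quoted in this thread expend is the last byte of the object, so it
 * is treated as inclusive here. */
static inline uint64_t object_len(uint64_t start, uint64_t expend)
{
	return expend - start + 1;
}
```

With start=e00000017e84ea00, expend=e00000017e84f1ff and
problemat=e00000017e84f020 from the first log line in this thread, this
gives an offset of 0x620 (1568) inside a 2048-byte object.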

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*

* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24  6:56 ` Andrew Morton
  2004-02-24  8:45   ` Darren Williams
  2004-02-24 17:40   ` Manfred Spraul
@ 2004-02-25  6:17   ` Darren Williams
  2 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-25  6:17 UTC (permalink / raw)
  To: Andrew Morton, manfred; +Cc: saw, LKML

Looking at the stack trace when transferring data to the Itanium box, the eepro100 driver
seems to be producing the slab errors.

To check this I swapped over to the Intel e100 driver and the slab errors have ceased.

A quick look at eepro100.c shows that it takes a lock in
speedo_interrupt(..)

Then the call graph looks something like this:

speedo_interrupt(..)
        |->speedo_rx(..)
            |->speedo_refill_rx_buffers(..)
                |->speedo_rx_alloc(..)
                   |->dev_alloc_skb(..)
                        |->alloc_skb(..)

Though I do not think the lock is held when alloc_skb(..) is called?

Andrey, you would know more about what is going on in the eepro100 driver.
Any comments?

I have posted the latest logs at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/

The files *-xffUL are the equivalent debug files.

Included here is also some disassembled code that may help.

I found the back trace from the new slab error code of limited use for
debugging, since earlier we were also able to see the trace from the
network code.

Darren
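
Andrew's remark quoted below (atomic_dec_and_test() or spin_lock() on
memory that was already freed) fits the byte pattern in the dumps:
decrementing a 32-bit word of 0x6b poison by one changes only its
lowest byte to 0x6a. A minimal sketch, assuming a little-endian machine
and a 32-bit counter; the function name is made up, and a plain
decrement stands in for the kernel's atomic operation:

```c
#include <stdint.h>
#include <string.h>

#define POISON_FREE 0x6b	/* fill byte seen in the dumps in this thread */

/* Fill a "freed" object with the poison byte, then apply one stale
 * 32-bit decrement at byte offset off, as a late atomic_dec on a
 * freed counter would. Returns the lowest-addressed byte of that word
 * afterwards. The caller must ensure off + 4 <= len.
 * Assumption: little-endian byte order. */
static inline uint8_t byte_after_stale_dec(uint8_t *obj, size_t len, size_t off)
{
	uint32_t counter;

	memset(obj, POISON_FREE, len);
	memcpy(&counter, obj + off, sizeof(counter));	/* 0x6b6b6b6b */
	counter--;					/* 0x6b6b6b6a */
	memcpy(obj + off, &counter, sizeof(counter));
	return obj[off];
}
```

Called with off = 0x620 on a 2048-byte buffer, this returns 0x6a and
leaves the neighbouring bytes at 0x6b, the same "6a 6b 6b 6b" pattern
as in the dumps. It only shows that the pattern is consistent with a
stale decrement, not that this is what happened.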

On Mon, 23 Feb 2004, Andrew Morton wrote:

> Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> >  From your logs:
> > 
> > >Feb 23 14:54:24 calypso kernel: Slab corruption: start=e00000017e84ea00, expend=e00000017e84f1ff, problemat=e00000017e84f020
> > >Feb 23 14:54:24 calypso kernel: Last user: [<a0000001003c9f30>](kfree_skbmem+0x30/0x80)
> > >Feb 23 14:54:24 calypso kernel: Data: ************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************!
> **!
> > ***************************************
> > >Feb 23 14:54:28 calypso kernel: **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************6A *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************!
> **!
> > ***************************************
> > >Feb 23 14:54:28 calypso kernel: ************************************************************A5 
> > >  
> > >
> > "6a" instead of 0x6b. One bit is wrong, this is often an indication of a 
> > hardware problem. Do you use ECC memory and is ECC enabled in the BIOS?
> 
> Actually, it's often caused by someone doing atomic_dec_and_test() against
> something which was already freed.  Or spin_lock().  One would need to work
> out what field is at that offset.  If it is an atomic_t or a spinlock_t,
> there you are.
> 
> 
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------

* [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2GB
@ 2004-02-24  0:22 Darren Williams
  2004-02-24  1:14 ` [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Darren Williams
  0 siblings, 1 reply; 17+ messages in thread
From: Darren Williams @ 2004-02-24  0:22 UTC (permalink / raw)
  To: LKML; +Cc: Ia64 Linux


On Ia64 Itanium 1 machines with more than 2.5GB of RAM the following error is triggered.
 
slab error in check_poison_obj(): cache `size-16384': object was modified after freeing

The machine that triggered the error above is an

i2000 HP workstation
4gb RAM
1gb SWAP

An identical machine with 2.5GB ram produces:

slab error in check_poison_obj(): cache `size-2048': object was modified after freeing

if the amount of RAM is reduced to 2GB or less then the errors do not appear.

Kernel logs and configs can be found at:
http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
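
The error message comes from the slab debugging code, which fills freed
objects with a poison pattern and checks that pattern again before the
object is handed out. A simplified sketch of that kind of check, not the
kernel's actual check_poison_obj(); it assumes the 0x6b fill byte and
the 0xa5 end marker that appear in the dumps in this thread, and the
function name is made up:

```c
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE 0x6b	/* fill byte for freed objects */
#define POISON_END  0xa5	/* marker stored in the last byte */

/* Return the offset of the first byte that no longer matches the
 * poison pattern, or -1 if the object is intact. */
static inline long first_poison_mismatch(const uint8_t *obj, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		uint8_t expect = (i == len - 1) ? POISON_END : POISON_FREE;

		if (obj[i] != expect)
			return (long)i;
	}
	return -1;
}
```

Any write to the object between free and reuse, even a one-bit change,
makes such a check report "object was modified after freeing".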


--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------

* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24  0:22 [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2GB Darren Williams
@ 2004-02-24  1:14 ` Darren Williams
  2004-02-26  1:09   ` Darren Williams
  0 siblings, 1 reply; 17+ messages in thread
From: Darren Williams @ 2004-02-24  1:14 UTC (permalink / raw)
  To: LKML; +Cc: Ia64 Linux

Hi Darren

On Tue, 24 Feb 2004, Darren Williams wrote:

> 
> On Ia64 Itanium 1 machines with more than 2.5GB of RAM the following error is triggered.
>  
> slab error in check_poison_obj(): cache `size-16384': object was modified after freeing
> 
> The machine that triggered the error above is an
> 
> i2000 HP workstation
> 4gb RAM
> 1gb SWAP
> 
> An identical machine with 3GB ram produces:
                            ^^^
 
> slab error in check_poison_obj(): cache `size-2048': object was modified after freeing
> 
> if the amount of RAM is reduced to 2.5GB or less then the errors do not appear.
                                     ^^^^^ 

> Kernel logs and configs can be found at:
> http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
> 
> 
> --------------------------------------------------
> Darren Williams <dsw AT gelato.unsw.edu.au>
> Gelato@UNSW <www.gelato.unsw.edu.au>
> --------------------------------------------------
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------

* Re: [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction)
  2004-02-24  1:14 ` [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Darren Williams
@ 2004-02-26  1:09   ` Darren Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Darren Williams @ 2004-02-26  1:09 UTC (permalink / raw)
  To: LKML


Where we are up to:

The machine below starts producing slab corruption errors when the amount of RAM is
3GB or more.

The hardware has been checked and it does not appear to be a hardware error. Additional
hardware that was producing different errors was removed and the slab corruption
persisted.

The driver that seems to be triggering the error is the eepro100, and only when receiving
data; transmitting data produces no errors.
The test was:
 send kern-image A -> B, no errors.
 send kern-image B -> A, errors.

A being the Itanium.

Using the alternative Intel e100 driver removes the slab corruption errors.

A small modification to 'struct skb_shared_info' that places an int before the 
'atomic_t dataref' field removes the slab corruption errors.
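
What that modification does to the layout can be sketched with stand-in
structures. These are not the kernel's definitions: the real
struct skb_shared_info has more members, atomic_t is replaced here by a
plain wrapper, dataref is assumed to be the first member before the
change, and the names shinfo_orig, shinfo_padded and pad are made up:

```c
#include <stddef.h>

typedef struct { volatile int counter; } atomic_t;	/* stand-in type */

struct shinfo_orig {
	atomic_t dataref;	/* assumed first member: offset 0 */
	unsigned int nr_frags;
};

struct shinfo_padded {
	int pad;		/* hypothetical extra int */
	atomic_t dataref;	/* now sizeof(int) bytes further in */
	unsigned int nr_frags;
};

/* Number of bytes by which the extra int moves dataref. */
static inline size_t dataref_shift(void)
{
	return offsetof(struct shinfo_padded, dataref) -
	       offsetof(struct shinfo_orig, dataref);
}
```

This only shows that the counter ends up sizeof(int) bytes later in the
structure; it does not by itself explain why the errors stop.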


Darren

On Tue, 24 Feb 2004, Darren Williams wrote:

> Hi Darren
> 
> On Tue, 24 Feb 2004, Darren Williams wrote:
> 
> > 
> > On Ia64 Itanium 1 machines with more than 2.5GB of RAM the following error is triggered.
> >  
> > slab error in check_poison_obj(): cache `size-16384': object was modified after freeing
> > 
> > The machine that triggered the error above is an
> > 
> > i2000 HP workstation
> > 4gb RAM
> > 1gb SWAP
> > 
> > An identical machine with 3GB ram produces:
>                             ^^^
>  
> > slab error in check_poison_obj(): cache `size-2048': object was modified after freeing
> > 
> > if the amount of RAM is reduced to 2.5GB or less then the errors do not appear.
>                                      ^^^^^ 
> 
> > Kernel logs and configs can be found at:
> > http://quasar.cse.unsw.edu.au/~dsw/public-files/lemon-debug/
> > 
> > 
> > --------------------------------------------------
> > Darren Williams <dsw AT gelato.unsw.edu.au>
> > Gelato@UNSW <www.gelato.unsw.edu.au>
> > --------------------------------------------------
> --------------------------------------------------
> Darren Williams <dsw AT gelato.unsw.edu.au>
> Gelato@UNSW <www.gelato.unsw.edu.au>
> --------------------------------------------------
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <www.gelato.unsw.edu.au>
--------------------------------------------------

end of thread

Thread overview: 17+ messages
2004-02-24  6:38 [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Manfred Spraul
2004-02-24  6:56 ` Andrew Morton
2004-02-24  8:45   ` Darren Williams
2004-02-24 17:40   ` Manfred Spraul
2004-02-25  0:58     ` Darren Williams
2004-02-25  1:05       ` Anton Blanchard
2004-02-25  6:21       ` Manfred Spraul
2004-02-25  7:03         ` David S. Miller
2004-02-25  7:22           ` Andrew Morton
2004-02-25  8:24             ` Darren Williams
2004-02-25 17:18             ` Manfred Spraul
2004-02-26  0:30               ` Darren Williams
2004-02-25  8:55         ` Darren Williams
2004-02-25 10:18       ` Peter Chubb
2004-02-25  6:17   ` Darren Williams
  -- strict thread matches above, loose matches on Subject: below --
2004-02-24  0:22 [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2GB Darren Williams
2004-02-24  1:14 ` [BUG] 2.6.3 Slab corruption: errors are triggered when memory exceeds 2.5GB (correction) Darren Williams
2004-02-26  1:09   ` Darren Williams
