netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* NetXen driver causing slab corruption in -RT kernels
@ 2007-09-18 18:17 Vernon Mauery
  2007-09-19  6:43 ` Dhananjay Phadke
  0 siblings, 1 reply; 2+ messages in thread
From: Vernon Mauery @ 2007-09-18 18:17 UTC (permalink / raw)
  To: netdev; +Cc: LKML, Dhananjay Phadke

In doing some stress testing of the NetXen driver, I found that my machine was 
dying in all sorts of weird ways.  I saw several different crashes, BUG 
messages in the TCP stack and some assert messages in the TCP stack as well.  
I really didn't think that there could be six different bugs all at once in 
the TCP/IP stack, so I started looking at possible memory corruption.

I first saw this on 2.6.16-rt22 with a backported netxen driver from 2.6.22.  
I figured I should try the latest kernel, so I tried it on 2.6.23-rc6-git7 
but could not trigger the slab corruption messages with CONFIG_DEBUG_SLAB, so 
I figured the race must only exist in the -RT kernels.  Next I tried 
2.6.23-rc6-git7-rt1 (I applied patch-2.6.23-rc4-rt1 patch to 2.6.23-rc6-git7 
and fixed the 5 failing hunks).  After an hour or so, lots of slab corruption 
messages showed up:

Slab corruption: size-2048 start=f40c4670, len=2048
Slab corruption: size-2048 start=f313cf48, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [<c0166be4>](kfree+0x80/0x95)
010: 6b 6b 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
020: 45 00 05 dc 92 ab 40 00 40 11 8a 5b 0a 02 02 03
030: 0a 02 02 04 80 0c 80 0d 05 c8 dc 39 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f313c730, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
Next obj: start=f313d760, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
Slab corruption: size-2048 start=f395a6f0, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [<c0166be4>](kfree+0x80/0x95)
010: 6b 6b 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
020: 45 00 05 dc 92 ac 40 00 40 11 8a 5a 0a 02 02 03
030: 0a 02 02 04 80 0c 80 0d 05 c8 dc 39 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Next obj: start=f395af08, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [<c0166be4>](kfree+0x80/0x95)
010: 6b 6b 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
020: 45 00 05 dc 92 aa 40 00 40 11 8a 5c 0a 02 02 03
030: 0a 02 02 04 80 0c 80 0d 05 c8 dc 39 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Next obj: start=f40c4e88, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00

The stress test that I am running is basically a mixed bag of stuff I threw 
together.  It runs eight concurrent netperf TCP streams and two concurrent 
UDP streams in both directions, (and on both 10GbE interfaces), ping -f in 
both directions, some more disk/cpu loads in the background and a little bit 
of NFS traffic thrown in for good measure.

--Vernon

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: NetXen driver causing slab corruption in -RT kernels
  2007-09-18 18:17 NetXen driver causing slab corruption in -RT kernels Vernon Mauery
@ 2007-09-19  6:43 ` Dhananjay Phadke
  0 siblings, 0 replies; 2+ messages in thread
From: Dhananjay Phadke @ 2007-09-19  6:43 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: netdev, LKML

That "00 0e 1e ..." is netxen's mac address, so sounds like the NIC is
dumping a frame in the skb already freed (and poisoned) by the stack.
I suppose -RT kernels preempt the softirq, giving a chance for this race.

The netxen driver doesn't seem to clear the mapped address of the skb in
rx descriptor, instead it replaces it with mapped address of newly
allocated skb. Also, the rx ring is replenished in bursts (at the end of
poll routine), this can help the race if preempted in between.

I am currently reworking the rx handling, hopefully it will fix this.

Thanks for reporting in detail.

-Dhananjay


Vernon Mauery wrote:
> In doing some stress testing of the NetXen driver, I found that my machine was 
> dying in all sorts of weird ways.  I saw several different crashes, BUG 
> messages in the TCP stack and some assert messages in the TCP stack as well.  
> I really didn't think that there could be six different bugs all at once in 
> the TCP/IP stack, so I started looking at possible memory corruption.
> 
> I first saw this on 2.6.16-rt22 with a backported netxen driver from 2.6.22.  
> I figured I should try the latest kernel, so I tried it on 2.6.23-rc6-git7 
> but could not trigger the slab corruption messages with CONFIG_DEBUG_SLAB, so 
> I figured the race must only exist in the -RT kernels.  Next I tried 
> 2.6.23-rc6-git7-rt1 (I applied patch-2.6.23-rc4-rt1 patch to 2.6.23-rc6-git7 
> and fixed the 5 failing hunks).  After an hour or so, lots of slab corruption 
> messages showed up:
> 
> Slab corruption: size-2048 start=f40c4670, len=2048
> Slab corruption: size-2048 start=f313cf48, len=2048
> Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> Last user: [<c0166be4>](kfree+0x80/0x95)
> 010: 6b 6b 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
> 020: 45 00 05 dc 92 ab 40 00 40 11 8a 5b 0a 02 02 03
> 030: 0a 02 02 04 80 0c 80 0d 05 c8 dc 39 00 00 00 00
> 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Prev obj: start=f313c730, len=2048
> Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
> 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 010: 5a 5a 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
> Next obj: start=f313d760, len=2048
> Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
> 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> Slab corruption: size-2048 start=f395a6f0, len=2048
> Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> Last user: [<c0166be4>](kfree+0x80/0x95)
> 010: 6b 6b 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
> 020: 45 00 05 dc 92 ac 40 00 40 11 8a 5a 0a 02 02 03
> 030: 0a 02 02 04 80 0c 80 0d 05 c8 dc 39 00 00 00 00
> 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Next obj: start=f395af08, len=2048
> Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
> 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> Last user: [<c0166be4>](kfree+0x80/0x95)
> 010: 6b 6b 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
> 020: 45 00 05 dc 92 aa 40 00 40 11 8a 5c 0a 02 02 03
> 030: 0a 02 02 04 80 0c 80 0d 05 c8 dc 39 00 00 00 00
> 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Next obj: start=f40c4e88, len=2048
> Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> Last user: [<f8f06186>](netxen_post_rx_buffers_nodb+0x62/0x1f0 [netxen_nic])
> 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 010: 5a 5a 00 0e 1e 00 16 13 00 0e 1e 00 19 3d 08 00
> 
> The stress test that I am running is basically a mixed bag of stuff I threw 
> together.  It runs eight concurrent netperf TCP streams and two concurrent 
> UDP streams in both directions, (and on both 10GbE interfaces), ping -f in 
> both directions, some more disk/cpu loads in the background and a little bit 
> of NFS traffic thrown in for good measure.
> 
> --Vernon

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2007-09-19  6:43 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-18 18:17 NetXen driver causing slab corruption in -RT kernels Vernon Mauery
2007-09-19  6:43 ` Dhananjay Phadke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).