From mboxrd@z Thu Jan 1 00:00:00 1970
From: Greg Banks
Subject: Re: [PATCH] fix BUG in tg3_tx
Date: Mon, 24 May 2004 18:04:31 +1000
Sender: netdev-bounce@oss.sgi.com
Message-ID: <20040524080431.GD27177@sgi.com>
References: <20040524072657.GC27177@sgi.com> <20040524004045.58b3eb44.davem@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: netdev@oss.sgi.com
Return-path:
To: "David S. Miller"
Content-Disposition: inline
In-Reply-To: <20040524004045.58b3eb44.davem@redhat.com>
Errors-to: netdev-bounce@oss.sgi.com
List-Id: netdev.vger.kernel.org

On Mon, May 24, 2004 at 12:40:45AM -0700, David S. Miller wrote:
> On Mon, 24 May 2004 17:26:58 +1000
> Greg Banks wrote:
>
> > The tg3 transmit code assumes that tg3_tx() will never have to clean
> > up part of an skb queued for transmit. This assumption is wrong;
>
> Greg, perhaps my reading of the tg3 chip docs is different
> from yours. The hardware is NEVER supposed to do this.

I'd like to know where you read that, because neither I nor any of
the other SGI engineers who have read the Broadcom docs can find any
such guarantee. The consensus here is that there is no guarantee
about where the transmit ring consumer index is when the card DMAs
a status block update.

We've seen that BUG() trip many times, on both 2.4 and 2.6. SGI ProPack
for Linux has been shipping for a year now with a workaround for this
bug (the BUG() is changed to a break).

> Or is there an errata in some chip versions?

We've seen this on both 5701 and 5704 hardware. The SGI cards ship
with slightly different firmware from stock Broadcom cards; I don't
know if that's a factor. It might also be something to do with the
Altix IO architecture. I don't know why the card chooses to do this,
but it most certainly does.

The IRIX driver guys tell me this same behaviour happens on the
Origin hardware, and the IRIX driver had an equivalent fix applied
about 18 months ago.
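To make the failure mode concrete, here is a minimal user-space sketch
of a transmit ring reclaim loop that tolerates a consumer index landing
in the middle of a multi-descriptor packet. It is NOT the tg3 code and
NOT the patch under discussion; the names (tx_ring, ring_queue,
ring_reclaim, RING_SIZE) and the 8-entry ring are hypothetical, chosen
only to show the idea of stopping at a partially completed packet
instead of asserting.

```c
#include <assert.h>

#define RING_SIZE 8u                 /* hypothetical; must be a power of two */
#define RING_MASK (RING_SIZE - 1)

/* One software slot per transmit descriptor. Only the head slot of a
 * packet records how many descriptors the whole packet occupies. */
struct tx_slot {
    unsigned int pkt_descs;          /* > 0 on a packet's head slot, else 0 */
};

struct tx_ring {
    struct tx_slot slot[RING_SIZE];
    unsigned int prod;               /* next slot the driver will fill */
    unsigned int cons;               /* next slot the driver will reclaim */
};

/* Queue a packet spanning ndescs descriptors.
 * Returns 0 on success, -1 if the ring lacks room (one slot stays free). */
static int ring_queue(struct tx_ring *r, unsigned int ndescs)
{
    unsigned int used = (r->prod - r->cons) & RING_MASK;
    unsigned int i;

    assert(ndescs > 0);
    if (used + ndescs >= RING_SIZE)
        return -1;

    r->slot[r->prod].pkt_descs = ndescs;
    for (i = 1; i < ndescs; i++)
        r->slot[(r->prod + i) & RING_MASK].pkt_descs = 0;
    r->prod = (r->prod + ndescs) & RING_MASK;
    return 0;
}

/* Reclaim every fully completed packet up to the reported consumer
 * index hw_cons. Returns the number of packets reclaimed. */
static unsigned int ring_reclaim(struct tx_ring *r, unsigned int hw_cons)
{
    unsigned int freed = 0;

    while (r->cons != hw_cons) {
        unsigned int need = r->slot[r->cons].pkt_descs;
        unsigned int done = (hw_cons - r->cons) & RING_MASK;

        assert(need > 0);            /* cons must sit on a packet head */
        if (done < need)
            break;                   /* index is mid-packet: retry later */

        r->slot[r->cons].pkt_descs = 0;
        r->cons = (r->cons + need) & RING_MASK;
        freed++;
    }
    return freed;
}
```

With a 3-descriptor packet followed by a 2-descriptor packet queued
from slot 0, a reported index of 4 reclaims only the first packet and
leaves the software consumer index at 3; once the reported index
reaches 5, the second packet is reclaimed on the next pass.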

> I've never triggered that BUG() assertion on any of my
> hardware, ever.

Ok, here's one of several mostly identical stack traces reported by
various people inside SGI on 2.6 kernels. If you want to wait for a
day or so I can probably make a 2.4 kernel do this also (I did that
during testing a couple of days ago but didn't save the stack trace,
doh!)

[root@budgie root]# swapper[0]: bugcheck! 0 [1]

Pid: 0, CPU 0, comm: swapper
psr : 0000101009022018 ifs : 8000000000000d1e ip : [] Not tainted
ip is at tg3_tx+0x5a0/0x5c0 [tg3]
unat: 0000000000000000 pfs : 0000000000000d1e rsc : 0000000000000003
rnat: 800000025da66955 bsps: a0000001000fdec0 pr : 80000000ff7669a5
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a000000200165980 b6 : a000000100003320 b7 : a0000001000cc200
f6 : 1003e0fc0fc0fc0fc0fc1 f7 : 0ffdaa200000000000000
f8 : 1003e0000000000000240 f9 : 1003e0000000000002490
f10 : 1003e000000000ea00000 f11 : 1003e00000000367b7ad0
r1 : a0000001009ec7f0 r2 : 0000000000000000 r3 : 0000000000004000
r8 : 0000000000000026 r9 : 0000000000000002 r10 : 0000000000000001
r11 : 0000000000000004 r12 : e0000030146a3c50 r13 : e00000301469c000
r14 : 0000000000004000 r15 : a000000100734fb0 r16 : e00000b004a147a8
r17 : e00000b07ba88060 r18 : e000003007ba0000 r19 : 0000000000180000
r20 : 0000000000000014 r21 : 0000000000080000 r22 : 0000000000100000
r23 : e00000b004a150e4 r24 : e00000b004a150d8 r25 : e0000030146a3bf0
r26 : e00000b004a15810 r27 : 0000000000000074 r28 : 0000000000000074
r29 : e000003007ba002c r30 : e00000b07ba8802c r31 : 0000000000000002

Call Trace:
 [] show_stack+0x80/0xa0 sp=e0000030146a3820 bsp=e00000301469d2c8
 [] die+0x1b0/0x280 sp=e0000030146a39f0 bsp=e00000301469d2a0
 [] ia64_bad_break+0x340/0x480 sp=e0000030146a39f0 bsp=e00000301469d280
 [] ia64_leave_kernel+0x0/0x260 sp=e0000030146a3a80 bsp=e00000301469d280
 [] tg3_tx+0x5a0/0x5c0 [tg3] sp=e0000030146a3c50 bsp=e00000301469d188
 [] tg3_interrupt_main_work+0x150/0x280 [tg3] sp=e0000030146a3c50 bsp=e00000301469d158
 [] tg3_interrupt+0x100/0x1c0 [tg3] sp=e0000030146a3c50 bsp=e00000301469d110
 [] handle_IRQ_event+0xa0/0x120 sp=e0000030146a3c50 bsp=e00000301469d0c8
 [] do_IRQ+0x390/0x4a0 sp=e0000030146a3c50 bsp=e00000301469d078
 [] ia64_handle_irq+0xc0/0x1a0 sp=e0000030146a3c50 bsp=e00000301469d040
 [] ia64_leave_kernel+0x0/0x260 sp=e0000030146a3c50 bsp=e00000301469d040
 [] snidle+0xb0/0x180 sp=e0000030146a3e20 bsp=e00000301469d038
 [] cpu_idle+0x130/0x220 sp=e0000030146a3e20 bsp=e00000301469cfa8
 [] start_kernel+0x460/0x4e0 sp=e0000030146a3e20 bsp=e00000301469cf50
 [] _start+0x2c0/0x0 sp=e0000030146a3e30 bsp=e00000301469cf50
3 out of 4 cpus in kdb, waiting for the rest
1 cpu are not in kdb, their state is unknown

Entering kdb (current=0xe00000301469c000, pid 0) on processor 0 Oops:
due to oops @ 0xa000000200165980
psr: 0x0000101009022018 ifs: 0x8000000000000d1e ip: 0xa000000200165980
unat: 0x0000000000000000 pfs: 0x0000000000000d1e rsc: 0x0000000000000003
rnat: 0x800000025da66955 bsps: 0xa0000001000fdec0 pr: 0x80000000ff7669a5
ldrs: 0x0000000000000000 ccv: 0x0000000000000000 fpsr: 0x0009804c8a70033f
b0: 0xa000000200165980 b6: 0xa000000100003320 b7: 0xa0000001000cc200
r1: 0xa0000001009ec7f0 r2: 0x0000000000000000 r3: 0x0000000000004000
r8: 0x0000000000000026 r9: 0x0000000000000002 r10: 0x0000000000000001
r11: 0x0000000000000004 r12: 0xe0000030146a3c50 r13: 0xe00000301469c000
r14: 0x0000000000004000 r15: 0xa000000100734fb0 r16: 0xe00000b004a147a8
r17: 0xe00000b07ba88060 r18: 0xe000003007ba0000 r19: 0x0000000000180000
r20: 0x0000000000000014 r21: 0x0000000000080000 r22: 0x0000000000100000
r23: 0xe00000b004a150e4 r24: 0xe00000b004a150d8 r25: 0xe0000030146a3bf0
r26: 0xe00000b004a15810 r27: 0x0000000000000074 r28: 0x0000000000000074
r29: 0xe000003007ba002c r30: 0xe00000b07ba8802c r31: 0x0000000000000002
&regs = e0000030146a3a90
[0]kdb>

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.