forcedeth oops

* forcedeth oops
@ 2007-02-24  8:07 Chris Wedgwood
  0 siblings, 0 replies; 4+ messages in thread
From: Chris Wedgwood @ 2007-02-24  8:07 UTC (permalink / raw)
  To: netdev; +Cc: manfred, aabdulla

Using 2.6.21-rc1 (x86-64) I can get an oops in the forcedeth driver,
usually in under about 5s, with heavy network load (near line-rate GE;
simply using netcat and /dev/zero from one host to another suffices).

In nv_tx_done we have:

        if (flags & NV_TX_LASTPACKET) {
                if (flags & NV_TX_ERROR) {
                        if (flags & NV_TX_UNDERFLOW)
                                np->stats.tx_fifo_errors++;
                        if (flags & NV_TX_CARRIERLOST)
                                np->stats.tx_carrier_errors++;
                        np->stats.tx_errors++;
                } else {
                        np->stats.tx_packets++;
                        np->stats.tx_bytes += np->get_tx_ctx->skb->len;
                }
                dev_kfree_skb_any(np->get_tx_ctx->skb);
                np->get_tx_ctx->skb = NULL;
        }

Now, it seems that sometimes, for reasons I've not really looked into
as yet, np->get_tx_ctx->skb is NULL, so things go kaput (cr2 ends
up being 0x88, which I assume is the offset of len in skb).
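
To illustrate the cr2 reasoning: when a field is read through a NULL
struct pointer, the faulting address is simply the field's offset within
the struct. The sketch below uses a hypothetical `struct mock_skb`, which
is NOT the real `struct sk_buff` layout; its padding is chosen only so
that `len` lands at 0x88, matching the value reported above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct sk_buff. The real layout depends on
 * the kernel version and config; the padding here exists only to place
 * len at offset 0x88 for the sake of the illustration.
 */
struct mock_skb {
    unsigned char other_fields[0x88];
    unsigned int len;
};

/* Compile-time check that the mock really puts len at 0x88. */
static_assert(offsetof(struct mock_skb, len) == 0x88,
              "mock layout must place len at 0x88");

/*
 * Address the CPU would try to load for base->len, computed without
 * dereferencing anything. With base == 0 (a NULL skb) this is exactly
 * the offset of len, which is what would show up in cr2.
 */
uintptr_t len_field_address(uintptr_t base)
{
    return base + offsetof(struct mock_skb, len);
}
```

Calling `len_field_address(0)` gives the address a NULL `skb` would
fault on under this assumed layout.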

Now, if I do something along the lines of:

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index a363148..59027aa 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -1918,7 +1918,12 @@ static void nv_tx_done(struct net_device *dev)
 					np->stats.tx_errors++;
 				} else {
 					np->stats.tx_packets++;
-					np->stats.tx_bytes += np->get_tx_ctx->skb->len;
+					/* XXX for some reason under heavy load,
+					   np->get_tx_ctx->skb can be null */
+					if (likely(np->get_tx_ctx->skb))
+						np->stats.tx_bytes += np->get_tx_ctx->skb->len;
+					else
+						printk(KERN_ERR "XXX saw null skb\n");
 				}
 				dev_kfree_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;

the problem goes away completely, I can do hours of traffic, 100s of
GBs where it would break in a few seconds before.  However, I never
see the printk actually print anything...  so I'm a bit mystified.  I
disassembled the code in the original case and it seems perfectly
sane.
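
One thing worth noting about the patch above is that it loads
np->get_tx_ctx->skb twice, once for the check and once for the use. The
sketch below shows the alternative shape, reading the pointer a single
time into a local. It is only an illustration with hypothetical stub
types (`tx_skb_stub`, `tx_ctx_stub`, `tx_stats_stub`), not the driver's
code, and it makes no claim about what actually causes the NULL skb.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the driver structures. */
struct tx_skb_stub   { unsigned int len; };
struct tx_ctx_stub   { struct tx_skb_stub *skb; };
struct tx_stats_stub { unsigned long tx_packets; unsigned long tx_bytes; };

/*
 * Account one completed TX slot.
 * Returns 1 if an skb was present and counted, 0 if the slot was NULL.
 */
int account_tx_done(struct tx_ctx_stub *ctx, struct tx_stats_stub *stats)
{
    assert(ctx != NULL && stats != NULL);

    /*
     * Read the pointer exactly once. The volatile cast keeps the
     * compiler from reloading ctx->skb between the check and the use.
     */
    struct tx_skb_stub *skb = *(struct tx_skb_stub * volatile *)&ctx->skb;

    stats->tx_packets++;
    if (skb == NULL)
        return 0;               /* nothing to add to tx_bytes */

    stats->tx_bytes += skb->len;
    ctx->skb = NULL;            /* slot is now free */
    return 1;
}
```

With this shape the check and the dereference are guaranteed to see the
same pointer value, so if the NULL case is ever hit, the function returns
0 instead of faulting.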

Can anyone explain why I see ->skb == NULL and why the above change
seems to make that go away?  (Or perhaps why the printk isn't
working).


* forcedeth oops
@ 2008-01-22 11:54 Andrew Brooks
  2008-01-22 19:28 ` Jeff Garzik
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Brooks @ 2008-01-22 11:54 UTC (permalink / raw)
  To: netdev

Hello

I'm getting an oops in forcedeth whenever I shut down, details below.

I've tried kernel 2.6.16.59 and the latest forcedeth.c from nvidia.com
which is package-1.23 version-0.62 date-2007/04/27.

How can I download the latest forcedeth.c (including 2008-01-13 patches) ?
It's not in the latest snapshot linux-2.6.24-rc8.

Also, why is the version on nvidia.com not just older than the one in
the kernel, but apparently forked from it back in May 2006?  Has there
been independent development on each version?  They should be the same!

Here's the diff:
<  *    0.56: 22 Mar 2006: Additional ethtool and moduleparam support.
<  *    0.57: 14 May 2006: Moved mac address writes to nv_probe and nv_remove.
<  *    0.58: 20 May 2006: Optimized rx and tx data paths.
<  *    0.59: 31 May 2006: Added support for sideband management unit.
<  *    0.60: 31 May 2006: Added support for recoverable error.
<  *    0.61: 18 Jul 2006: Added support for suspend/resume.
<  *    0.62: 16 Jan 2007: Fixed statistics, mgmt communication, and low phy speed on S5.
---
>  *    0.56: 22 Mar 2006: Additional ethtool config and moduleparam support.
>  *    0.57: 14 May 2006: Mac address set in probe/remove and order corrections.
>  *    0.58: 30 Oct 2006: Added support for sideband management unit.
>  *    0.59: 30 Oct 2006: Added support for recoverable error.
>  *    0.60: 20 Jan 2007: Code optimizations for rings, rx & tx data paths, and stats.

Here are the details of the oops:
md: md0 switched to read-only mode.
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
f8ccdd55
*pde = 36c6a001
Oops: 0000 [#1]
SMP
Modules linked in: nvidia ... forcedeth ... sata_nv
CPU: 1
EIP:
EFLAGS: 00010286 (2.6.16.59 #1)
EIP is at nv_suspend+0x85/0x350 [forcedeth]
eax:
esi:
ds:
Process reboot
Stack:
Call Trace:
show_stack_log
show_registers
die
do_page_fault
error_code
nv_reboot_handler
notifier_call_chain
kernel_restart_prepare
kernel_restart
sys_reboot
sysenter_past_esp
Code: 8b 8c 3a 98 01 00 00 01 c8 8b ...
INIT: no more processes left in this runlevel
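
For reading the "EIP is at" line: the location is printed as
symbol+offset/size [module], i.e. the fault is 0x85 bytes into
nv_suspend, which is 0x350 bytes long, in the forcedeth module. A
minimal sketch of splitting such a string into its parts follows; the
struct and function names (`oops_location`, `parse_oops_location`) are
made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parsed form of "symbol+0xOFF/0xSIZE [module]" (names are illustrative). */
struct oops_location {
    char symbol[64];
    unsigned long offset;   /* bytes from the start of the function */
    unsigned long size;     /* total size of the function in bytes  */
    char module[64];        /* empty for built-in (non-module) code */
};

/*
 * Parse a location such as "nv_suspend+0x85/0x350 [forcedeth]".
 * Returns 0 on success, -1 if the text does not match the format.
 * The "[module]" suffix is optional.
 */
int parse_oops_location(const char *text, struct oops_location *out)
{
    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* %lx accepts an optional 0x prefix; %63[^]] reads up to ']'. */
    int fields = sscanf(text, "%63[^+]+%lx/%lx [%63[^]]]",
                        out->symbol, &out->offset, &out->size, out->module);

    return fields >= 3 ? 0 : -1;
}
```

Given the offset, something like `objdump -d` on the module can then be
used to find the instruction at that position inside the function.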

Andrew


* Re: forcedeth oops
  2008-01-22 11:54 forcedeth oops Andrew Brooks
@ 2008-01-22 19:28 ` Jeff Garzik
  2008-01-28 15:31   ` Ayaz Abdulla
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff Garzik @ 2008-01-22 19:28 UTC (permalink / raw)
  To: Andrew Brooks; +Cc: netdev

Andrew Brooks wrote:
> Hello
> 
> I'm getting an oops in forcedeth whenever I shutdown, details below.
> 
> I've tried kernel 2.6.16.59 and the latest forcedeth.c from nvidia.com
> which is package-1.23 version-0.62 date-2007/04/27.
> 
> How can I download the latest forcedeth.c (including 2008-01-13 patches) ?
> It's not in the latest snapshot linux-2.6.24-rc8.
> 
> Also, why is the version on nvidia.com not just older than the one in
> the kernel, but it appears to have forked back in May 2006.  Has there
> been independent development on each version?  They should be the same!

We don't run nvidia.com here :)


> Here's the diff:
> <  *    0.56: 22 Mar 2006: Additional ethtool and moduleparam support.
> <  *    0.57: 14 May 2006: Moved mac address writes to nv_probe and nv_remove.
> <  *    0.58: 20 May 2006: Optimized rx and tx data paths.
> <  *    0.59: 31 May 2006: Added support for sideband management unit.
> <  *    0.60: 31 May 2006: Added support for recoverable error.
> <  *    0.61: 18 Jul 2006: Added support for suspend/resume.
> <  *    0.62: 16 Jan 2007: Fixed statistics, mgmt communication, and low phy speed on S5.
> ---
>>  *    0.56: 22 Mar 2006: Additional ethtool config and moduleparam support.
>>  *    0.57: 14 May 2006: Mac address set in probe/remove and order corrections.
>>  *    0.58: 30 Oct 2006: Added support for sideband management unit.
>>  *    0.59: 30 Oct 2006: Added support for recoverable error.
>>  *    0.60: 20 Jan 2007: Code optimizations for rings, rx & tx data paths, and stats.
> 
> 
> Here's the details of the oops:
> md: md0 switched to read-only mode.
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> f8ccdd55
> *pde = 36c6a001
> Oops: 0000 [#1]
> SMP
> Modules linked in: nvidia ... forcedeth ... sata_nv
> CPU: 1
> EIP:
> EFLAGS: 00010286 (2.6.16.59 #1)
> EIP is at nv_suspend+0x85/0x350 [forcedeth]
> eax:
> esi:
> ds:
> Process reboot
> Stack:
> Call Trace:
> show_stack_log
> show_registers
> die
> do_page_fault
> error_code
> nv_reboot_handler
> notifier_call_chain
> kernel_restart_prepare
> kernel_restart
> sys_reboot
> sysenter_past_esp
> Code: 8b 8c 3a 98 01 00 00 01 c8 8b ...
> INIT: no more processes left in this runlevel

Please reproduce this problem on a modern kernel (2.6.24-rc) without any 
closed source modules or drivers loaded.  Thanks.

	Jeff





* Re: forcedeth oops
  2008-01-22 19:28 ` Jeff Garzik
@ 2008-01-28 15:31   ` Ayaz Abdulla
  0 siblings, 0 replies; 4+ messages in thread
From: Ayaz Abdulla @ 2008-01-28 15:31 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Brooks, netdev



Jeff Garzik wrote:
> Andrew Brooks wrote:
>  > Hello
>  >
>  > I'm getting an oops in forcedeth whenever I shutdown, details below.
>  >
>  > I've tried kernel 2.6.16.59 and the latest forcedeth.c from nvidia.com
>  > which is package-1.23 version-0.62 date-2007/04/27.
>  >
>  > How can I download the latest forcedeth.c (including 2008-01-13 
> patches) ?
>  > It's not in the latest snapshot linux-2.6.24-rc8.
>  >
>  > Also, why is the version on nvidia.com not just older than the one in
>  > the kernel, but it appears to have forked back in May 2006.  Has there
>  > been independent development on each version?  They should be the same!
> 
> We don't run nvidia.com here :)
> 
> 
> 
> Please reproduce this problem on a modern kernel (2.6.24-rc) without any
> closed source modules or drivers loaded.  Thanks.

Andrew,

The driver from the nvidia.com site was forked because we needed to 
create a backport driver package for older kernels. At some point, we 
need to converge the two branches again.

Let us know if you still have an issue with the latest kernel as Jeff 
mentioned.

Regards,
Ayaz


