From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner+w=401wt.eu-S1754666AbXIQRCs@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754666AbXIQRCs (ORCPT <rfc822;w@1wt.eu>);
	Mon, 17 Sep 2007 13:02:48 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752691AbXIQRCm
	(ORCPT <rfc822;linux-kernel-outgoing>);
	Mon, 17 Sep 2007 13:02:42 -0400
Received: from www.sophics.cz ([194.108.6.2]:39567 "EHLO www.sophics.cz"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752685AbXIQRCl (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 17 Sep 2007 13:02:41 -0400
X-Greylist: delayed 2056 seconds by postgrey-1.27 at vger.kernel.org; Mon, 17 Sep 2007 13:02:40 EDT
Message-ID: <46EEAB26.6050400@sophics.cz>
Date: Mon, 17 Sep 2007 18:28:22 +0200
From: Petr Stehlik <pstehlik@sophics.cz>
User-Agent: Icedove 1.5.0.12 (X11/20070607)
MIME-Version: 1.0
To: linux-kernel@vger.kernel.org
Subject: forcedeth kernel panic
Content-Type: text/plain; charset=ISO-8859-2; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

An ASUS M2N32 WS Pro (nVidia MCP55 chipset) based machine with on-board
Gbit ethernet hits a kernel panic under high network load.

The machine is to be a Samba server and has a minimal 64-bit Debian Etch
installed. It first crashed with the stock Debian 2.6.18-amd64 kernel, so I
upgraded to 2.6.21 and finally to 2.6.22-2-amd64 (source from Debian).
The crashes varied per kernel but were always fatal (only a hard reset
helped), so I decided to post here as well (in addition to Debian's BTS
#442877).

The crash occurs under high network load generated by tserv from the dbench
package, within about 20 minutes of a tserv test run from another machine
against this machine (which is running tserv_srv).

Before it crashes it fills the kernel log with the following messages 
that may or may not be related to the crash:

Sep 17 14:51:27 harapes kernel: eth0: too many iterations (6) in nv_nic_irq.
Sep 17 14:51:58 harapes last message repeated 1026 times
Sep 17 14:52:59 harapes last message repeated 2063 times
Sep 17 14:54:00 harapes last message repeated 2055 times
Sep 17 14:55:01 harapes last message repeated 2044 times

I say they may not be related because I have an older nForce-based
machine here that runs tserv against the crashing server, and it
also fills its log with the same messages, but fortunately it does not
crash.
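
To put a number on how fast these messages arrive, here is a minimal Python
sketch that tallies the log excerpt above. It is only an illustration, not
part of any kernel tool: the function name summarize is made up, the year
2007 is supplied by hand because syslog timestamps omit it, and it assumes
every "last message repeated N times" line refers to the nv_nic_irq message.

```python
import re
from datetime import datetime

# Log excerpt copied from this report.
LOG = """\
Sep 17 14:51:27 harapes kernel: eth0: too many iterations (6) in nv_nic_irq.
Sep 17 14:51:58 harapes last message repeated 1026 times
Sep 17 14:52:59 harapes last message repeated 2063 times
Sep 17 14:54:00 harapes last message repeated 2055 times
Sep 17 14:55:01 harapes last message repeated 2044 times
"""

REPEAT = re.compile(r"last message repeated (\d+) times")


def summarize(log_text, year=2007):
    """Return (message count, elapsed seconds, messages per second)."""
    stamps = []
    total = 0
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # The first 15 characters are the syslog timestamp, e.g. "Sep 17 14:51:27".
        stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        match = REPEAT.search(line)
        # A "repeated N times" line stands for N messages, any other line for one.
        total += int(match.group(1)) if match else 1
    seconds = (stamps[-1] - stamps[0]).total_seconds()
    rate = total / seconds if seconds else float("nan")
    return total, seconds, rate


total, seconds, rate = summarize(LOG)
print(f"{total} messages in {seconds:.0f} s, about {rate:.1f} per second")
```

For the excerpt above this comes to 7189 messages over 214 seconds, i.e.
roughly 34 messages per second.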

After killing the machine several times in a row I googled a bit and
found some suggestions, so now I am testing a different setup: the
forcedeth driver loaded with the "optimization_mode=1" parameter. So far
(95 minutes of tserv run) it has not crashed.
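
For anyone wanting to reproduce this setup, one way to make the parameter
persistent is a modprobe options line. This is only a sketch: the file name
below is an assumption (any file under /etc/modprobe.d/ should do), and the
driver has to be reloaded, or the machine rebooted, before it takes effect.

```
# /etc/modprobe.d/forcedeth  (hypothetical file name)
# Ask forcedeth to use its CPU-optimized interrupt mode instead of the default.
options forcedeth optimization_mode=1
```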

More details about the hardware: AMD64 3600+ (=2GHz), 2GB of DDR2, 6
SATA drives in RAID1 and RAID5 configuration on the on-board SATA
controller, a PCI S3 graphics card and that's it.

dmesg output related to networking:

forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.60.
forcedeth: using HIGHDMA
eth0: forcedeth.c: subsystem: 01043:81fb bound to 0000:00:10.0
eth0: no IPv6 routers present


lspci -vv:

00:10.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
         Subsystem: ASUSTeK Computer Inc. Unknown device 81fb
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 0 (250ns min, 5000ns max)
         Interrupt: pin A routed to IRQ 1272
         Region 0: Memory at fe02a000 (32-bit, non-prefetchable) [size=4K]
         Region 1: I/O ports at b400 [size=8]
         Region 2: Memory at fe029000 (32-bit, non-prefetchable) [size=256]
         Region 3: Memory at fe028000 (32-bit, non-prefetchable) [size=16]
         Capabilities: [44] Power Management version 2
                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                 Status: D0 PME-Enable+ DSel=0 DScale=0 PME-
         Capabilities: [70] MSI-X: Enable- Mask- TabSize=8
                 Vector table: BAR=2 offset=00000000
                 PBA: BAR=3 offset=00000000
         Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Queue=0/3 Enable+
                 Address: 00000000fee0300c  Data: 4189
                 Masking: 000000fe  Pending: 00000000
         Capabilities: [6c] HyperTransport: MSI Mapping


The incomplete kernel panic dump hand-copied from the stuck console:

Call Trace:
<IRQ> :forcedeth: nv_nic_irq_optimized+0x89/0x22c
  handle_IRQ_event+0x25/0x53
  __do_softirq+0x55/0xc3
  handle_edge_irq+0xe4/0x127
  do_IRQ+0x6c/0xd5
  default_idle+0x0/0x3d
  ret_from_intr+0x0/0xa
  <EOI> default_idle+0x29/0x3d
  cpu_idle+0x8b/0xae

Code: 8a 83 84 00 00 00 83 e0 f3 83 c8 04 88 83 84 00 00 00 83 7b
RIP :forcedeth:nv_rx_process_optimized+0xe6/0x380
Kernel panic - not syncing: Aiee, killing interrupt handler!



I may have to replace the on-board ethernet with a PCI card because I
need a reliable server very soon, and once it is deployed I won't have
a chance to experiment with it anymore. So if there is anything I could
try now to get forcedeth fully stable, please let me know soon. Is
"optimization_mode=1" the right solution? What negative impact does it
have?

Thanks!

Petr