Re: 2.6.20->2.6.21 - networking dies after random time

netdev.vger.kernel.org archive mirror

* Re: 2.6.20->2.6.21 - networking dies after random time
       [not found] <4bacf17f0706161435g1bb7c08bpd427901f64d57fa@mail.gmail.com>
@ 2007-06-18 11:08 ` Jarek Poplawski
  2007-06-18 15:10   ` Stephen Hemminger
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-18 11:08 UTC (permalink / raw)
  To: Marcin Ślusarz; +Cc: linux-kernel, linux-net, netdev

On 16-06-2007 23:35, Marcin Ślusarz wrote:
> hi
> after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
> strange problem - my _both_ network cards dies after random uptime -
> sometimes it's a few minutes, sometimes hours, sometimes it does not
> happen for a couple of days...
> today it happened for the first time without nvidia module and almost
> immediately after system start
> 
> here is the output of some commands which might help debug this:
...
> [   21.726533] Write protecting the kernel read-only data: 1457k
> [   25.734316] ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 17 (level,
> low) -> IRQ 17
> [   25.734367] skge 1.10 addr 0xfab00000 irq 17 chip Yukon-Lite rev 9
> [   25.734763] skge eth0: addr 00:11:d8:60:74:55
> [   25.971279] ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
> [   25.971282]   http://www.scyld.com/network/ne2k-pci.html
> [   25.971364] ACPI: PCI Interrupt 0000:00:0c.0[A] -> GSI 17 (level,
> low) -> IRQ 17
> [   25.971691] eth1: Compex RL2000 found at 0xb000, IRQ 17, 
> 00:80:48:DE:5E:89.
> [   26.888372] Linux video capture interface: v2.00
> [   26.906732] bttv: driver version 0.9.17 loaded
...
> [   31.659572] Adding 1020112k swap on /dev/sda2.  Priority:-1
> extents:1 across:1020112k
> [   42.681974] skge eth1: enabling interface
> [   43.228729] NET: Registered protocol family 17
> [   46.429756] Time: acpi_pm clocksource has been installed.
> [   50.743512] NETDEV WATCHDOG: eth0: transmit timed out
> [   50.743521] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=574.
...

It looks like the skge driver enables a different device than it probed.
Maybe you've something old/wrong about eth0/eth1 in /etc configs?
You can also try with netdev= or pci= kernel parameters.
If no result - resend it, please - maybe with some debugging on
(modinfo skge). BTW - netdev seems to be preferred for this.
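
One way to check this mechanically is to list every interface name that
each driver mentions in the log. The following is only a minimal sketch,
assuming log lines shaped like the excerpt above (with any "> " quoting
already stripped); the helper name names_by_driver and the regular
expression are illustrative assumptions, not part of any kernel tool:

```python
import re
from collections import defaultdict

# Matches lines such as "[   25.734763] skge eth0: addr 00:11:d8:60:74:55"
# (timestamp, driver name, interface name). The pattern is an assumption
# based on the excerpt in this thread, not a general dmesg grammar.
LINE_RE = re.compile(r"^\[\s*\d+\.\d+\]\s+(\w+) (eth\d+): ")

def names_by_driver(log_text):
    """Return {driver: [interface names in order of first appearance]}."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(2) not in seen[m.group(1)]:
            seen[m.group(1)].append(m.group(2))
    return dict(seen)

sample = """\
[   25.734763] skge eth0: addr 00:11:d8:60:74:55
[   42.681974] skge eth1: enabling interface
[   50.743512] NETDEV WATCHDOG: eth0: transmit timed out
"""
for driver, names in names_by_driver(sample).items():
    if len(names) > 1:
        print(driver, "appears under several names:", ", ".join(names))
```

A driver showing up under two names is only a hint: it does not say
whether the name changed in the kernel or from user space.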

Regards,
Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-18 11:08 ` Jarek Poplawski
@ 2007-06-18 15:10   ` Stephen Hemminger
  2007-06-19  5:27     ` Jarek Poplawski
  2007-06-19  5:50     ` Jarek Poplawski
  0 siblings, 2 replies; 68+ messages in thread
From: Stephen Hemminger @ 2007-06-18 15:10 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Marcin Ślusarz, linux-kernel, linux-net, netdev

On Mon, 18 Jun 2007 13:08:49 +0200
Jarek Poplawski <jarkao2@o2.pl> wrote:

> On 16-06-2007 23:35, Marcin Ślusarz wrote:
> > hi
> > after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
> > strange problem - my _both_ network cards dies after random uptime -
> > sometimes it's a few minutes, sometimes hours, sometimes it does not
> > happen for a couple of days...
> > today it happened for the first time without nvidia module and almost
> > immediately after system start
> > 
> > here is the output of some commands which might help debug this:
> ...
> > [   21.726533] Write protecting the kernel read-only data: 1457k
> > [   25.734316] ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 17 (level,
> > low) -> IRQ 17
> > [   25.734367] skge 1.10 addr 0xfab00000 irq 17 chip Yukon-Lite rev 9
> > [   25.734763] skge eth0: addr 00:11:d8:60:74:55
> > [   25.971279] ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
> > [   25.971282]   http://www.scyld.com/network/ne2k-pci.html
> > [   25.971364] ACPI: PCI Interrupt 0000:00:0c.0[A] -> GSI 17 (level,
> > low) -> IRQ 17
> > [   25.971691] eth1: Compex RL2000 found at 0xb000, IRQ 17, 
> > 00:80:48:DE:5E:89.
> > [   26.888372] Linux video capture interface: v2.00
> > [   26.906732] bttv: driver version 0.9.17 loaded
> ...
> > [   31.659572] Adding 1020112k swap on /dev/sda2.  Priority:-1
> > extents:1 across:1020112k
> > [   42.681974] skge eth1: enabling interface
> > [   43.228729] NET: Registered protocol family 17
> > [   46.429756] Time: acpi_pm clocksource has been installed.
> > [   50.743512] NETDEV WATCHDOG: eth0: transmit timed out
> > [   50.743521] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=574.
> ...
> 
> It looks like the skge driver enables a different device than it probed.
> Maybe you've something old/wrong about eth0/eth1 in /etc configs?

More likely it is just user-level device renaming. Most distros
rename devices (if needed) using udev.
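
To see which driver currently sits behind each interface name after any
renaming, the sysfs links can be read directly. This is a rough sketch,
assuming the kernel exposes /sys/class/net/<name>/device/driver as a
symlink to the driver directory; the helper names driver_of and
drivers_by_interface are made up for illustration:

```python
import os

def driver_of(ifname, sysfs_root="/sys/class/net"):
    """Return the name of the driver bound to ifname, or None if unknown."""
    link = os.path.join(sysfs_root, ifname, "device", "driver")
    try:
        # The link target ends in the driver's directory name, e.g. ".../skge".
        return os.path.basename(os.readlink(link))
    except OSError:
        # No such interface, or no device/driver link (e.g. loopback).
        return None

def drivers_by_interface(sysfs_root="/sys/class/net"):
    """Return {interface name: driver name or None} for every entry."""
    return {name: driver_of(name, sysfs_root)
            for name in sorted(os.listdir(sysfs_root))}

# Only touch the real sysfs tree when it is actually present.
if os.path.isdir("/sys/class/net"):
    for ifname, driver in drivers_by_interface().items():
        print(ifname, driver or "(no device/driver link)")
```

Comparing this output with the names printed at probe time shows whether
eth0 and eth1 were swapped after the drivers loaded.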

> You can also try with netdev= or pci= kernel parameters.

Bad idea. 

> If no result - resend it, please - maybe with some debugging on
> (modinfo skge). BTW - netdev seems to be preferred for this.

What are the contents of /proc/interrupts?

-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-18 15:10   ` Stephen Hemminger
@ 2007-06-19  5:27     ` Jarek Poplawski
  2007-06-19  5:50     ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-19  5:27 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Marcin Ślusarz, linux-kernel, linux-net, netdev

On Mon, Jun 18, 2007 at 08:10:00AM -0700, Stephen Hemminger wrote:
> On Mon, 18 Jun 2007 13:08:49 +0200
> Jarek Poplawski <jarkao2@o2.pl> wrote:
...
> > It looks like the skge driver enables a different device than it probed.
> > Maybe you've something old/wrong about eth0/eth1 in /etc configs?
>
> More likely it is just user-level device renaming. Most distros
> rename devices (if needed) using udev.

I hope you're right, and the problem is resolved already, but for
the record I'd note that the original message, with quite a lot
of configs, is available on linux-kernel.

Regards,
Jarek P.

> 
> > You can also try with netdev= or pci= kernel parameters.
> 
> Bad idea. 
> 
> > If no result - resend it, please - maybe with some debugging on
> > (modinfo skge). BTW - netdev seems to be preferred for this.
> 
> What are the contents of /proc/interrupts?

--->

> On 16-06-2007 23:35, Marcin Ślusarz wrote:
...
> joi ~ # cat /proc/interrupts ; sleep 5; cat /proc/interrupts
>           CPU0
>  0:     891160   IO-APIC-edge      timer
>  1:       2218   IO-APIC-edge      i8042
>  8:          2   IO-APIC-edge      rtc
>  9:          1   IO-APIC-fasteoi   acpi
> 12:       9110   IO-APIC-edge      i8042
> 14:          0   IO-APIC-edge      libata
> 15:        122   IO-APIC-edge      libata
> 17:         12   IO-APIC-fasteoi   eth1, eth0
> 18:      57275   IO-APIC-fasteoi   bttv0
> 20:      18810   IO-APIC-fasteoi   libata
> 21:          0   IO-APIC-fasteoi   ehci_hcd:usb1
> 22:      77945   IO-APIC-fasteoi   VIA8237
> NMI:          0
> LOC:     890924
> ERR:          0
>           CPU0
>  0:     896221   IO-APIC-edge      timer
>  1:       2219   IO-APIC-edge      i8042
>  8:          2   IO-APIC-edge      rtc
>  9:          1   IO-APIC-fasteoi   acpi
> 12:       9110   IO-APIC-edge      i8042
> 14:          0   IO-APIC-edge      libata
> 15:        122   IO-APIC-edge      libata
> 17:         12   IO-APIC-fasteoi   eth1, eth0
> 18:      57654   IO-APIC-fasteoi   bttv0
> 20:      18813   IO-APIC-fasteoi   libata
> 21:          0   IO-APIC-fasteoi   ehci_hcd:usb1
> 22:      78421   IO-APIC-fasteoi   VIA8237
> NMI:          0
> LOC:     895984
> ERR:          0
> 
> joi ~ # cat /proc/ioports
> 0000-001f : dma1
> 0020-0021 : pic1
> 0040-0043 : timer0
> 0050-0053 : timer1
> 0060-006f : keyboard
> 0070-0077 : rtc
> 0080-008f : dma page reg
> 00a0-00a1 : pic2
> 00c0-00df : dma2
> 00f0-00ff : fpu
> 0170-0177 : 0000:00:0f.1
>  0170-0177 : libata
> 01f0-01f7 : 0000:00:0f.1
>  01f0-01f7 : libata
> 0290-0297 : pnp 00:09
> 02f8-02ff : serial
> 0376-0376 : 0000:00:0f.1
>  0376-0376 : libata
> 03c0-03df : vesafb
> 03f6-03f6 : 0000:00:0f.1
>  03f6-03f6 : libata
> 03f8-03ff : serial
> 0400-0407 : vt596_smbus
> 0680-06ff : pnp 00:09
> 0800-0803 : ACPI PM1a_EVT_BLK
> 0804-0805 : ACPI PM1a_CNT_BLK
> 0808-080b : ACPI PM_TMR
> 0810-0815 : ACPI CPU throttle
> 0820-0823 : ACPI GPE0_BLK
> 0cf8-0cff : PCI conf1
> 1000-10ff : 0000:00:11.6
> a800-a8ff : 0000:00:0a.0
>  a800-a8ff : skge
> b000-b01f : 0000:00:0c.0
>  b000-b01f : ne2k-pci
> b400-b4ff : 0000:00:0f.0
>  b400-b4ff : sata_via
> b800-b80f : 0000:00:0f.0
>  b800-b80f : sata_via
> c000-c003 : 0000:00:0f.0
>  c000-c003 : sata_via
> c400-c407 : 0000:00:0f.0
>  c400-c407 : sata_via
> c800-c803 : 0000:00:0f.0
>  c800-c803 : sata_via
> d000-d007 : 0000:00:0f.0
>  d000-d007 : sata_via
> d400-d41f : 0000:00:10.0
> d800-d81f : 0000:00:10.1
> e000-e01f : 0000:00:10.2
> e400-e41f : 0000:00:10.3
> e800-e8ff : 0000:00:11.5
>  e800-e8ff : VIA8237
> fc00-fc0f : 0000:00:0f.1
>  fc00-fc0f : libata
> 
> joi ~ # cat /proc/iomem
> 00000000-0009fbff : System RAM
> 0009fc00-0009ffff : reserved
> 000c0000-000dffff : pnp 00:0e
> 000e4000-000fffff : reserved
> 00100000-3ffaffff : System RAM
>  00200000-0059ebc7 : Kernel code
>  0059ebc8-0077248f : Kernel data
> 3ffb0000-3ffbffff : ACPI Tables
> 3ffc0000-3ffeffff : ACPI Non-volatile Storage
> 3fff0000-3fffffff : reserved
> e8000000-ebffffff : 0000:00:00.0
>  e8000000-ebffffff : aperture
> efe00000-efe00fff : 0000:00:0d.0
>  efe00000-efe00fff : bttv0
> eff00000-eff00fff : 0000:00:0d.1
> f0000000-f9ffffff : PCI Bus #01
>  f0000000-f7ffffff : 0000:01:00.0
>    f0000000-f7ffffff : vesafb
> faa00000-faa1ffff : 0000:00:0a.0
> fab00000-fab03fff : 0000:00:0a.0
>  fab00000-fab03fff : skge
> fac00000-fac07fff : 0000:00:0c.0
> fae00000-fae000ff : 0000:00:10.4
>  fae00000-fae000ff : ehci_hcd
> faf00000-fbffffff : PCI Bus #01
>  faf00000-faf1ffff : 0000:01:00.0
>  fb000000-fbffffff : 0000:01:00.0
> fec00000-fec00fff : IOAPIC 0
>  fec00000-fec00fff : pnp 00:0b
> fee00000-fee00fff : Local APIC
> ff780000-ffffffff : reserved
> 
> joi ~ # cat /proc/sys/kernel/tainted
> 0
...
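
In the two /proc/interrupts snapshots quoted above, taken five seconds
apart, the counter for IRQ 17 (eth1, eth0) reads 12 both times, while
bttv0 on IRQ 18 goes from 57275 to 57654. A comparison like that can be
automated. This is a minimal sketch, assuming a single CPU column as in
the quoted output; the function names are illustrative only:

```python
def parse_interrupts(text):
    """Map IRQ label -> (count, description) for a single-CPU snapshot."""
    table = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split(None, 1)
        if not sep or not fields or not fields[0].isdigit():
            continue  # header line ("CPU0") or anything unparseable
        desc = fields[1].strip() if len(fields) > 1 else ""
        table[label.strip()] = (int(fields[0]), desc)
    return table

def deltas(before, after):
    """Return {irq: (count increase, description)} for IRQs in both snapshots."""
    return {irq: (after[irq][0] - before[irq][0], after[irq][1])
            for irq in after if irq in before}

first = """\
          CPU0
 17:         12   IO-APIC-fasteoi   eth1, eth0
 18:      57275   IO-APIC-fasteoi   bttv0
"""
second = """\
          CPU0
 17:         12   IO-APIC-fasteoi   eth1, eth0
 18:      57654   IO-APIC-fasteoi   bttv0
"""
changes = deltas(parse_interrupts(first), parse_interrupts(second))
for irq, (delta, desc) in changes.items():
    print(irq, delta, desc)
```

A delta of zero on the network IRQ while traffic is expected fits the
"lost interrupt?" message from the driver, but does not by itself say
why the interrupts stopped.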


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-18 15:10   ` Stephen Hemminger
  2007-06-19  5:27     ` Jarek Poplawski
@ 2007-06-19  5:50     ` Jarek Poplawski
  2007-06-22  8:56       ` Marcin Ślusarz
  1 sibling, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-19  5:50 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Marcin Ślusarz, linux-kernel, linux-net, netdev

On Mon, Jun 18, 2007 at 08:10:00AM -0700, Stephen Hemminger wrote:
> On Mon, 18 Jun 2007 13:08:49 +0200
> Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > On 16-06-2007 23:35, Marcin Ślusarz wrote:
> > > hi
> > > after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
> > > strange problem - my _both_ network cards dies after random uptime -
> > > sometimes it's a few minutes, sometimes hours, sometimes it does not
> > > happen for a couple of days...
> > > today it happened for the first time without nvidia module and almost
> > > immediately after system start
> > > 
> > > here is the output of some commands which might help debug this:
...
> > It looks like the skge driver enables a different device than it probed.
> > Maybe you've something old/wrong about eth0/eth1 in /etc configs?
>
> More likely it is just user-level device renaming. Most distros
> rename devices (if needed) using udev.

On the other hand, it's interesting why it doesn't always happen, and
why it sometimes takes so long.

Jarek P. 


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-19  5:50     ` Jarek Poplawski
@ 2007-06-22  8:56       ` Marcin Ślusarz
  2007-06-22 13:32         ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Marcin Ślusarz @ 2007-06-22  8:56 UTC (permalink / raw)
  To: Jarek Poplawski, Stephen Hemminger, linux-kernel, linux-net,
	netdev

2007/6/19, Jarek Poplawski <jarkao2@o2.pl>:
> On Mon, Jun 18, 2007 at 08:10:00AM -0700, Stephen Hemminger wrote:
> > On Mon, 18 Jun 2007 13:08:49 +0200
> > Jarek Poplawski <jarkao2@o2.pl> wrote:
> >
> > > On 16-06-2007 23:35, Marcin Ślusarz wrote:
> > > > hi
> > > > after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
> > > > strange problem - my _both_ network cards dies after random uptime -
> > > > sometimes it's a few minutes, sometimes hours, sometimes it does not
> > > > happen for a couple of days...
> > > > today it happened for the first time without nvidia module and almost
> > > > immediately after system start
> > > >
> > > > here is the output of some commands which might help debug this:
> ...
> > > It looks like the skge driver enables a different device than it probed.
> > > Maybe you've something old/wrong about eth0/eth1 in /etc configs?
> >
> > More likely it is just user-level device renaming. Most distros
> > rename devices (if needed) using udev.
>
> On the other hand, it's interesting why it doesn't always happen, and
> why it sometimes takes so long.

I'm sorry for the delay, but I was offline for the last week and
probably will be for some time :|

When I disable the on-board network card (controlled by skge) in the
BIOS, the ne2k-pci card still locks up. So I think it's strictly a
ne2k-pci bug. I ran some tests and I know how to reproduce it quickly
(on my machine): just generate some heavy network traffic.
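
When repeating such a test, the moment of each lock-up can be read from
the kernel log instead of by watching the console. This is a small
sketch, assuming the watchdog message has the same wording as in the
first report ("NETDEV WATCHDOG: eth0: transmit timed out"); the names
WATCHDOG_RE and watchdog_events are illustrative:

```python
import re

# Timestamp in seconds since boot, then the interface the watchdog names.
WATCHDOG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NETDEV WATCHDOG: (\S+): transmit timed out")

def watchdog_events(log_text):
    """Return [(seconds_since_boot, interface), ...] in log order."""
    return [(float(m.group(1)), m.group(2))
            for m in WATCHDOG_RE.finditer(log_text)]

sample = """\
[   43.228729] NET: Registered protocol family 17
[   50.743512] NETDEV WATCHDOG: eth0: transmit timed out
"""
for when, ifname in watchdog_events(sample):
    print(f"{ifname} timed out {when:.1f} s after boot")
```

Running it over the logs of several boots gives a quick picture of how
long the card survives under load on each kernel.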

As I'm offline right now I can't bisect it, but I turned on more
debugging; maybe you can deduce something...

[    0.000000] Linux version 2.6.21.3 (root@joi) (gcc version 4.1.2 (Gentoo 4.1.2)) #4 PREEMPT Wed Jun 20 22:37:05 CEST 2007
[    0.000000] Command line: root=/dev/sda5 video=vesafb vga=794
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000003ffb0000 (usable)
[    0.000000]  BIOS-e820: 000000003ffb0000 - 000000003ffc0000 (ACPI data)
[    0.000000]  BIOS-e820: 000000003ffc0000 - 000000003fff0000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000003fff0000 - 0000000040000000 (reserved)
[    0.000000]  BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
[    0.000000] Entering add_active_range(0, 0, 159) 0 entries of 256 used
[    0.000000] Entering add_active_range(0, 256, 262064) 1 entries of 256 used
[    0.000000] end_pfn_map = 1048576
[    0.000000] DMI 2.3 present.
[    0.000000] ACPI: RSDP 000FA810, 0021 (r2 ACPIAM)
[    0.000000] ACPI: XSDT 3FFB0100, 003C (r1 A M I  OEMXSDT  10000427 MSFT       97)
[    0.000000] ACPI: FACP 3FFB0290, 00F4 (r3 A M I  OEMFACP  10000427 MSFT       97)
[    0.000000] ACPI: DSDT 3FFB03E0, 38A1 (r1  A0036 A0036001        1 MSFT  100000D)
[    0.000000] ACPI: FACS 3FFC0000, 0040
[    0.000000] ACPI: APIC 3FFB0390, 004A (r1 A M I  OEMAPIC  10000427 MSFT       97)
[    0.000000] ACPI: OEMB 3FFC0040, 003F (r1 A M I  OEMBIOS  10000427 MSFT       97)
[    0.000000] Entering add_active_range(0, 0, 159) 0 entries of 256 used
[    0.000000] Entering add_active_range(0, 256, 262064) 1 entries of 256 used
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA             0 ->     4096
[    0.000000]   DMA32        4096 ->  1048576
[    0.000000]   Normal    1048576 ->  1048576
[    0.000000] early_node_map[2] active PFN ranges
[    0.000000]     0:        0 ->      159
[    0.000000]     0:      256 ->   262064
[    0.000000] On node 0 totalpages: 261967
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 2549 pages reserved
[    0.000000]   DMA zone: 1394 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 3526 pages used for memmap
[    0.000000]   DMA32 zone: 254442 pages, LIFO batch:31
[    0.000000]   Normal zone: 0 pages used for memmap
[    0.000000] Looks like a VIA chipset. Disabling IOMMU. Override with iommu=allowed
[    0.000000] ACPI: PM-Timer IO Port: 0x808
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[    0.000000] Processor #0 (Bootup-CPU)
[    0.000000] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 1, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Setting APIC routing to flat
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] Nosave address range: 000000000009f000 - 00000000000a0000
[    0.000000] Nosave address range: 00000000000a0000 - 00000000000e4000
[    0.000000] Nosave address range: 00000000000e4000 - 0000000000100000
[    0.000000] Allocating PCI resources starting at 50000000 (gap: 40000000:bf780000)
[    0.000000] Built 1 zonelists.  Total pages: 255836
[    0.000000] Kernel command line: root=/dev/sda5 video=vesafb vga=794
[    0.000000] Initializing CPU#0
[    0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[    0.000000] Extended CMOS year: 2000
[   26.044738] time.c: Detected 2002.658 MHz processor.
[   26.044799] Console: colour dummy device 80x25
[   26.044856] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[   26.044860] ... MAX_LOCKDEP_SUBCLASSES:    8
[   26.044863] ... MAX_LOCK_DEPTH:          30
[   26.044866] ... MAX_LOCKDEP_KEYS:        2048
[   26.044868] ... CLASSHASH_SIZE:           1024
[   26.044871] ... MAX_LOCKDEP_ENTRIES:     8192
[   26.044874] ... MAX_LOCKDEP_CHAINS:      16384
[   26.044876] ... CHAINHASH_SIZE:          8192
[   26.044879]  memory used by lock dependency info: 1648 kB
[   26.044882]  per task-struct memory footprint: 1680 bytes
[   26.045542] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
[   26.046642] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
[   26.058847] Memory: 1021348k/1048256k available (3717k kernel code, 26216k reserved, 1875k data, 224k init)
[   26.118666] Calibrating delay using timer specific routine.. 4008.70 BogoMIPS (lpj=2004351)
[   26.118841] Mount-cache hash table entries: 256
[   26.119520] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[   26.119525] CPU: L2 Cache: 512K (64 bytes/line)
[   26.119537] CPU: AMD Athlon(tm) 64 Processor 3200+ stepping 00
[   26.119548] ACPI: Core revision 20070126
[   26.127970] Parsing all Control Methods:
[   26.128081] Table [DSDT](id 0001) - 543 Objects with 51 Devices 146 Methods 25 Regions
[   26.128087]  tbxface-0587 [02] tb_load_namespace     : ACPI Tables successfully acquired
[   26.128311] evxfevnt-0091 [02] enable                : Transition to ACPI mode successful
[   26.139359] Using local APIC timer interrupts.
[   26.189266] result 12516628
[   26.189268] Detected 12.516 MHz APIC timer.
[   26.190215] NET: Registered protocol family 16
[   26.190738] ACPI: bus type pci registered
[   26.190758] PCI: Using configuration type 1
[   26.197813] evgpeblk-0952 [04] ev_create_gpe_block   : GPE 00 to 0F [_GPE] 2 regs on int 0x9
[   26.203204] evgpeblk-1049 [03] ev_initialize_gpe_bloc: Found 7 Wake, Enabled 0 Runtime GPEs in this block
[   26.203623] Completing Region/Field/Buffer/Package
initialization:..........................................................................................................................
[   26.213486] Initialized 24/25 Regions 44/44 Fields 41/41 Buffers 13/14 Packages (552 nodes)
[   26.213492] Initializing Device/Processor/Thermal objects by executing _INI methods:
[   26.213537] Executed 0 _INI methods requiring 0 _STA executions (examined 54 objects)
[   26.213622] ACPI: Interpreter enabled
[   26.213626] ACPI: (supports S0 S1 S3 S4 S5)
[   26.213712] ACPI: Using IOAPIC for interrupt routing
[   26.242059] ACPI: PCI Root Bridge [PCI0] (0000:00)
[   26.242108] PCI: Probing PCI hardware (bus 00)
[   26.242281] PCI: Scanning bus 0000:00
[   26.242339] PCI: Found 0000:00:00.0 [1106/0282] 000600 00
[   26.242377] PCI: Calling quirk ffffffff8051f880 for 0000:00:00.0
[   26.242405] PCI: Found 0000:00:00.1 [1106/1282] 000600 00
[   26.242440] PCI: Calling quirk ffffffff8051f880 for 0000:00:00.1
[   26.242467] PCI: Found 0000:00:00.2 [1106/2282] 000600 00
[   26.242503] PCI: Calling quirk ffffffff8051f880 for 0000:00:00.2
[   26.242530] PCI: Found 0000:00:00.3 [1106/3282] 000600 00
[   26.242571] PCI: Calling quirk ffffffff8051f880 for 0000:00:00.3
[   26.242598] PCI: Found 0000:00:00.4 [1106/4282] 000600 00
[   26.242634] PCI: Calling quirk ffffffff8051f880 for 0000:00:00.4
[   26.242663] PCI: Found 0000:00:00.7 [1106/7282] 000600 00
[   26.242698] PCI: Calling quirk ffffffff8051f880 for 0000:00:00.7
[   26.242731] PCI: Found 0000:00:01.0 [1106/b188] 000604 01
[   26.242749] PCI: Calling quirk ffffffff8051f880 for 0000:00:01.0
[   26.242789] PCI: Found 0000:00:0c.0 [11f6/1401] 000200 00
[   26.242832] PCI: Calling quirk ffffffff8051f880 for 0000:00:0c.0
[   26.242872] PCI: Found 0000:00:0d.0 [109e/036e] 000400 00
[   26.242913] PCI: Calling quirk ffffffff8051f880 for 0000:00:0d.0
[   26.242954] PCI: Found 0000:00:0d.1 [109e/0878] 000480 00
[   26.242994] PCI: Calling quirk ffffffff8051f880 for 0000:00:0d.1
[   26.243039] PCI: Found 0000:00:0f.0 [1106/3149] 000104 00
[   26.243083] PCI: Calling quirk ffffffff8051f880 for 0000:00:0f.0
[   26.243118] PCI: Found 0000:00:0f.1 [1106/0571] 000101 00
[   26.243163] PCI: Calling quirk ffffffff8051f880 for 0000:00:0f.1
[   26.243205] PCI: Found 0000:00:10.0 [1106/3038] 000c03 00
[   26.243246] PCI: Calling quirk ffffffff8051f880 for 0000:00:10.0
[   26.243284] PCI: Found 0000:00:10.1 [1106/3038] 000c03 00
[   26.243325] PCI: Calling quirk ffffffff8051f880 for 0000:00:10.1
[   26.243360] PCI: Found 0000:00:10.2 [1106/3038] 000c03 00
[   26.243401] PCI: Calling quirk ffffffff8051f880 for 0000:00:10.2
[   26.243436] PCI: Found 0000:00:10.3 [1106/3038] 000c03 00
[   26.243477] PCI: Calling quirk ffffffff8051f880 for 0000:00:10.3
[   26.243511] PCI: Found 0000:00:10.4 [1106/3104] 000c03 00
[   26.243579] PCI: Calling quirk ffffffff8051f880 for 0000:00:10.4
[   26.243618] PCI: Found 0000:00:11.0 [1106/3227] 000601 00
[   26.243660] PCI: Calling quirk ffffffff80409e90 for 0000:00:11.0
[   26.243666] PCI: enabled onboard AC97/MC97 devices
[   26.243671] PCI: Calling quirk ffffffff80409de0 for 0000:00:11.0
[   26.243675] PCI: Calling quirk ffffffff804090f0 for 0000:00:11.0
[   26.243678] PCI: Calling quirk ffffffff8051f880 for 0000:00:11.0
[   26.243718] PCI: Found 0000:00:11.5 [1106/3059] 000401 00
[   26.243762] PCI: Calling quirk ffffffff8051f880 for 0000:00:11.5
[   26.243797] PCI: Found 0000:00:11.6 [1106/3068] 000780 00
[   26.243841] PCI: Calling quirk ffffffff8051f880 for 0000:00:11.6
[   26.243877] PCI: Found 0000:00:18.0 [1022/1100] 000600 00
[   26.243898] PCI: Calling quirk ffffffff8051f880 for 0000:00:18.0
[   26.243922] PCI: Found 0000:00:18.1 [1022/1101] 000600 00
[   26.243944] PCI: Calling quirk ffffffff8051f880 for 0000:00:18.1
[   26.243968] PCI: Found 0000:00:18.2 [1022/1102] 000600 00
[   26.243990] PCI: Calling quirk ffffffff8051f880 for 0000:00:18.2
[   26.244013] PCI: Found 0000:00:18.3 [1022/1103] 000600 00
[   26.244035] PCI: Calling quirk ffffffff8051f880 for 0000:00:18.3
[   26.244050] PCI: Fixups for bus 0000:00
[   26.244054] PCI: Scanning behind PCI bridge 0000:00:01.0, config 010100, pass 0
[   26.244182] PCI: Scanning bus 0000:01
[   26.244213] PCI: Found 0000:01:00.0 [10de/0322] 000300 00
[   26.244245] PCI: Calling quirk ffffffff8051f880 for 0000:01:00.0
[   26.244250] Boot video device is 0000:01:00.0
[   26.244280] PCI: Fixups for bus 0000:01
[   26.244289] PCI: Bus scan for 0000:01 returning with max=01
[   26.244295] PCI: Scanning behind PCI bridge 0000:00:01.0, config 010100, pass 1
[   26.244305] PCI: Bus scan for 0000:00 returning with max=01
[   26.244318] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[   26.271155] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 7 10 *11 14 15)
[   26.271414] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 7 *10 11 14 15)
[   26.271674] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 7 10 11 14 15)
[   26.271910] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 7 10 11 14 15) *0, disabled.
[   26.272145] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 7 10 11 14 15) *0, disabled.
[   26.272381] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 7 10 11 14 15) *0, disabled.
[   26.272624] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 7 10 11 14 15) *0, disabled.
[   26.272869] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 7 10 11 14 15) *0, disabled.
[   26.273098] Linux Plug and Play Support v0.97 (c) Adam Belay
[   26.273140] pnp: PnP ACPI init
[   26.285974] pnp: PnP ACPI: found 15 devices
[   26.287356] SCSI subsystem initialized
[   26.287450] libata version 2.20 loaded.
[   26.287710] usbcore: registered new interface driver usbfs
[   26.287860] usbcore: registered new interface driver hub
[   26.288016] usbcore: registered new device driver usb
[   26.288228] PCI: Using ACPI for IRQ routing
[   26.288233] PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
[   26.288627] agpgart: Detected AGP bridge 0
[   26.291744] agpgart: AGP aperture is 64M @ 0xe8000000
[   26.292009] pnp: 00:09: ioport range 0x680-0x6ff has been reserved
[   26.292016] pnp: 00:09: ioport range 0x290-0x297 has been reserved
[   26.292036] pnp: 00:0b: iomem range 0xfec00000-0xfec00fff has been reserved
[   26.292042] pnp: 00:0b: iomem range 0xfee00000-0xfee00fff could not be reserved
[   26.292049] pnp: 00:0b: iomem range 0xfff80000-0xffffffff could not be reserved
[   26.292064] pnp: 00:0e: iomem range 0x0-0x9ffff could not be reserved
[   26.292069] pnp: 00:0e: iomem range 0xc0000-0xdffff has been reserved
[   26.292074] pnp: 00:0e: iomem range 0xe0000-0xfffff could not be reserved
[   26.292080] pnp: 00:0e: iomem range 0x100000-0x3ffeffff could not be reserved
[   26.292513] Time: tsc clocksource has been installed.
[   26.293377]   got res [1000:10ff] bus [1000:10ff] flags 101 for BAR 0 of 0000:00:11.6
[   26.293383] PCI: moved device 0000:00:11.6 resource 0 (101) to 1000
[   26.293386] PCI: Bridge: 0000:00:01.0
[   26.293389]   IO window: disabled.
[   26.293395]   MEM window: faf00000-fbffffff
[   26.293401]   PREFETCH window: f0000000-f9ffffff
[   26.293420] PCI: Calling quirk ffffffff80409bd0 for 0000:00:01.0
[   26.293426] PCI: Setting latency timer of device 0000:00:01.0 to 64
[   26.293487] NET: Registered protocol family 2
[   26.301599] IP route cache hash table entries: 32768 (order: 6, 262144 bytes)
[   26.302033] TCP established hash table entries: 32768 (order: 9, 2621440 bytes)
[   26.304812] TCP bind hash table entries: 32768 (order: 9, 2621440 bytes)
[   26.308521] TCP: Hash tables configured (established 32768 bind 32768)
[   26.308552] TCP reno registered
[   26.314409] Total HugeTLB memory allocated, 0
[   26.315431] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[   26.316913] NTFS driver 2.1.28 [Flags: R/W].
[   26.316969] fuse init (API version 7.8)
[   26.317480] io scheduler noop registered
[   26.317509] io scheduler cfq registered (default)
[   26.317516] PCI: Calling quirk ffffffff8040a780 for 0000:00:00.0
[   26.317520] PCI: Calling quirk ffffffff804b85d0 for 0000:00:00.0
[   26.317525] PCI: Calling quirk ffffffff8040a780 for 0000:00:00.1
[   26.317528] PCI: Calling quirk ffffffff804b85d0 for 0000:00:00.1
[   26.317532] PCI: Calling quirk ffffffff8040a780 for 0000:00:00.2
[   26.317535] PCI: Calling quirk ffffffff804b85d0 for 0000:00:00.2
[   26.317539] PCI: Calling quirk ffffffff8040a780 for 0000:00:00.3
[   26.317541] PCI: Calling quirk ffffffff804b85d0 for 0000:00:00.3
[   26.317545] PCI: Calling quirk ffffffff8040a780 for 0000:00:00.4
[   26.317548] PCI: Calling quirk ffffffff804b85d0 for 0000:00:00.4
[   26.317552] PCI: Calling quirk ffffffff8040a780 for 0000:00:00.7
[   26.317555] PCI: Calling quirk ffffffff804b85d0 for 0000:00:00.7
[   26.317559] PCI: Calling quirk ffffffff8040a780 for 0000:00:01.0
[   26.317562] PCI: Calling quirk ffffffff804b85d0 for 0000:00:01.0
[   26.317566] PCI: Calling quirk ffffffff8040a780 for 0000:00:0c.0
[   26.317569] PCI: Calling quirk ffffffff804b85d0 for 0000:00:0c.0
[   26.317572] PCI: Calling quirk ffffffff8040a780 for 0000:00:0d.0
[   26.317575] PCI: Calling quirk ffffffff804b85d0 for 0000:00:0d.0
[   26.317579] PCI: Calling quirk ffffffff8040a780 for 0000:00:0d.1
[   26.317582] PCI: Calling quirk ffffffff804b85d0 for 0000:00:0d.1
[   26.317586] PCI: Calling quirk ffffffff8040a780 for 0000:00:0f.0
[   26.317589] PCI: Calling quirk ffffffff804b85d0 for 0000:00:0f.0
[   26.317593] PCI: Calling quirk ffffffff8040a780 for 0000:00:0f.1
[   26.317596] PCI: Calling quirk ffffffff804b85d0 for 0000:00:0f.1
[   26.317600] PCI: Calling quirk ffffffff8040a780 for 0000:00:10.0
[   26.317603] PCI: Calling quirk ffffffff804b85d0 for 0000:00:10.0
[   26.317620] PCI: Calling quirk ffffffff8040a780 for 0000:00:10.1
[   26.317623] PCI: Calling quirk ffffffff804b85d0 for 0000:00:10.1
[   26.317638] PCI: Calling quirk ffffffff8040a780 for 0000:00:10.2
[   26.317641] PCI: Calling quirk ffffffff804b85d0 for 0000:00:10.2
[   26.317655] PCI: Calling quirk ffffffff8040a780 for 0000:00:10.3
[   26.317658] PCI: Calling quirk ffffffff804b85d0 for 0000:00:10.3
[   26.317673] PCI: Calling quirk ffffffff8040a780 for 0000:00:10.4
[   26.317676] PCI: Calling quirk ffffffff804b85d0 for 0000:00:10.4
[   26.317708] PCI: Calling quirk ffffffff8040a780 for 0000:00:11.0
[   26.317710] PCI: Calling quirk ffffffff80409ae0 for 0000:00:11.0
[   26.317715] PCI: Calling quirk ffffffff804b85d0 for 0000:00:11.0
[   26.317719] PCI: Calling quirk ffffffff8040a780 for 0000:00:11.5
[   26.317722] PCI: Calling quirk ffffffff804b85d0 for 0000:00:11.5
[   26.317726] PCI: Calling quirk ffffffff8040a780 for 0000:00:11.6
[   26.317729] PCI: Calling quirk ffffffff804b85d0 for 0000:00:11.6
[   26.317733] PCI: Calling quirk ffffffff8040a780 for 0000:00:18.0
[   26.317736] PCI: Calling quirk ffffffff804b85d0 for 0000:00:18.0
[   26.317740] PCI: Calling quirk ffffffff8040a780 for 0000:00:18.1
[   26.317743] PCI: Calling quirk ffffffff804b85d0 for 0000:00:18.1
[   26.317746] PCI: Calling quirk ffffffff8040a780 for 0000:00:18.2
[   26.317749] PCI: Calling quirk ffffffff804b85d0 for 0000:00:18.2
[   26.317753] PCI: Calling quirk ffffffff8040a780 for 0000:00:18.3
[   26.317756] PCI: Calling quirk ffffffff804b85d0 for 0000:00:18.3
[   26.317760] PCI: Calling quirk ffffffff8040a780 for 0000:01:00.0
[   26.317763] PCI: Calling quirk ffffffff804b85d0 for 0000:01:00.0
[   26.318431] vesafb: framebuffer at 0xf0000000, mapped to 0xffffc20000080000, using 5120k, total 131072k
[   26.318437] vesafb: mode is 1280x1024x16, linelength=2560, pages=1
[   26.318441] vesafb: scrolling: redraw
[   26.318444] vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
[   26.379031] Console: switching to colour frame buffer device 160x64
[   26.434727] fb0: VESA VGA frame buffer device
[   26.435525] input: Power Button (FF) as /class/input/input0
[   26.435908] ACPI: Power Button (FF) [PWRF]
[   26.436414] input: Power Button (CM) as /class/input/input1
[   26.436794] ACPI: Power Button (CM) [PWRB]
[   26.437264] input: Sleep Button (CM) as /class/input/input2
[   26.437653] ACPI: Sleep Button (CM) [SLPB]
[   26.536507] Real Time Clock Driver v1.12ac
[   26.537072] Linux agpgart interface v0.102 (c) Dave Jones
[   26.537491] Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).
[   26.538092] Hangcheck: Using get_cycles().
[   26.540885] RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
[   26.542705] loop: loaded (max 8 devices)
[   26.543826] sata_via 0000:00:0f.0: version 2.1
[   26.543851] ACPI: PCI Interrupt 0000:00:0f.0[B] -> GSI 20 (level,
low) -> IRQ 20
[   26.544403] PCI: Calling quirk ffffffff80409bd0 for 0000:00:0f.0
[   26.544435] sata_via 0000:00:0f.0: routed to hard irq line 10
[   26.545008] ata1: SATA max UDMA/133 cmd 0x000000000001d000 ctl
0x000000000001c802 bmdma 0x000000000001b800 irq 20
[   26.545812] ata2: SATA max UDMA/133 cmd 0x000000000001c400 ctl
0x000000000001c002 bmdma 0x000000000001b808 irq 20
[   26.546553] scsi0 : sata_via
[   26.747139] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   26.908768] ATA: abnormal status 0x7F on port 0x000000000001d007
[   26.919916] ATA: abnormal status 0x7F on port 0x000000000001d007
[   26.939155] ata1.00: ATA-6: WDC WD1600JD-00HBB0, 08.02D08, max UDMA/133
[   26.939600] ata1.00: 312581808 sectors, multi 16: LBA48
[   26.960135] ata1.00: configured for UDMA/133
[   26.960427] scsi1 : sata_via
[   27.160795] ata2: SATA link down 1.5 Gbps (SStatus 0 SControl 300)
[   27.171943] ATA: abnormal status 0x7F on port 0x000000000001c407
[   27.172737] scsi 0:0:0:0: Direct-Access     ATA      WDC WD1600JD-00H 08.0 PQ: 0 ANSI: 5
[   27.173857] SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
[   27.174329] sda: Write Protect is off
[   27.174575] sda: Mode Sense: 00 3a 00 00
[   27.174615] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   27.175637] SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
[   27.176118] sda: Write Protect is off
[   27.176364] sda: Mode Sense: 00 3a 00 00
[   27.176404] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   27.189170]  sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 sda10 >
[   27.268831] sd 0:0:0:0: Attached scsi disk sda
[   27.282334] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   27.295873] pata_via 0000:00:0f.1: version 0.2.1
[   27.295910] ACPI: PCI Interrupt 0000:00:0f.1[A] -> GSI 20 (level, low) -> IRQ 20
[   27.309842] PCI: Calling quirk ffffffff80409bd0 for 0000:00:0f.1
[   27.309989] ata3: PATA max UDMA/133 cmd 0x00000000000101f0 ctl 0x00000000000103f6 bmdma 0x000000000001fc00 irq 14
[   27.324748] ata4: PATA max UDMA/133 cmd 0x0000000000010170 ctl 0x0000000000010376 bmdma 0x000000000001fc08 irq 15
[   27.339501] scsi2 : pata_via
[   27.515698] ATA: abnormal status 0x8 on port 0x00000000000101f7
[   27.530966] scsi3 : pata_via
[   27.867411] ata4.00: ATAPI, max UDMA/33
[   28.055247] ata4.00: configured for UDMA/33
[   28.074812] scsi 3:0:0:0: CD-ROM            HL-DT-ST DVDRAM GSA-4163B A102 PQ: 0 ANSI: 5
[   28.109938] sr0: scsi3-mmc drive: 40x/40x writer dvd-ram cd/rw xa/form2 cdda tray
[   28.126803] Uniform CD-ROM driver Revision: 3.20
[   28.144093] sr 3:0:0:0: Attached scsi CD-ROM sr0
[   28.144346] sr 3:0:0:0: Attached scsi generic sg1 type 5
[   28.161743] usbmon: debugfs is not available
[   28.179418] ACPI: PCI Interrupt 0000:00:10.4[C] -> GSI 21 (level, low) -> IRQ 21
[   28.197538] PCI: Calling quirk ffffffff80409bd0 for 0000:00:10.4
[   28.197560] ehci_hcd 0000:00:10.4: EHCI Host Controller
[   28.216428] ehci_hcd 0000:00:10.4: new USB bus registered, assigned bus number 1
[   28.235286] ehci_hcd 0000:00:10.4: irq 21, io mem 0xfae00000
[   28.254191] ehci_hcd 0000:00:10.4: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
[   28.274102] usb usb1: configuration #1 chosen from 1 choice
[   28.293861] hub 1-0:1.0: USB hub found
[   28.313588] hub 1-0:1.0: 8 ports detected
[   28.434475] Initializing USB Mass Storage driver...
[   28.454021] usbcore: registered new interface driver usb-storage
[   28.473538] USB Mass Storage support registered.
[   28.493132] usbcore: registered new interface driver usbhid
[   28.512623] drivers/usb/input/hid-core.c: v2.6:USB HID core driver
[   28.532461] PNP: PS/2 Controller [PNP0303:PS2K,PNP0f03:PS2M] at 0x60,0x64 irq 1,12
[   28.552815] serio: i8042 KBD port at 0x60,0x64 irq 1
[   28.572776] serio: i8042 AUX port at 0x60,0x64 irq 12
[   28.592891] mice: PS/2 mouse device common for all mice
[   28.642647] input: AT Translated Set 2 keyboard as /class/input/input3
[   28.666382] rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
[   28.686306] rtc_cmos: probe of 00:02 failed with error -16
[   28.705968] EDAC MC: Ver: 2.0.1 Jun 20 2007
[   28.725670] Advanced Linux Sound Architecture Driver Version 1.0.14rc3 (Wed Mar 14 07:25:50 2007 UTC).
[   29.304386] input: ImPS/2 Generic Wheel Mouse as /class/input/input4
[   29.327009] ACPI: PCI Interrupt 0000:00:11.5[C] -> GSI 22 (level, low) -> IRQ 22
[   29.346449] PCI: Calling quirk ffffffff80409bd0 for 0000:00:11.5
[   29.346818] PCI: Enabling bus mastering for device 0000:00:11.5
[   29.346823] PCI: Setting latency timer of device 0000:00:11.5 to 64
[   29.862503] ALSA device list:
[   29.881975]   #0: VIA 8237 with ALC850 at 0xe800, irq 22
[   29.901398] oprofile: using NMI interrupt.
[   29.920609] Netfilter messages via NETLINK v0.30.
[   29.939747] nf_conntrack version 0.5.0 (4094 buckets, 32752 max)
[   29.959648] ip_tables: (C) 2000-2006 Netfilter Core Team
[   29.979174] TCP cubic registered
[   29.998369] Initializing XFRM netlink socket
[   30.017657] NET: Registered protocol family 1
[   30.037162] powernow-k8: Found 1 AMD Athlon(tm) 64 Processor 3200+ processors (version 2.00.00)
[   30.057095] powernow-k8:    0 : fid 0xc (2000 MHz), vid 0x6
[   30.076731] powernow-k8:    1 : fid 0xa (1800 MHz), vid 0x8
[   30.096008] powernow-k8:    2 : fid 0x2 (1000 MHz), vid 0x12
[   30.115370] powernow-k8: ph2 null fid transition 0xc
[   30.134342] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[   30.181037] kjournald starting.  Commit interval 5 seconds
[   30.199803] EXT3-fs: mounted filesystem with ordered data mode.
[   30.218611] VFS: Mounted root (ext3 filesystem) readonly.
[   30.237165] Freeing unused kernel memory: 224k freed
[   30.255764] Write protecting the kernel read-only data: 1460k
[   33.753755] ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
[   33.753758]   http://www.scyld.com/network/ne2k-pci.html
[   33.753841] ACPI: PCI Interrupt 0000:00:0c.0[A] -> GSI 17 (level, low) -> IRQ 17
[   33.754191] eth0: Compex RL2000 found at 0xb000, IRQ 17, 00:80:48:DE:5E:89.
[   34.554337] Linux video capture interface: v2.00
[   34.575361] bttv: driver version 0.9.17 loaded
[   34.575365] bttv: using 8 buffers with 2080k (520 pages) each for capture
[   34.575474] bttv: Bt8xx card found (0).
[   34.575506] ACPI: PCI Interrupt 0000:00:0d.0[A] -> GSI 18 (level, low) -> IRQ 18
[   34.575520] bttv0: Bt878 (rev 17) at 0000:00:0d.0, irq: 18, latency: 64, mmio: 0xefe00000
[   34.575534] bttv0: using: Lifeview FlyVideo 2000 /FlyVideo A2/ Lifetec LT 9415 TV [LR90] [card=54,insmod option]
[   34.575571] bttv0: gpio: en=00000000, out=00000000 in=00d4dfe0 [init]
[   34.577304] bttv0: FlyVideo Radio=yes RemoteControl=yes Tuner=5 gpio=0xd4dfe0
[   34.577307] bttv0: FlyVideo  LR90=yes tda9821/tda9820=no  capture_only=no
[   34.577309] bttv0: using tuner=5
[   34.577313] bttv0: i2c: checking for MSP34xx @ 0x80... not found
[   34.578081] bttv0: i2c: checking for TDA9875 @ 0xb0... found
[   34.768938] tda9875: no such chip at 0xb0 (dic=0x7 rev=0x7)
[   34.768944] i2c_adapter i2c-0: Client creation failed at 0x58 (1)
[   34.769281] bttv0: i2c: checking for TDA7432 @ 0x8a... not found
[   34.770053] bttv0: i2c: checking for TDA9887 @ 0x86... not found
[   35.026056] tuner 0-0061: chip found @ 0xc2 (bt878 #0 [sw])
[   35.026595] tuner 0-0061: type set to 5 (Philips PAL_BG (FI1216 and compatibles))
[   35.026600] tuner 0-0061: type set to 5 (Philips PAL_BG (FI1216 and compatibles))
[   35.038039] bttv0: registered device video0
[   35.038106] bttv0: registered device vbi0
[   35.038174] bttv0: registered device radio0
[   35.038195] bttv0: PLL: 28636363 => 35468950 .. ok
[   36.796913] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
[   36.797477] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[   36.798246] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[   36.800910] 00:0c: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[   36.801270] 00:0d: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[   38.859952] EXT3 FS on sda5, internal journal
[   39.691474] kjournald starting.  Commit interval 5 seconds
[   39.691850] EXT3 FS on sda6, internal journal
[   39.691859] EXT3-fs: mounted filesystem with ordered data mode.
[   39.711032] kjournald starting.  Commit interval 5 seconds
[   39.711382] EXT3 FS on sda8, internal journal
[   39.711391] EXT3-fs: mounted filesystem with ordered data mode.
[   39.728005] kjournald starting.  Commit interval 5 seconds
[   39.728414] EXT3 FS on sda10, internal journal
[   39.728422] EXT3-fs: mounted filesystem with ordered data mode.
[   39.759834] kjournald starting.  Commit interval 5 seconds
[   39.760067] EXT3 FS on sda7, internal journal
[   39.760075] EXT3-fs: mounted filesystem with ordered data mode.
[   39.880675] Adding 1020112k swap on /dev/sda2.  Priority:-1 extents:1 across:1020112k
[   51.948467] NET: Registered protocol family 17
[   55.425892] Time: acpi_pm clocksource has been installed.
[  396.368541] NETDEV WATCHDOG: eth0: transmit timed out
[  396.368551] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=247.
[  397.167874] NETDEV WATCHDOG: eth0: transmit timed out
[  397.167884] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=384.
[  398.167027] NETDEV WATCHDOG: eth0: transmit timed out
[  398.167037] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=500.
[  399.947117] NETDEV WATCHDOG: eth0: transmit timed out
[  399.947124] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=732.
[  402.403992] NETDEV WATCHDOG: eth0: transmit timed out
[  402.404002] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=370.
[  403.403148] NETDEV WATCHDOG: eth0: transmit timed out
[  403.403158] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=370.
[  403.971763] NETDEV WATCHDOG: eth0: transmit timed out
[  403.971770] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=370.
[  408.108310] NETDEV WATCHDOG: eth0: transmit timed out
[  408.108317] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=324.
[  412.299736] NETDEV WATCHDOG: eth0: transmit timed out
[  412.299746] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=247.
[  420.331554] NETDEV WATCHDOG: eth0: transmit timed out
[  420.331564] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=980.
[  424.349861] NETDEV WATCHDOG: eth0: transmit timed out
[  424.349868] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=980.
[  425.315492] NETDEV WATCHDOG: eth0: transmit timed out
[  425.315502] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=980.
[  426.314656] NETDEV WATCHDOG: eth0: transmit timed out
[  426.314665] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=980.
[  440.362038] NETDEV WATCHDOG: eth0: transmit timed out
[  440.362045] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=691.
[  440.861616] NETDEV WATCHDOG: eth0: transmit timed out
[  440.861624] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=691.
[  441.361203] NETDEV WATCHDOG: eth0: transmit timed out
[  441.361210] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=691.


PS: I don't think udev is swapping the cards, because they are pinned to
their Ethernet addresses:
# PCI device 0x11f6:0x1401 (ne2k-pci)
SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:80:48:de:5e:89", NAME="eth0"

# PCI device 0x11ab:0x4320 (skge)
SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:11:d8:60:74:55", NAME="eth1"

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-22  8:56       ` Marcin Ślusarz
@ 2007-06-22 13:32         ` Jarek Poplawski
       [not found]           ` <4bacf17f0706252310w155fc4d7v1bf12319a650559a@mail.gmail.com>
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-22 13:32 UTC (permalink / raw)
  To: Marcin Ślusarz; +Cc: Stephen Hemminger, linux-kernel, linux-net, netdev

On Fri, Jun 22, 2007 at 10:56:44AM +0200, Marcin Ślusarz wrote:
...
> When I disable on-board network card in BIOS (controlled by skge)
> ne2k-pci card is still locking up. So I think it's strictly ne2k-pci
> card bug. I made some tests and I know how to reproduce it fast (on my
> machine) - just make some heavy network traffic...
...

I'm no good at hardware, but I guess this log may not be enough.
So, unless somebody finds something more sensible, maybe you can try
some of these suggestions:

- you've written it was OK with 2.6.20; it would be interesting
to check if there were any changes in config (besides new options)
or even retry 2.6.20 with the "current" config after make oldconfig;
- with problems like this it's better to turn off as many
unnecessary options/drivers as possible to find out if it's really
about the network driver; e.g.: no SMP, tv cards, acpi - only
the basics, without options etc.;
- if possible try it with a newer kernel, e.g. 2.6.22-rc5;
- if possible try it with another, fresh distro (e.g. some live
CD/DVD/USB bootable);
- there was a lockdep warning from tvtime/bttv;
- try to get some more debugging output (help: modinfo ne2k-pci).

Regards,
Jarek P.

PS: for anybody interested - here is the beginning of this story:
http://marc.info/?l=linux-kernel&m=118202978609968&w=2

* Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]           ` <4bacf17f0706252310w155fc4d7v1bf12319a650559a@mail.gmail.com>
@ 2007-06-26  8:08             ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-26  8:08 UTC (permalink / raw)
  To: Marcin Ślusarz; +Cc: Stephen Hemminger, linux-kernel, linux-net, netdev

On Tue, Jun 26, 2007 at 08:10:17AM +0200, Marcin Ślusarz wrote:
... 
> I reproduced it on minimal config:
...

Hm... This method is useful if you can find a minimal config
with which the bug cannot be reproduced. Then you can add more
options until the bug is back. Of course, this takes time...
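
The "add options until the bug is back" search can be shortened by
halving instead of adding one option at a time. The sketch below is only
an illustration under two assumptions that may not hold in practice: the
bug is absent with zero extra options, and once it appears it stays
present as further options are added. first_bad_option() and
example_reproduces() are hypothetical names; the latter stands in for
"build, boot and stress-test a kernel with the first k options enabled".

```c
#include <assert.h>
#include <stddef.h>

/* Callback: does the bug reproduce with the first `enabled` options on? */
typedef int (*reproduces_fn)(size_t enabled, void *ctx);

/*
 * Returns the 1-based position of the option whose addition first brings
 * the bug back, or 0 if the bug never appears even with all options on.
 */
static size_t first_bad_option(size_t n_options, reproduces_fn reproduces,
			       void *ctx)
{
	size_t lo = 0, hi = n_options;

	assert(reproduces != NULL);
	if (!reproduces(n_options, ctx))
		return 0;

	/* Invariant: bug absent with `lo` options, present with `hi`. */
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (reproduces(mid, ctx))
			hi = mid;
		else
			lo = mid;
	}
	return hi;
}

/* Stand-in test: pretend the option at position *ctx is the culprit. */
static int example_reproduces(size_t enabled, void *ctx)
{
	size_t culprit = *(size_t *)ctx;

	return enabled >= culprit;
}
```

With 8 candidate options this needs about 4 test boots instead of up to
8, which matters when each test means waiting for a random hang.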

We know your hardware should be OK, since it was fine with 2.6.20.
We don't know how much your configs (kernel & apps) have changed.
Sometimes a kernel change requires some apps to be recompiled too.
That's why it could be useful to try 2.6.21 from a live distro to
find out if it's really the kernel's fault.

And, alas, this log doesn't seem to tell us anything new...

Regards,
Jarek P.

* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-06-26 14:24 Jean-Baptiste Vignaud
  2007-06-27 10:17 ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-06-26 14:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: jarkao2, marcin.slusarz, shemminger, linux-net, netdev

Hello, I have a very similar problem with 2.6.21 as well:

2 3com NICs, and they are failing randomly.

The kernel is a basic Fedora 7 kernel (2.6.21-1.3228.fc7).
I found a bug report and added details here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243960

I'm not subscribed to this list, so please cc me if there are any questions.

JB

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-26 14:24 Jean-Baptiste Vignaud
@ 2007-06-27 10:17 ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-27 10:17 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: linux-kernel, marcin.slusarz, shemminger, linux-net, netdev

On Tue, Jun 26, 2007 at 04:24:07PM +0200, Jean-Baptiste Vignaud wrote:
> Hello, I have a very similar problem with 2.6.21 as well:
> 
> 2 3com NICs, and they are failing randomly.
> 
> The kernel is a basic Fedora 7 kernel (2.6.21-1.3228.fc7).
> I found a bug report and added details here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243960
> 
> I'm not subscribed to this list, so please cc me if there are any questions.
> 
> JB
> 
> > On Tue, Jun 26, 2007 at 08:10:17AM +0200, Marcin Ślusarz wrote:
> > ...
> > > I reproduced it on minimal config:
...
> > We know your hardware should be OK - since it was fine with 2.6.20.
...

It looks like there is something common in the air...

Marcin: ne2k_pci with 8390, Jean: 3com, and now I see
similar problem with 8139cp too (plus some ideas):

http://marc.info/?l=linux-netdev&m=118293314109648&w=2

So, you probably should wait a little & look for new patches here.

Cheers,
Jarek P.

* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-06-29  8:50 Jean-Baptiste Vignaud
  2007-06-29 15:07 ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-06-29  8:50 UTC (permalink / raw)
  To: jarkao2; +Cc: linux-kernel, marcin.slusarz, shemminger, linux-net, netdev

Update...
I did 2 tests:

1) Booted with option acpi=off
It booted correctly, and I managed to put some load on one of the cards; after a while (10 minutes, I guess) the timeout occurred. As a side effect, at the same moment the SATA controllers somehow lost control of the disks and the RAID 5 array on the system crashed hard. I have no traces, as I was unable to rebuild it (and I tried a lot of extreme voodoo methods).

2) Changed the 3com cards
I replaced them with two other cards:
01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)

I reinstalled and stressed the network (a small download from a laptop), and:

Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 29 09:34:51 loki last message repeated 14 times
Jun 29 09:35:18 loki last message repeated 8 times

So it seems to be a more generic problem.

(I've updated the Fedora bugzilla as well.)

I have not tested the "[PATCH] 8139cp dev->tx_timeout" yet.

JB


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-29  8:50 Jean-Baptiste Vignaud
@ 2007-06-29 15:07 ` Jarek Poplawski
  2007-07-23  5:44   ` Marcin Ślusarz
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-06-29 15:07 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: linux-kernel, marcin.slusarz, shemminger, linux-net, netdev

On Fri, Jun 29, 2007 at 10:50:20AM +0200, Jean-Baptiste Vignaud wrote:
> Update...
> I did 2 tests:
> 
> 1) Booted with option acpi=off
> It booted correctly, and I managed to put some load on one of the cards;
> after a while (10 minutes, I guess) the timeout occurred. As a side effect,
> at the same moment the SATA controllers somehow lost control of the disks
> and the RAID 5 array on the system crashed hard. I have no traces, as I
> was unable to rebuild it (and I tried a lot of extreme voodoo methods).

I think the main option, acpi=on, is usually needed.

If you guys are not exhausted yet, I think you could try to
turn off (or change to something else) most of the options from
"Processor type and features", and maybe something below PCI
support. But there are many new options which can't be turned
off so easily, so there is not much hope...

> 
> 2) Changed the 3com cards
> I replaced them with two other cards:
> 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
> 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
> 
> I reinstalled and stressed the network (a small download from a laptop), and:
> 
> Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Jun 29 09:34:51 loki last message repeated 14 times
> Jun 29 09:35:18 loki last message repeated 8 times
> 
> So it seems to be a more generic problem.

I wonder if you tried to change the place - I've read this
advice many times. And maybe it would be better to try with
one card at first?

It seems there are some patches for dev->tx_timeout, but they
look like they only treat the symptoms. Let's wait...

Cheers,
Jarek P.

PS: Marcin - your last message wasn't plain text, so it was probably
dropped by the kernel lists.

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-29 15:07 ` Jarek Poplawski
@ 2007-07-23  5:44   ` Marcin Ślusarz
  2007-07-23  8:53     ` Jarek Poplawski
                       ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Marcin Ślusarz @ 2007-07-23  5:44 UTC (permalink / raw)
  To: Jarek Poplawski, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Linus Torvalds

Ok, I've bisected this problem and found that this patch broke my NIC:

76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
commit 76d2160147f43f982dfe881404cfde9fd0a9da21
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Feb 16 01:28:24 2007 -0800

    [PATCH] genirq: do not mask interrupts by default

    Never mask interrupts immediately upon request.  Disabling interrupts in
    high-performance codepaths is rare, and on the other hand this change could
    recover lost edges (or even other types of lost interrupts) by conservatively
    only masking interrupts after they happen.  (NOTE: with this change the
    highlevel irq-disable code still soft-disables this IRQ line - and if such an
    interrupt happens then the IRQ flow handler keeps the IRQ masked.)

    Mark i8529A controllers as 'never loses an edge'.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=76d2160147f43f982dfe881404cfde9fd0a9da21

After reverting it on top of 2.6.21.3 (with
d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2 - [PATCH] genirq: remove
IRQ_DISABLED (which meant "remove IRQ_DELAYED_DISABLE")), the problem
didn't show up :)
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2)

So I cooked up the patch below and everything is working fine (so far):

Fix default_disable interrupt function (broken by [PATCH] genirq: do
not mask interrupts by default) - revert removal of codepath which was
invoked when removed flag (IRQ_DELAYED_DISABLE) was NOT set

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
---
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 76a9106..0bb23cd 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
  */
 static void default_disable(unsigned int irq)
 {
+	struct irq_desc *desc = irq_desc + irq;
+	desc->chip->mask(irq);
 }

 /*

(Sorry for whitespace damage, but I have to send it from webmail :|)
(I'm a kernel noob, so don't kill me if my patch is wrong ;)
ps: Here is the beginning of this thread: http://lkml.org/lkml/2007/6/16/182


Regards,
Marcin Slusarz
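
For readers unfamiliar with genirq, the behaviour change described in
the bisected commit can be sketched in user space. This is a toy model,
not kernel code: struct model_irq, model_disable_eager(),
model_disable_lazy() and model_raise() are hypothetical names invented
for illustration, and the model assumes only what the commit message
states (the line is soft-disabled first and masked only once an
interrupt actually arrives).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one interrupt line; field names are illustrative only. */
struct model_irq {
	bool soft_disabled;	/* models IRQ_DISABLED in desc->status */
	bool hw_masked;		/* line masked at the interrupt controller */
	unsigned int reached_cpu; /* times the line interrupted the CPU */
	unsigned int handled;	/* times the device handler would have run */
};

/* Eager disable (what the patch above restores): mask right away. */
static void model_disable_eager(struct model_irq *d)
{
	assert(d != NULL);
	d->soft_disabled = true;
	d->hw_masked = true;
}

/* Lazy disable (after the bisected commit): only mark as disabled. */
static void model_disable_lazy(struct model_irq *d)
{
	assert(d != NULL);
	d->soft_disabled = true;
}

/* The device asserts its line. Returns true if the handler would run. */
static bool model_raise(struct model_irq *d)
{
	assert(d != NULL);
	if (d->hw_masked)
		return false;	/* held back at the controller */
	d->reached_cpu++;
	if (d->soft_disabled) {
		/* flow handler notices the soft-disable and masks now */
		d->hw_masked = true;
		return false;
	}
	d->handled++;
	return true;
}
```

In this model the lazy variant lets exactly one interrupt reach the CPU
after the "disable" before the line is really masked; the eager variant
lets none through. Whether that extra interrupt is what confuses the
8390-based cards is not established by the sketch.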

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-23  5:44   ` Marcin Ślusarz
@ 2007-07-23  8:53     ` Jarek Poplawski
  2007-07-24  7:18     ` Jarek Poplawski
  2007-07-24  8:05     ` Ingo Molnar
  2 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-23  8:53 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Linus Torvalds

On Mon, Jul 23, 2007 at 07:44:58AM +0200, Marcin Ślusarz wrote:
> Ok, I've bisected this problem and found that this patch broke my NIC:

Congratulations!

> 
> 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
> commit 76d2160147f43f982dfe881404cfde9fd0a9da21
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Feb 16 01:28:24 2007 -0800
> 
>    [PATCH] genirq: do not mask interrupts by default
...
> So I cooked up the patch below and everything is working fine (so far):
> 
> Fix default_disable interrupt function (broken by [PATCH] genirq: do
> not mask interrupts by default) - revert removal of codepath which was
> invoked when removed flag (IRQ_DELAYED_DISABLE) was NOT set
> 
> Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
> ---
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 76a9106..0bb23cd 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
>  */
> static void default_disable(unsigned int irq)
> {
> +	struct irq_desc *desc = irq_desc + irq;
> +	desc->chip->mask(irq);
> }
> 
> /*

I think your patch points very well at the source of the problem
and would help many people, but it looks too arbitrary for
those who didn't have such problems. It seems it was mainly with
x86_64, so maybe something like the patch below would be enough?

Cheers,
Jarek P.

PS: not tested!

---

diff -Nurp 2.6.22-/arch/x86_64/kernel/io_apic.c 2.6.22/arch/x86_64/kernel/io_apic.c
--- 2.6.22-/arch/x86_64/kernel/io_apic.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22/arch/x86_64/kernel/io_apic.c	2007-07-23 10:33:05.000000000 +0200
@@ -1427,6 +1427,7 @@ static struct irq_chip ioapic_chip __rea
 	.name 		= "IO-APIC",
 	.startup 	= startup_ioapic_irq,
 	.mask	 	= mask_IO_APIC_irq,
+	.disable	= mask_IO_APIC_irq,
 	.unmask	 	= unmask_IO_APIC_irq,
 	.ack 		= ack_apic_edge,
 	.eoi 		= ack_apic_level,

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-23  5:44   ` Marcin Ślusarz
  2007-07-23  8:53     ` Jarek Poplawski
@ 2007-07-24  7:18     ` Jarek Poplawski
  2007-07-24  8:05     ` Ingo Molnar
  2 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-24  7:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcin Ślusarz, Jean-Baptiste Vignaud, linux-kernel,
	shemminger, linux-net, netdev, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, Linus Torvalds

On Mon, Jul 23, 2007 at 07:44:58AM +0200, Marcin Ślusarz wrote:
> Ok, I've bisected this problem and found that this patch broke my NIC:
> 
> 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
> commit 76d2160147f43f982dfe881404cfde9fd0a9da21
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Feb 16 01:28:24 2007 -0800
> 
>    [PATCH] genirq: do not mask interrupts by default
> 
>    Never mask interrupts immediately upon request.  Disabling interrupts in
>    high-performance codepaths is rare, and on the other hand this change could
>    recover lost edges (or even other types of lost interrupts) by conservatively
>    only masking interrupts after they happen.  (NOTE: with this change the
>    highlevel irq-disable code still soft-disables this IRQ line - and if such an
>    interrupt happens then the IRQ flow handler keeps the IRQ masked.)
> 
>    Mark i8529A controllers as 'never loses an edge'.
> 
>    Signed-off-by: Ingo Molnar <mingo@elte.hu>
>    Cc: Thomas Gleixner <tglx@linutronix.de>
>    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

So, it seems nobody (except the users) cares...

BTW, maybe somebody should create something like a "Network Cards
Producers Made Rich on Unnecessarily Changed Cards Linux Foundation"?:

On Fri, Jun 29, 2007 at 10:50:20AM +0200, Jean-Baptiste Vignaud wrote:
...
> 2) changed the 3com cards
> i replaced by two cards,
> 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
> 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
> 
> reinstalled and stressed the network (small download from a laptop) and :
> 
> Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Jun 29 09:34:51 loki last message repeated 14 times
> Jun 29 09:35:18 loki last message repeated 8 times

...Of course, no response from any "serious" developer to this either.

BTW #2: I wonder how true this is (after the above-mentioned patch):

From include/linux/irq.h:
> /**
>  * struct irq_chip - hardware interrupt chip descriptor
...
>  * @disable:		disable the interrupt (defaults to chip->mask if NULL)
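
The question can be made concrete with a small user-space model. It
assumes, as the patch context earlier in this thread suggests, that a
chip without a .disable callback ends up with an empty default disable
function rather than with its .mask callback. All names here
(struct model_chip, model_set_defaults() and so on) are hypothetical
stand-ins, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct irq_chip; only the fields discussed. */
struct model_chip {
	const char *name;
	void (*mask)(unsigned int irq);
	void (*disable)(unsigned int irq);
};

static unsigned int masked_count;	/* how often the line got masked */

static void model_mask(unsigned int irq)
{
	(void)irq;
	masked_count++;
}

/* Models the empty default_disable() shown in the patch context. */
static void model_default_disable(unsigned int irq)
{
	(void)irq;
}

/* Assumed default filling: a NULL .disable becomes the empty default. */
static void model_set_defaults(struct model_chip *chip)
{
	assert(chip != NULL);
	if (!chip->disable)
		chip->disable = model_default_disable;
}
```

Under that assumption a chip that leaves .disable unset never masks on
disable, so the comment would no longer match the behaviour, while a
chip that sets .disable to its mask function (as in the untested
io_apic.c idea) still masks immediately.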

Regards,
Jarek P.

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-23  5:44   ` Marcin Ślusarz
  2007-07-23  8:53     ` Jarek Poplawski
  2007-07-24  7:18     ` Jarek Poplawski
@ 2007-07-24  8:05     ` Ingo Molnar
  2007-07-24  9:42       ` Ingo Molnar
  2 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-24  8:05 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Thomas Gleixner, Andrew Morton, Linus Torvalds


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> Ok, I've bisected this problem and found that this patch broke my NIC:
> 
> 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
> commit 76d2160147f43f982dfe881404cfde9fd0a9da21
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Feb 16 01:28:24 2007 -0800
> 
>    [PATCH] genirq: do not mask interrupts by default

thanks for tracking it down! Could you try the patch below (on top of an 
otherwise unmodified kernel)? This tests the theory that the problem 
is related to the disable_irq_nosync() call in the ne2k driver's xmit 
path. Does this solve the hangs too?

	Ingo

Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -102,7 +102,19 @@ void disable_irq_nosync(unsigned int irq
 	spin_lock_irqsave(&desc->lock, flags);
 	if (!desc->depth++) {
 		desc->status |= IRQ_DISABLED;
-		desc->chip->disable(irq);
+		/*
+		 * the _nosync variant of irq-disable suggests that the
+		 * caller is not worried about concurrency but about the
+		 * ordering of the irq flow itself. (such as hardware
+		 * getting confused about certain, normally valid irq
+		 * handling sequences.) So if the default disable handler
+		 * is in place then try the more conservative masking
+		 * instead:
+		 */
+		if (desc->chip->disable == default_disable && desc->chip->mask)
+			desc->chip->mask(irq);
+		else
+			desc->chip->disable(irq);
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
 }

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24  8:05     ` Ingo Molnar
@ 2007-07-24  9:42       ` Ingo Molnar
  2007-07-24 19:30         ` Linus Torvalds
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-24  9:42 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Thomas Gleixner, Andrew Morton, Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> thanks for tracking it down! Could you try the patch below (ontop an 
> otherwise unmodified kernel)? This tests the theory whether the 
> problem is related to the disable_irq_nosync() call in the ne2k 
> driver's xmit path. Does this solve the hangs too?

please try the patch below instead.

	Ingo

Index: linux/kernel/irq/chip.c
===================================================================
--- linux.orig/kernel/irq/chip.c
+++ linux/kernel/irq/chip.c
@@ -231,7 +231,7 @@ static void default_enable(unsigned int 
 /*
  * default disable function
  */
-static void default_disable(unsigned int irq)
+void default_disable(unsigned int irq)
 {
 }
 
Index: linux/kernel/irq/internals.h
===================================================================
--- linux.orig/kernel/irq/internals.h
+++ linux/kernel/irq/internals.h
@@ -10,6 +10,8 @@ extern void irq_chip_set_defaults(struct
 /* Set default handler: */
 extern void compat_irq_chip_set_default_handler(struct irq_desc *desc);
 
+extern void default_disable(unsigned int irq);
+
 #ifdef CONFIG_PROC_FS
 extern void register_irq_proc(unsigned int irq);
 extern void register_handler_proc(unsigned int irq, struct irqaction *action);
Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -102,7 +102,19 @@ void disable_irq_nosync(unsigned int irq
 	spin_lock_irqsave(&desc->lock, flags);
 	if (!desc->depth++) {
 		desc->status |= IRQ_DISABLED;
-		desc->chip->disable(irq);
+		/*
+		 * the _nosync variant of irq-disable suggests that the
+		 * caller is not worried about concurrency but about the
+		 * ordering of the irq flow itself. (such as hardware
+		 * getting confused about certain, normally valid irq
+		 * handling sequences.) So if the default disable handler
+		 * is in place then try the more conservative masking
+		 * instead:
+		 */
+		if (desc->chip->disable == default_disable && desc->chip->mask)
+			desc->chip->mask(irq);
+		else
+			desc->chip->disable(irq);
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
 }

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24  9:42       ` Ingo Molnar
@ 2007-07-24 19:30         ` Linus Torvalds
  2007-07-24 20:04           ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2007-07-24 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcin Ślusarz, Jarek Poplawski, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Thomas Gleixner,
	Andrew Morton

On Tue, 24 Jul 2007, Ingo Molnar wrote:
> 
> please try the patch below instead.

I'm hoping this is just a "let's see if the behavior changes" patch, not 
something that you think should be applied if it fixes something?

This patch looks like it is trying to paper over (rather than fix) some 
possible bug in the "->disable" logic. Makes sense as a "let's see if it's 
that" kind of thing, but not as a "let's fix it".

Or am I missing something?

		Linus

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24 19:30         ` Linus Torvalds
@ 2007-07-24 20:04           ` Ingo Molnar
  2007-07-25  0:19             ` Thomas Gleixner
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-24 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marcin Ślusarz, Jarek Poplawski, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Thomas Gleixner,
	Andrew Morton

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 24 Jul 2007, Ingo Molnar wrote:
> > 
> > please try the patch below instead.
> 
> I'm hoping this is just a "let's see if the behavior changes" patch, 
> not something that you think should be applied if it fixes something?
> 
> This patch looks like it is trying to paper over (rather than fix) 
> some possible bug in the "->disable" logic. Makes sense as a "let's 
> see if it's that" kind of thing, but not as a "let's fix it".
> 
> Or am I missing something?

yeah - it's a totally bad and unacceptable hack (i realized how bad it 
was when i wrote up that comment section ...), i just wanted to see 
which portion of ne2k/lib8390.c is sensitive to whether an irq 
line is masked or not. The patch has no SOB line either.

the current best fix forward is to undo my original change, unless we 
find a better fix for this problem. (Note that the other patches posted 
in this thread are broken too: they only mask the irq but don't reliably 
unmask it.)

here's the current method of handling irqs for Marcin's card:

17:         12   IO-APIC-fasteoi   eth1, eth0

and fasteoi is a really simple sequence: no masking/unmasking by the 
flow handler itself but a NOP at entry and an APIC-EOI at the end. The 
disable/enable irq thing should thus have minimal effect if done within 
an irq handler.

now ne2k does something uncommon: for xmit (which is normally done 
outside of irq handlers) it will disable_irq_nosync()/enable_irq() the 
interrupt. It does it to exclude the handler from _that_ CPU, but due to 
the _nosync it does not exclude it from any other CPUs. So that's a bit 
weird already.

just in case, i've just re-checked all the genirq bits that change 
IRQ_DISABLED (that bit accidentally clear would be the only way to truly 
allow an IRQ handler to interrupt the disable_irq_nosync() critical 
section on that CPU) - but i can see no way for that to happen: we 
unconditionally detect and report unbalanced and underflowing 
irq_desc->depth, and the only other place (besides enable_irq()) that 
clears IRQ_DISABLED is __set_irq_handler(), and on x86 that is only used 
during bootup.

Marcin, could you try the patch below too? [without having any other 
patch applied.] It basically turns the critical section into an irqs-off 
critical section and thus checks whether your problem is related to that 
particular area of code.

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -297,9 +297,7 @@ static int ei_start_xmit(struct sk_buff 
 	 *	Slow phase with lock held.
 	 */

-	disable_irq_nosync_lockdep_irqsave(dev->irq, &flags);
-
-	spin_lock(&ei_local->page_lock);
+	spin_lock_irqsave(&ei_local->page_lock, flags);

 	ei_local->irqlock = 1;

@@ -376,8 +374,7 @@ static int ei_start_xmit(struct sk_buff 
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);

-	spin_unlock(&ei_local->page_lock);
-	enable_irq_lockdep_irqrestore(dev->irq, &flags);
+	spin_unlock_irqrestore(&ei_local->page_lock, flags);

 	dev_kfree_skb (skb);
 	ei_local->stat.tx_bytes += send_length;

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24 20:04           ` Ingo Molnar
@ 2007-07-25  0:19             ` Thomas Gleixner
  2007-07-25  7:23               ` Jarek Poplawski
                                 ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Thomas Gleixner @ 2007-07-25  0:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Marcin Ślusarz, Jarek Poplawski,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

On Tue, 2007-07-24 at 22:04 +0200, Ingo Molnar wrote:
> Marcin, could you try the patch below too? [without having any other 
> patch applied.] It basically turns the critical section into an irqs-off 
> critical section and thus checks whether your problem is related to that 
> particular area of code.
> 

I read back on this thread and I think the problem is somewhere else:

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.
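
The delayed-disable behaviour described above can be sketched as a small
user-space model. This is only an illustration under stated assumptions:
struct model_irq and the model_* functions are hypothetical names, not the
genirq API, and locking and flow-handler details are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one irq line; not kernel code. */
struct model_irq {
	unsigned depth;     /* nesting count of disable calls */
	bool sw_disabled;   /* software "disabled" flag */
	bool hw_masked;     /* line masked at the interrupt controller */
	bool pending;       /* an interrupt arrived while disabled */
	unsigned handled;   /* how often the handler ran */
	unsigned resent;    /* how often enable had to replay an irq */
};

/* Delayed disable: only the software flag is set, hardware is untouched. */
static void model_disable(struct model_irq *d)
{
	if (d->depth++ == 0)
		d->sw_disabled = true;
}

/* The device raises the line. */
static void model_hw_interrupt(struct model_irq *d)
{
	if (d->hw_masked)
		return;                 /* controller holds it back */
	if (d->sw_disabled) {
		d->pending = true;      /* remember it for later */
		d->hw_masked = true;    /* mask now, after it occurred */
		return;
	}
	d->handled++;
}

/* Enable: unmask and replay an interrupt that was caught while disabled. */
static void model_enable(struct model_irq *d)
{
	assert(d->depth > 0);           /* unbalanced enable */
	if (--d->depth)
		return;                 /* still nested */
	d->sw_disabled = false;
	d->hw_masked = false;
	if (d->pending) {
		d->pending = false;
		d->resent++;
		d->handled++;           /* the resend runs the handler */
	}
}
```

In this model a disable/enable pair with no interrupt in between never
touches the hardware mask, which is the point of the delayed variant.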

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

	17:         12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question lib8390.c does

	disable_irq();
	fiddle_with_the_network_card_hardware()
	enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again.

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq when an
interrupt happens while we are in the disabled region.

No idea how this affects the network card, as the code there must be
able to handle interrupts, which are not originated from the card due to
interrupt sharing.

Marcin, can you please try the patch below ? It's just a debugging aid
to gather some more data about that problem.

If the patch fixes the problem, then we should try to disable the resend
mechanism for non-edge type irq lines on the irq_chip level (i.e. the
IOAPIC code).

Thanks,

	tglx

--- linux-2.6.orig/kernel/irq/resend.c
+++ linux-2.6/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *desc, unsigned int irq)
 	 */
 	desc->chip->enable(irq);

+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		printk(KERN_DEBUG "Skip resend for irq %u\n", irq);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25  0:19             ` Thomas Gleixner
@ 2007-07-25  7:23               ` Jarek Poplawski
  2007-07-25 13:57               ` Jarek Poplawski
  2007-07-26  7:16               ` Marcin Ślusarz
  2 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-25  7:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Linus Torvalds, Marcin Ślusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

On Wed, Jul 25, 2007 at 02:19:31AM +0200, Thomas Gleixner wrote:
> On Tue, 2007-07-24 at 22:04 +0200, Ingo Molnar wrote:
> > Marcin, could you try the patch below too? [without having any other 
> > patch applied.] It basically turns the critical section into an irqs-off 
> > critical section and thus checks whether your problem is related to that 
> > particular area of code.
> > 
> 
> I read back on this thread and I think the problem is somewhere else:

So do I. Of course, I certainly miss most of the details, but I can't
imagine how Ingo's patch from yesterday couldn't work - unless
Marcin's test wasn't long enough...

IMHO, the main problem is that such delicate things shouldn't be
changed this way. If current ideas work for Marcin they will probably
break other boxes. Very similar symptoms were reported before Ingo's
patch too, so it looks like this place is very fragile. If such
things could happen:

(from: arch/i386/kernel/io_apic.c)
> static void ack_ioapic_quirk_irq(unsigned int irq)
> ...
> /*
>  * It appears there is an erratum which affects at least version 0x11
>  * of I/O APIC (that's the 82093AA and cores integrated into various
>  * chipsets).  Under certain conditions a level-triggered interrupt is
>  * erroneously delivered as edge-triggered one but the respective IRR
>  * bit gets set nevertheless.  As a result the I/O unit expects an EOI
>  * message but it will never arrive and further interrupts are blocked
>  * from the source.  The exact reason is so far unknown, but the
>  * phenomenon was observed when two consecutive interrupt requests
>  * from a given source get delivered to the same CPU and the source is
>  * temporarily disabled in between.
...

there is no reason to think this is all.

I can also see this comment in arch/x86_64/kernel/io_apic.c:

> static void setup_IO_APIC_irq(int apic, int pin, unsigned int irq,
>                               int trigger, int polarity)
...
>        /* Mask level triggered irqs.
>         * Use IRQ_DELAYED_DISABLE for edge triggered irqs.
>         */

It seems somebody has seen a difference, probably after testing,
but it wasn't respected.

I also presume ne2k/lib8390.c solution could be a result of "real
life", and I don't think Marcin's tests can be enough here. 

So, my point is that such places first of all need some documented
knobs in config or elsewhere, which make it possible for users to
easily go back to the previous method (i.e. from 2.6.21 to 2.6.20 here).

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25  0:19             ` Thomas Gleixner
  2007-07-25  7:23               ` Jarek Poplawski
@ 2007-07-25 13:57               ` Jarek Poplawski
  2007-07-25 14:46                 ` Alan Cox
  2007-07-26  7:16               ` Marcin Ślusarz
  2 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-25 13:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Linus Torvalds, Marcin Ślusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

On Wed, Jul 25, 2007 at 02:19:31AM +0200, Thomas Gleixner wrote:
...
> Looking into the IO_APIC code, the resend via send_IPI_self() happens
> unconditionally. So the resend is done for level and edge interrupts.
> This makes the problem more mysterious.
> 
> The code in question lib8390.c does
> 
> 	disable_irq();
> 	fiddle_with_the_network_card_hardware()
> 	enable_irq();
...
> 
> No idea how this affects the network card, as the code there must be
> able to handle interrupts, which are not originated from the card due to
> interrupt sharing.

I think Ingo could still be right with his last patch from yesterday!
The comment at the beginning indicates this is done that way because
of the chip's slowness. And problems with timing are mysterious.

On the other hand, the author of this code didn't use spin_lock_irqsave
for some reason, probably after testing this option too. So, I hope
this is the right path, but alas, I'm not sure this patch
proves it 100%.

Anyway, in my opinion this situation where interrupts could/have_to
be used for such strange things should confirm the need of more
options for handling irqs individually.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25 13:57               ` Jarek Poplawski
@ 2007-07-25 14:46                 ` Alan Cox
  2007-07-30  8:46                   ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: Alan Cox @ 2007-07-25 14:46 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, Marcin Ślusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

> > The code in question lib8390.c does
> > 
> > 	disable_irq();
> > 	fiddle_with_the_network_card_hardware()
> > 	enable_irq();
> ...
> > 
> > No idea how this affects the network card, as the code there must be
> > able to handle interrupts, which are not originated from the card due to
> > interrupt sharing.
> 
> I think, in this last yesterday's patch Ingo could be right, yet!
> The comment at the beginnig points this is done like that because
> of chip's slowness. And problems with timing are mysterious.
> 
> On the other hand author of this code didn't use spin_lock_irqsave
> for some reason, probably after testing this option too. So, I hope
> this is the right path, but alas, I'm not sure this patch has to
> prove this 100%.

The author (me) didn't use spin_lock_irqsave because the slowness of the
card means that approach caused horrible problems like losing serial data
at 38400 baud on some chips. Remember many 8390 nics on PCI were ISA
chips with FPGA front ends.

> Anyway, in my opinion this situation where interrupts could/have_to
> be used for such strange things should confirm the need of more
> options for handling irqs individually.

Ok the logic behind the 8390 is very simple:

Things to know
	- IRQ delivery is asynchronous to the PCI bus
	- Blocking the local CPU IRQ via spin locks was too slow
	- The chip has register windows needing locking work

So the path was once (I say once as people appear to have changed it
in the meantime and it now looks rather bogus if the changes to use
disable_irq_nosync_irqsave are disabling the local IRQ)


	Take the page lock
	Mask the IRQ on chip
	Disable the IRQ (but not mask locally- someone seems to have
		broken this with the lock validator stuff)
		[This must be _nosync as the page lock may otherwise
			deadlock us]
	Drop the page lock and turn IRQs back on
	
	At this point an existing IRQ may still be running but we can't
	get a new one

	Take the lock (so we know the IRQ has terminated) but don't mask
	the IRQs on the processor
	Set irqlock [for debug]

	Transmit (slow as ****)

	re-enable the IRQ


We have to use disable_irq because otherwise you will get delayed
interrupts on the APIC bus deadlocking the transmit path.

Quite hairy but the chip simply wasn't designed for SMP and you can't
even ACK an interrupt without risking corrupting other parallel
activities on the chip.
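
The ordering listed above can be sketched as a user-space model that only
tracks state. It is an illustration, not driver code: struct nic_model and
the model_* helpers are hypothetical names, the steps follow the order given
in the list, and real locks and hardware accesses are replaced by flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the xmit-path ordering; not driver code. */
struct nic_model {
	bool page_lock;        /* stands in for the driver's page lock */
	bool cpu_irqs_off;     /* local CPU interrupts masked */
	bool chip_irq_masked;  /* interrupts masked on the 8390 itself */
	unsigned irq_depth;    /* nesting of disable_irq-style calls */
	bool irqlock;          /* debug flag set around the transmit */
	bool tx_with_cpu_irqs_on;    /* recorded during the slow transmit */
	bool tx_with_line_disabled;  /* recorded during the slow transmit */
};

static void model_lock(struct nic_model *n)
{
	assert(!n->page_lock);  /* a second taker would deadlock */
	n->page_lock = true;
}

static void model_unlock(struct nic_model *n)
{
	assert(n->page_lock);
	n->page_lock = false;
}

static void model_start_xmit(struct nic_model *n)
{
	/* Short phase: lock with local irqs off, quiesce chip and line. */
	n->cpu_irqs_off = true;
	model_lock(n);
	n->chip_irq_masked = true;
	n->irq_depth++;         /* _nosync: no waiting for a running handler */
	model_unlock(n);
	n->cpu_irqs_off = false;

	/* Slow phase: lock again, but leave local irqs enabled. */
	model_lock(n);
	n->irqlock = true;
	/* The slow transmit would happen here. */
	n->tx_with_cpu_irqs_on = !n->cpu_irqs_off;
	n->tx_with_line_disabled = n->irq_depth > 0;
	n->irqlock = false;
	n->chip_irq_masked = false;
	model_unlock(n);
	n->irq_depth--;         /* re-enable the line */
}
```

The two recorded flags capture the design goal: the line is quiet during
the slow transmit while other interrupts on that CPU stay enabled.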

Alan

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25  0:19             ` Thomas Gleixner
  2007-07-25  7:23               ` Jarek Poplawski
  2007-07-25 13:57               ` Jarek Poplawski
@ 2007-07-26  7:16               ` Marcin Ślusarz
  2007-07-26  8:13                 ` Jarek Poplawski
  2007-07-26  8:16                 ` Ingo Molnar
  2 siblings, 2 replies; 68+ messages in thread
From: Marcin Ślusarz @ 2007-07-26  7:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Linus Torvalds, Jarek Poplawski,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

2007/7/25, Thomas Gleixner <tglx@linutronix.de>:
> (...)

I've tested Jarek's patch, two of Ingo's patches (the 2nd and 3rd) and
Thomas' patch (one patch at a time of course) - all of them fixed the
problem, but the last one flooded my logs with "Skip resend for irq 17".
All tests were done on 2.6.21.3.

I wanted to test them all on 2.6.22.1, but I didn't have enough time.
I've verified only that 2.6.22.1 has the same problem. I can test it
later, but I can only report results at the beginning of next week.

Regards,
Marcin Slusarz

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:13                 ` Jarek Poplawski
@ 2007-07-26  8:10                   ` Thomas Gleixner
  2007-07-26  8:31                     ` Ingo Molnar
  2007-07-26  9:11                     ` Jarek Poplawski
  2007-07-26  8:19                   ` Jarek Poplawski
  1 sibling, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2007-07-26  8:10 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Marcin Ślusarz, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, 2007-07-26 at 10:13 +0200, Jarek Poplawski wrote:
> > I wanted to test them all on 2.6.22.1, but I didn't have enough time.
> > I've verified only that 2.6.22.1 has the same problem. I can test it
> > later, but I can report results back at beginning of next week.
> 
> 
> So, everything is clear - any changes are good!
> Except the signed-off ones... 
> 
> Thanks Marcin,
> Jarek P.
> 
> PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
> give me some tip, how these re-triggered interrupts are skipped on dev's
> reset before enable_irq?

I think the correct solution is really not to resend level type
interrupts. If the interrupt line is still active, then the interrupt
comes up by itself. I'm cooking a patch for that.
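
Why a resend is redundant for level type lines can be sketched with a small
user-space model. The names (struct line_model, model_enable_deliveries) are
hypothetical and this is not the patch mentioned above; it only counts how
many handler runs enabling the line would cause under each policy.

```c
#include <assert.h>
#include <stdbool.h>

enum trigger { TRIGGER_EDGE, TRIGGER_LEVEL };

/* Hypothetical model of an irq line at the moment it is re-enabled. */
struct line_model {
	enum trigger trig;
	bool asserted;  /* device still drives the line (level type only) */
	bool pending;   /* an interrupt was seen while the line was disabled */
};

/*
 * Number of handler invocations caused by enabling the line.
 * resend_level selects whether level type lines get a resend as well.
 */
static unsigned model_enable_deliveries(const struct line_model *l,
					bool resend_level)
{
	unsigned n = 0;

	assert(l);
	/* A level line that is still asserted fires again by itself. */
	if (l->trig == TRIGGER_LEVEL && l->asserted)
		n++;
	/* An edge is lost unless it is replayed; level is optional. */
	if (l->pending && (l->trig == TRIGGER_EDGE || resend_level))
		n++;
	return n;
}
```

With resend_level set, a level line whose device already cleared the
condition still gets one handler run that no hardware is asking for.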

The other question is: 

Is the driver confused by the resent irq or is the chip-set unhappy
about the resend ?

We could figure the latter out by activating the software based resend
method.

	tglx



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  7:16               ` Marcin Ślusarz
@ 2007-07-26  8:13                 ` Jarek Poplawski
  2007-07-26  8:10                   ` Thomas Gleixner
  2007-07-26  8:19                   ` Jarek Poplawski
  2007-07-26  8:16                 ` Ingo Molnar
  1 sibling, 2 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-26  8:13 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 09:16:10AM +0200, Marcin Ślusarz wrote:
> 2007/7/25, Thomas Gleixner <tglx@linutronix.de>:
> >(...)
> 
> I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas'
> patch (one patch at time of course) - all of them fixed the problem,
> but the last one flooded my logs with "Skip resend for irq 17". All
> tests were done on 2.6.21.3.
> 
> I wanted to test them all on 2.6.22.1, but I didn't have enough time.
> I've verified only that 2.6.22.1 has the same problem. I can test it
> later, but I can report results back at beginning of next week.


So, everything is clear - any changes are good!
Except the signed-off ones... 

Thanks Marcin,
Jarek P.

PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
give me some tip, how these re-triggered interrupts are skipped on dev's
reset before enable_irq?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  7:16               ` Marcin Ślusarz
  2007-07-26  8:13                 ` Jarek Poplawski
@ 2007-07-26  8:16                 ` Ingo Molnar
  1 sibling, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2007-07-26  8:16 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Thomas Gleixner, Linus Torvalds, Jarek Poplawski,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> 2007/7/25, Thomas Gleixner <tglx@linutronix.de>:
> >(...)
> 
> I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas' 
> patch (one patch at time of course) - all of them fixed the problem, 
> but the last one flooded my logs with "Skip resend for irq 17". All 
> tests were done on 2.6.21.3.

that's great! I think we have two good theories about what might be 
going on:

 - the driver might be buggy in that it gets confused by the 'resent' 
   irq.

 - or the chipset/cpu has a bug where it might get confused about the
   resent APIC vector getting mixed up with the same vector coming
   externally too. (Now, it makes little sense to 'resend' a
   level-triggered interrupt on x86 platforms that have flat PIC 
   hierarchies (other architectures might need more than that to
   retrigger an interrupt) - but there's nothing wrong about it in 
   theory and it needs fixing for edge irqs anyway.)

in any case, the problem was triggered by our change generating much 
more resent irqs than before. Nevertheless we'd like to fix that resend 
bug (and if the driver is buggy, the driver bug too). It's really good 
progress so far - we are working on doing the real fix now.

	Ingo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:13                 ` Jarek Poplawski
  2007-07-26  8:10                   ` Thomas Gleixner
@ 2007-07-26  8:19                   ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-26  8:19 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 10:13:26AM +0200, Jarek Poplawski wrote:
...
> So, everything is clear - any changes are good!
> Except the signed-off ones... 

Oops! Marcin's patch was both signed-off and good.
So, there is probably something more...

Sorry Marcin,
Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:10                   ` Thomas Gleixner
@ 2007-07-26  8:31                     ` Ingo Molnar
  2007-07-26  8:55                       ` Jarek Poplawski
  2007-07-26  9:11                     ` Jarek Poplawski
  1 sibling, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-26  8:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jarek Poplawski, Marcin Ślusarz, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Thomas Gleixner <tglx@linutronix.de> wrote:

> The other question is:
> 
> Is the driver confused by the resent irq or is the chip-set unhappy 
> about the resend ?
> 
> We could figure the latter out by activating the software based resend 
> method.

yeah. The patch below enables sw-resend on x86, to test the theory 
whether the APIC-driven hardware-vector-resend code has some problem.

Marcin, could you please give this one a try too? Good behavior would be 
a fully working kernel (no hung device) with no extra kernel messages. 
Bad behavior would be any extra kernel message or any non-working 
device.
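
The idea behind a software-based resend can be sketched in user space:
instead of asking the local APIC to raise the vector again, the pending irq
is recorded and its handler is run later from software context. The names
below (model_resend_bitmap, model_queue_resend, model_run_resends) are
hypothetical and this is not the code in kernel/irq/resend.c.

```c
#include <assert.h>

#define MODEL_NR_IRQS 32u

static unsigned long model_resend_bitmap;      /* one bit per pending irq */
static unsigned model_handled[MODEL_NR_IRQS];  /* handler run counts */

/* Record that an irq must be replayed; queueing twice is idempotent. */
static void model_queue_resend(unsigned irq)
{
	assert(irq < MODEL_NR_IRQS);
	model_resend_bitmap |= 1UL << irq;
}

/* Run the handler once for every queued irq; returns how many ran. */
static unsigned model_run_resends(void)
{
	unsigned ran = 0;

	for (unsigned irq = 0; irq < MODEL_NR_IRQS; irq++) {
		unsigned long bit = 1UL << irq;

		if (model_resend_bitmap & bit) {
			model_resend_bitmap &= ~bit;
			model_handled[irq]++;
			ran++;
		}
	}
	return ran;
}
```

Because the replay never goes through the interrupt controller, a test with
this approach separates "the chipset dislikes the resend" from "the driver
dislikes the extra handler call".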

	Ingo

----------------------------->
Subject: x86: activate HARDIRQS_SW_RESEND
From: Ingo Molnar <mingo@elte.hu>

activate the software-triggered IRQ-resend logic.

it appears some chipsets/cpus do not handle local-APIC driven IRQ
resends all that well, so always use the soft mechanism to trigger
the execution of pending interrupts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/Kconfig   |    4 ++++
 kernel/irq/manage.c |    8 ++++++++
 2 files changed, 12 insertions(+)

Index: linux/arch/i386/Kconfig
===================================================================
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -1270,6 +1270,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -181,6 +181,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:31                     ` Ingo Molnar
@ 2007-07-26  8:55                       ` Jarek Poplawski
  2007-07-26  9:12                         ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-26  8:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Marcin Ślusarz, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 10:31:20AM +0200, Ingo Molnar wrote:
...
> yeah. The patch below enables sw-resend on x86, to test the theory 
> whether the APIC-driven hardware-vector-resend code has some problem.

I think Marcin is using x86_64 (Athlon 64), though.

Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:10                   ` Thomas Gleixner
  2007-07-26  8:31                     ` Ingo Molnar
@ 2007-07-26  9:11                     ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-26  9:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marcin Ślusarz, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 10:10:31AM +0200, Thomas Gleixner wrote:
> On Thu, 2007-07-26 at 10:13 +0200, Jarek Poplawski wrote:
...
> > PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
> > give me some tip, how these re-triggered interrupts are skipped on dev's
> > reset before enable_irq?
> 
> I think the correct solution is really not to resend level type
> interrupts. If the interrupt line is still active, then the interrupt
> comes up by itself. I'm cooking a patch for that.
> 
> The other question is: 
> 
> Is the driver confused by the resent irq or is the chip-set unhappy
> about the resend ?
> 
> We could figure the latter out by activating the software based resend
> method.

I'm probably missing something, but isn't there a problem with level type
when the APIC re-triggers an interrupt that is acked by neither the driver
nor the card (after some hw reset/clear)?

Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:55                       ` Jarek Poplawski
@ 2007-07-26  9:12                         ` Ingo Molnar
  2007-07-30  7:29                           ` Marcin Ślusarz
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-26  9:12 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Thomas Gleixner, Marcin Ślusarz, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> On Thu, Jul 26, 2007 at 10:31:20AM +0200, Ingo Molnar wrote:
> ...
> > yeah. The patch below enables sw-resend on x86, to test the theory 
> > whether the APIC-driven hardware-vector-resend code has some problem.
> 
> I think Marcin is using x86_64 (Athlon 64) yet.

yeah - i meant to cover both arches but forgot about x86_64 - updated 
patch attached below.

	Ingo

----------------->
Subject: x86: activate HARDIRQS_SW_RESEND
From: Ingo Molnar <mingo@elte.hu>

activate the software-triggered IRQ-resend logic.

it appears some chipsets/cpus do not handle local-APIC driven IRQ
resends all that well, so always use the soft mechanism to trigger
the execution of pending interrupts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/Kconfig   |    4 ++++
 arch/x86_64/Kconfig |    4 ++++
 kernel/irq/manage.c |    8 ++++++++
 3 files changed, 16 insertions(+)

Index: linux-rt-rebase.q/arch/i386/Kconfig
===================================================================
--- linux-rt-rebase.q.orig/arch/i386/Kconfig
+++ linux-rt-rebase.q/arch/i386/Kconfig
@@ -1284,6 +1284,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
Index: linux-rt-rebase.q/arch/x86_64/Kconfig
===================================================================
--- linux-rt-rebase.q.orig/arch/x86_64/Kconfig
+++ linux-rt-rebase.q/arch/x86_64/Kconfig
@@ -721,6 +721,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
Index: linux-rt-rebase.q/kernel/irq/manage.c
===================================================================
--- linux-rt-rebase.q.orig/kernel/irq/manage.c
+++ linux-rt-rebase.q/kernel/irq/manage.c
@@ -175,6 +175,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  9:12                         ` Ingo Molnar
@ 2007-07-30  7:29                           ` Marcin Ślusarz
  2007-07-30  8:49                             ` Ingo Molnar
  2007-07-31 13:20                             ` Jarek Poplawski
  0 siblings, 2 replies; 68+ messages in thread
From: Marcin Ślusarz @ 2007-07-30  7:29 UTC (permalink / raw)
  To: Ingo Molnar, Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/26, Ingo Molnar <mingo@elte.hu>:
> (..)
> yeah - i meant to cover both arches but forgot about x86_64 - updated
> patch attached below.
>
>         Ingo
>
> ----------------->
> Subject: x86: activate HARDIRQS_SW_RESEND
> From: Ingo Molnar <mingo@elte.hu>
>
> activate the software-triggered IRQ-resend logic.
>
> it appears some chipsets/cpus do not handle local-APIC driven IRQ
> resends all that well, so always use the soft mechanism to trigger
> the execution of pending interrupts.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/i386/Kconfig   |    4 ++++
>  arch/x86_64/Kconfig |    4 ++++
>  kernel/irq/manage.c |    8 ++++++++
>  3 files changed, 16 insertions(+)
>
> Index: linux-rt-rebase.q/arch/i386/Kconfig
> ===================================================================
> --- linux-rt-rebase.q.orig/arch/i386/Kconfig
> +++ linux-rt-rebase.q/arch/i386/Kconfig
> @@ -1284,6 +1284,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  config X86_SMP
>         bool
>         depends on SMP && !X86_VOYAGER
> Index: linux-rt-rebase.q/arch/x86_64/Kconfig
> ===================================================================
> --- linux-rt-rebase.q.orig/arch/x86_64/Kconfig
> +++ linux-rt-rebase.q/arch/x86_64/Kconfig
> @@ -721,6 +721,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  menu "Power management options"
>
>  source kernel/power/Kconfig
> Index: linux-rt-rebase.q/kernel/irq/manage.c
> ===================================================================
> --- linux-rt-rebase.q.orig/kernel/irq/manage.c
> +++ linux-rt-rebase.q/kernel/irq/manage.c
> @@ -175,6 +175,14 @@ void enable_irq(unsigned int irq)
>                 desc->depth--;
>         }
>         spin_unlock_irqrestore(&desc->lock, flags);
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> +       /*
> +        * Do a bh disable/enable pair to trigger any pending
> +        * irq resend logic:
> +        */
> +       local_bh_disable();
> +       local_bh_enable();
> +#endif
>  }
>  EXPORT_SYMBOL(enable_irq);

This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out.

ps: I retested all patches posted in this thread on top of 2.6.22.1
and behavior from 2.6.21.3 didn't change. My next tests will be on
2.6.22.x only.

Regards,
Marcin Slusarz

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25 14:46                 ` Alan Cox
@ 2007-07-30  8:46                   ` Ingo Molnar
  2007-07-30 13:05                     ` Alan Cox
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-30  8:46 UTC (permalink / raw)
  To: Alan Cox
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds, Marcin Ślusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> Ok the logic behind the 8390 is very simple:

thanks for the explanation Alan! A few comments and a question:

> Things to know
> 	- IRQ delivery is asynchronous to the PCI bus
> 	- Blocking the local CPU IRQ via spin locks was too slow
> 	- The chip has register windows needing locking work
> 
> So the path was once (I say once as people appear to have changed it 
> in the mean time and it now looks rather bogus if the changes to use 
> disable_irq_nosync_irqsave are disabling the local IRQ)
> 
> 
> 	Take the page lock
> 	Mask the IRQ on chip
> 	Disable the IRQ (but not mask locally- someone seems to have
> 		broken this with the lock validator stuff)
> 		[This must be _nosync as the page lock may otherwise
> 			deadlock us]

( side-note: you can ignore the lock validator stuff here, the validator
  changes are supposed to be a NOP in the !lockdep case. Local irqs will
  only be disabled if the validator is running. This could cause dropped
  serial irqs on very old boxes but i doubt anyone will want to run the
  validator on those. )

> 	Drop the page lock and turn IRQs back on
> 	
> 	At this point an existing IRQ may still be running but we can't
> 	get a new one
> 
> 	Take the lock (so we know the IRQ has terminated) but don't mask
> the IRQs on the processor
> 	Set irqlock [for debug]
> 
> 	Transmit (slow as ****)
> 
> 	re-enable the IRQ
> 
> 
> We have to use disable_irq because otherwise you will get delayed 
> interrupts on the APIC bus deadlocking the transmit path.
> 
> Quite hairy but the chip simply wasn't designed for SMP and you can't 
> even ACK an interrupt without risking corrupting other parallel 
> activities on the chip.

So the whole locking is to be able to keep irqs enabled for a long time, 
without risking entry of the same IRQ handler on this same CPU, correct?

Marcin's test results suggest that if an IRQ is resent right at the 
enable_irq() point [be that via the hw irq-resend mechanism or the sw 
irq-resend mechanism], the hang happens.

In the previous 2.6.20 logic we'd not normally generate an IRQ at that 
point (because we masked the irq and the card itself deasserts the line 
so any level-triggered irq is now moot).

Once Thomas hacked off this resend mechanism for level-triggered irqs, 
Marcin saw the hangs go away.

So it seems to me that maybe the driver could be surprised via these 
spurious interrupts that happen right after the enable_irq(). Does the 
patch below make any sense in your opinion?

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
 	/* Turn 8390 interrupts back on. */
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+	/* force POST: */
+	ei_inb_p(e8390_base + EN0_IMR);

 	spin_unlock(&ei_local->page_lock);
 	enable_irq_lockdep_irqrestore(dev->irq, &flags);

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  7:29                           ` Marcin Ślusarz
@ 2007-07-30  8:49                             ` Ingo Molnar
  2007-08-01  7:24                               ` Marcin Ślusarz
  2007-07-31 13:20                             ` Jarek Poplawski
  1 sibling, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-07-30  8:49 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> > Subject: x86: activate HARDIRQS_SW_RESEND
> > From: Ingo Molnar <mingo@elte.hu>
> >
> > activate the software-triggered IRQ-resend logic.

> This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out.

ok. This makes it more likely that the driver itself (or the card) gets 
confused by the resend.

does the patch below fix those timeouts? It tests the theory whether any 
POST latency could expose this problem.

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
 	/* Turn 8390 interrupts back on. */
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+	/* force POST: */
+	ei_inb_p(e8390_base + EN0_IMR);
 
 	spin_unlock(&ei_local->page_lock);
 	enable_irq_lockdep_irqrestore(dev->irq, &flags);

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  8:46                   ` Ingo Molnar
@ 2007-07-30 13:05                     ` Alan Cox
  0 siblings, 0 replies; 68+ messages in thread
From: Alan Cox @ 2007-07-30 13:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds, Marcin Ślusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

> So the whole locking is to be able to keep irqs enabled for a long time, 
> without risking entry of the same IRQ handler on this same CPU, correct?

As implemented - on any CPU.

We also need to know that the IRQ handler is not doing useful work on
another processor which is why we take the lock after disabling the
interrupt line everywhere. Without that we might be completing an IRQ on
another CPU and that would race the transmit and make a nasty mess.

> So it seems to me that maybe the driver could be surprised via these 
> spurious interrupts that happen right after the enable_irq(). Does the 
> patch below make any sense in your opinion?

For MMIO it does look like that may be needed. Looks sensible.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  7:29                           ` Marcin Ślusarz
  2007-07-30  8:49                             ` Ingo Molnar
@ 2007-07-31 13:20                             ` Jarek Poplawski
  2007-08-06  7:00                               ` Marcin Ślusarz
  1 sibling, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-07-31 13:20 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Mon, Jul 30, 2007 at 09:29:38AM +0200, Marcin Ślusarz wrote:
...
> ps: I retested all patches posted in this thread on top of 2.6.22.1
> and behavior from 2.6.21.3 didn't change. My next tests will be on
> 2.6.22.x only.

Marcin,

I see you're quite busy, but if after testing this next Ingo's patch
you are alive yet, maybe you could try one more "idea"? No patch this
time, but if you could try this after adding boot option "noirqdebug"
(I'd like to be sure it's not about timing after all).

Cheers & thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  8:49                             ` Ingo Molnar
@ 2007-08-01  7:24                               ` Marcin Ślusarz
  2007-08-01  7:27                                 ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: Marcin Ślusarz @ 2007-08-01  7:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/30, Ingo Molnar <mingo@elte.hu>:
> (..)
> does the patch below fix those timeouts? It tests the theory whether any
> POST latency could expose this problem.
>
>         Ingo
>
> Index: linux/drivers/net/lib8390.c
> ===================================================================
> --- linux.orig/drivers/net/lib8390.c
> +++ linux/drivers/net/lib8390.c
> @@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff
>         /* Turn 8390 interrupts back on. */
>         ei_local->irqlock = 0;
>         ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
> +       /* force POST: */
> +       ei_inb_p(e8390_base + EN0_IMR);
>
>         spin_unlock(&ei_local->page_lock);
>         enable_irq_lockdep_irqrestore(dev->irq, &flags);
>

Bad news. It doesn't fix the problem.

Marcin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-01  7:24                               ` Marcin Ślusarz
@ 2007-08-01  7:27                                 ` Ingo Molnar
  2007-08-06  6:58                                   ` Marcin Ślusarz
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Molnar @ 2007-08-01  7:27 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> >         ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
> > +       /* force POST: */
> > +       ei_inb_p(e8390_base + EN0_IMR);
> >
> >         spin_unlock(&ei_local->page_lock);
> >         enable_irq_lockdep_irqrestore(dev->irq, &flags);
> >
> 
> Bad news. It doesn't fix the problem.

ok, it wasn't supposed to be _that_ easy i guess :-) Can you please 
(re-)confirm that the workaround below indeed fixes the hung card 
problem? (after producing a single WARN_ON message into the syslog)

also, does removing the ne2k-pci module and reinserting it again solve 
the problem too, or is your network card stuck forever once it got into 
that state?

	Ingo

----------------------->
From: Thomas Gleixner <tglx@linutronix.de>
Subject: genirq: temporary fix for level-triggered IRQ resend

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

	17:         12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question lib8390.c does

	disable_irq();
	fiddle_with_the_network_card_hardware()
	enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again.

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq, when an
interrupt happens during the time, where we are in the disabled region.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/irq/resend.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/kernel/irq/resend.c
===================================================================
--- linux.orig/kernel/irq/resend.c
+++ linux/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);

+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
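
The delayed-disable and resend sequence described in the changelog above can be sketched as a small user-space model in C. This is only an illustration under stated assumptions, not kernel code: `toy_irq`, `toy_deliver()`, `toy_scenario()` and the `TOY_*` flags are hypothetical names that merely mirror IRQ_DISABLED, IRQ_PENDING and IRQ_REPLAY, and the `skip_level_resend` switch stands in for the temporary hack in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy status bits; they only mirror the kernel's IRQ_* flags by name. */
enum { TOY_DISABLED = 1, TOY_PENDING = 2, TOY_REPLAY = 4 };

struct toy_irq {
	unsigned int status;
	bool hw_masked;	/* line masked at the (modelled) interrupt controller */
	bool level;	/* true for a level-triggered line */
	int handled;	/* how often the driver's handler ran */
	int resent;	/* how often enable triggered a resend */
};

/* Delayed disable: only a software flag is set, the hardware is untouched. */
static void toy_disable_irq(struct toy_irq *d)
{
	d->status |= TOY_DISABLED;
}

/* An interrupt arrives, either from the device or from a resend. */
static void toy_deliver(struct toy_irq *d)
{
	if (d->hw_masked)
		return;
	if (d->status & TOY_DISABLED) {
		/* Disabled in software: remember it and mask for real now. */
		d->status |= TOY_PENDING;
		d->hw_masked = true;
		return;
	}
	d->status &= ~TOY_REPLAY;
	d->handled++;
}

/* Enable: unmask, then replay a pending interrupt unless told to skip it. */
static void toy_enable_irq(struct toy_irq *d, bool skip_level_resend)
{
	d->status &= ~TOY_DISABLED;
	d->hw_masked = false;
	if (skip_level_resend && d->level)
		return;	/* rely on the line itself if it is still active */
	if ((d->status & (TOY_PENDING | TOY_REPLAY)) == TOY_PENDING) {
		d->status = (d->status & ~TOY_PENDING) | TOY_REPLAY;
		d->resent++;
		toy_deliver(d);
	}
}

/*
 * disable_irq(); fiddle with the card (raises and then clears an irq);
 * enable_irq(). Returns how often the handler ran afterwards.
 */
static int toy_scenario(bool skip_level_resend)
{
	struct toy_irq d = { .level = true };

	toy_disable_irq(&d);
	toy_deliver(&d);	/* irq raised while the line is soft-disabled */
	assert(d.hw_masked && (d.status & TOY_PENDING));
	/* the driver clears the cause on the card, so the line goes idle */
	toy_enable_irq(&d, skip_level_resend);
	return d.handled;
}
```

In this model `toy_scenario(false)` runs the handler once for an interrupt the card no longer asserts, while `toy_scenario(true)` does not run it at all. Whether the real driver or chipset is upset by that extra invocation is exactly the open question in this thread; the sketch does not answer it.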

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-01  7:27                                 ` Ingo Molnar
@ 2007-08-06  6:58                                   ` Marcin Ślusarz
  0 siblings, 0 replies; 68+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  6:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/1, Ingo Molnar <mingo@elte.hu>:
> ok, it wasnt supposed to be _that_ easy i guess :-) Can you please
> (re-)confirm that the workaround below indeed fixes the hung card
> problem? (after producing a single WARN_ON message into the syslog)
yes, with this patch everything works fine

end of dmesg:

EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 1020112k swap on /dev/sda2.  Priority:-1 extents:1 across:1020112k
skge eth1: enabling interface
NET: Registered protocol family 17
WARNING: at kernel/irq/resend.c:70 check_irq_resend()

Call Trace:
 [<ffffffff8025e5c8>] check_irq_resend+0xa8/0xc0
 [<ffffffff8025e1ca>] enable_irq+0xea/0xf0
 [<ffffffff8800f21d>] :8390:ei_start_xmit+0x14d/0x30c
 [<ffffffff8052b5ce>] dev_hard_start_xmit+0x26e/0x2d0
 [<ffffffff80539b10>] __qdisc_run+0xc0/0x1f0
 [<ffffffff8052db9f>] dev_queue_xmit+0x24f/0x310
 [<ffffffff880d7ac9>] :af_packet:packet_sendmsg+0x259/0x2c0
 [<ffffffff8051f0bf>] sock_sendmsg+0xdf/0x110
 [<ffffffff8024b8c9>] trace_hardirqs_on+0xd9/0x180
 [<ffffffff8024c1dd>] __lock_acquire+0x31d/0xff0
 [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
 [<ffffffff803e3103>] __up_read+0x23/0xb0
 [<ffffffff803e3125>] __up_read+0x45/0xb0
 [<ffffffff805bd8f5>] _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff8024b8c9>] trace_hardirqs_on+0xd9/0x180
 [<ffffffff803e3125>] __up_read+0x45/0xb0
 [<ffffffff802464b6>] up_read+0x26/0x30
 [<ffffffff8051f4f1>] sys_sendto+0x111/0x150
 [<ffffffff8024b8c9>] trace_hardirqs_on+0xd9/0x180
 [<ffffffff805bd93b>] _spin_unlock_irq+0x2b/0x60
 [<ffffffff8023861a>] do_sigaction+0x11a/0x1d0
 [<ffffffff802097fe>] system_call+0x7e/0x83

Marking TSC unstable due to cpufreq changes
Time: acpi_pm clocksource has been installed.

> also, does removing the ne2k-pci module and reinserting it again solve
> the problem too, or is your network card stuck forever once it got into
> that state?
it doesn't change anything - i tried reloading both modules (ne2k_pci and skge)

Marcin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-31 13:20                             ` Jarek Poplawski
@ 2007-08-06  7:00                               ` Marcin Ślusarz
  2007-08-06  7:03                                 ` Ingo Molnar
  0 siblings, 1 reply; 68+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  7:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/31, Jarek Poplawski <jarkao2@o2.pl>:
> Marcin,
>
> I see you're quite busy, but if after testing this next Ingo's patch
> you are alive yet, maybe you could try one more "idea"? No patch this
> time, but if you could try this after adding boot option "noirqdebug"
> (I'd like to be sure it's not about timing after all).
It didn't change anything. Network card still timed out.

Marcin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:00                               ` Marcin Ślusarz
@ 2007-08-06  7:03                                 ` Ingo Molnar
  2007-08-06 17:43                                   ` Chuck Ebbert
  2007-08-07  7:46                                   ` Marcin Ślusarz
  0 siblings, 2 replies; 68+ messages in thread
From: Ingo Molnar @ 2007-08-06  7:03 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> 2007/7/31, Jarek Poplawski <jarkao2@o2.pl>:
> > Marcin,
> >
> > I see you're quite busy, but if after testing this next Ingo's patch
> > you are alive yet, maybe you could try one more "idea"? No patch this
> > time, but if you could try this after adding boot option "noirqdebug"
> > (I'd like to be sure it's not about timing after all).
> It didn't change anything. Network card still timed out.

please try Jarek's second patch too - there was a missing unmask.

	Ingo

-------------->
Subject: genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski <jarkao2@o2.pl>

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

Marcin Slusarz found the main cause of the problems here, pointed out
the broken patch and made the first patch which can fix this.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended. 
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need these retriggers very much, I think
this "don't mask interrupts by default" idea should be rethought:
is there enough gain to risk such hard to diagnose errors?

So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru

 	spin_lock(&desc->lock);

-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;

 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru

 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str

 	spin_lock(&desc->lock);

-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;

 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str

 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
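
The case this patch targets, a second interrupt arriving while the first is still being serviced and the line has meanwhile been soft-disabled, can be modelled in user-space C. This is a hedged sketch only: `toy_desc`, `toy_fasteoi_entry()`, `toy_fasteoi_exit()` and `toy_nested_arrival()` are hypothetical names, the `F_*` bits merely mirror IRQ_INPROGRESS, IRQ_DISABLED and IRQ_PENDING, and the `patched` flag selects between the old and the new control flow of handle_fasteoi_irq() as shown in the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy status bits; they only mirror the kernel's IRQ_* flags by name. */
enum { F_INPROGRESS = 1, F_DISABLED = 2, F_PENDING = 4 };

struct toy_desc {
	unsigned int status;
	bool masked;	/* line masked at the (modelled) controller */
	int eoi_count;	/* how often eoi was issued */
};

/* Entry half of the flow handler; returns true if the action should run. */
static bool toy_fasteoi_entry(struct toy_desc *d, bool patched)
{
	unsigned int busy = F_DISABLED;

	if (!patched && (d->status & F_INPROGRESS)) {
		/* old flow: "goto out", i.e. eoi only, no mask, no pending */
		d->eoi_count++;
		return false;
	}
	if (patched)
		busy |= F_INPROGRESS;
	if (d->status & busy) {
		/* running or disabled: note it, mask the line, then eoi */
		d->status |= F_PENDING;
		d->masked = true;
		d->eoi_count++;
		return false;
	}
	d->status |= F_INPROGRESS;
	return true;
}

/* Exit half: the action has finished. */
static void toy_fasteoi_exit(struct toy_desc *d, bool patched)
{
	d->status &= ~F_INPROGRESS;
	/* new flow: undo a mask taken above unless the line is disabled */
	if (patched && !(d->status & F_DISABLED))
		d->masked = false;
	d->eoi_count++;
}

/* First irq is in service, the line gets soft-disabled, a second irq arrives. */
static struct toy_desc toy_nested_arrival(bool patched)
{
	struct toy_desc d = { 0 };
	bool run = toy_fasteoi_entry(&d, patched);

	assert(run);			/* first irq starts the action */
	d.status |= F_DISABLED;		/* delayed disable from elsewhere */
	run = toy_fasteoi_entry(&d, patched);
	assert(!run);			/* second irq never runs the action */
	return d;
}
```

With `patched == false` the second interrupt leaves the line unmasked and not marked pending, so the delayed disable is skipped; with `patched == true` the line ends up masked and pending, and `toy_fasteoi_exit()` only unmasks it again once the disabled flag is gone.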

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:03                                 ` Ingo Molnar
@ 2007-08-06 17:43                                   ` Chuck Ebbert
  2007-08-06 19:08                                     ` Ingo Molnar
  2007-08-07 10:09                                     ` Jarek Poplawski
  2007-08-07  7:46                                   ` Marcin Ślusarz
  1 sibling, 2 replies; 68+ messages in thread
From: Chuck Ebbert @ 2007-08-06 17:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcin Ślusarz, Jarek Poplawski, Thomas Gleixner,
	Linus Torvalds, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Alan Cox

On 08/06/2007 03:03 AM, Ingo Molnar wrote:
> 
> But, since level types don't need these retriggers very much, I think
> this "don't mask interrupts by default" idea should be rethought:
> is there enough gain to risk such hard to diagnose errors?
>   
> 

I reverted those masking changes in Fedora and the baffling problem
with 3Com 3C905 network adapters went away.

Before, they would print:

eth0: transmit timed out, tx_status 00 status e601.
  diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 295757(13) current 295757(13)
  Transmit list 00000000 vs. f7150a20.
  0: @f7150200  length 80000070 status 0c010070
  1: @f71502a0  length 80000070 status 0c010070
  2: @f7150340  length 8000005c status 0c01005c

Now they just work, apparently...

So why not just revert the change?


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 17:43                                   ` Chuck Ebbert
@ 2007-08-06 19:08                                     ` Ingo Molnar
  2007-08-07 10:09                                     ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Ingo Molnar @ 2007-08-06 19:08 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Marcin Ślusarz, Jarek Poplawski, Thomas Gleixner,
	Linus Torvalds, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Alan Cox

* Chuck Ebbert <cebbert@redhat.com> wrote:

> Before, they would print:
> 
> eth0: transmit timed out, tx_status 00 status e601.
>   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>   Flags; bus-master 1, dirty 295757(13) current 295757(13)
>   Transmit list 00000000 vs. f7150a20.
>   0: @f7150200  length 80000070 status 0c010070
>   1: @f71502a0  length 80000070 status 0c010070
>   2: @f7150340  length 8000005c status 0c01005c
> 
> Now they just work, apparently...

could you please try the patch below? If this doesn't do the trick then i 
guess we need to revert that change.

	Ingo

------------>
(take 2)

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

Marcin Slusarz found the main cause of the problems here, pointed out
the broken patch and made the first patch which can fix this.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended. 
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need these retriggers very much, I think
this "don't mask interrupts by default" idea should be rethought:
is there enough gain to risk such hard to diagnose errors?

So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru

 	spin_lock(&desc->lock);

-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;

 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru

 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str

 	spin_lock(&desc->lock);

-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;

 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str

 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);

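For readers following the patch, the intended control flow can be
sketched as a small user-space model. This is only an illustration,
not kernel code: the SIM_* flag values, struct sim_irq and the two
helper functions are hypothetical names, and the step that sets
"in progress" and clears "pending" before the handler runs is assumed,
since that part of handle_fasteoi_irq() lies between the hunks shown
above.

```c
#include <stdbool.h>

/* Toy status bits; the values are illustrative, not the kernel's. */
enum {
    SIM_INPROGRESS = 1 << 0,
    SIM_DISABLED   = 1 << 1,
    SIM_PENDING    = 1 << 2,
};

struct sim_irq {
    unsigned status;
    bool masked;      /* line masked at the simulated controller */
    bool has_action;  /* a handler is registered */
    int handled;      /* how many times the handler body ran */
};

/*
 * Entry half of the patched flow: an irq that is already being
 * serviced, is disabled, or has no handler is marked pending and
 * masked instead of being silently skipped.
 * Returns true if the handler should run.
 */
static bool sim_fasteoi_enter(struct sim_irq *d)
{
    if (!d->has_action || (d->status & (SIM_INPROGRESS | SIM_DISABLED))) {
        d->status |= SIM_PENDING;
        d->masked = true;
        return false;
    }
    /* Assumed: mirrors the code between the two hunks of the patch. */
    d->status |= SIM_INPROGRESS;
    d->status &= ~SIM_PENDING;
    return true;
}

/*
 * Exit half of the patched flow: after the handler ran, clear the
 * in-progress bit and unmask again unless the irq is disabled.
 */
static void sim_fasteoi_exit(struct sim_irq *d)
{
    d->handled++;
    d->status &= ~SIM_INPROGRESS;
    if (!(d->status & SIM_DISABLED))
        d->masked = false;
}
```

In this model a second arrival during servicing leaves the line masked
and pending until sim_fasteoi_exit() unmasks it, which is the "missing
unmask" the second version of the patch adds.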
^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-06 19:36 Jean-Baptiste Vignaud
  0 siblings, 0 replies; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-06 19:36 UTC (permalink / raw)
  To: mingo
  Cc: cebbert, marcin.slusarz, jarkao2, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

> * Chuck Ebbert <cebbert@redhat.com> wrote:
> 
> > Before, they would print:
> > 
> > eth0: transmit timed out, tx_status 00 status e601.
> >   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> > eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
> >   Flags; bus-master 1, dirty 295757(13) current 295757(13)
> >   Transmit list 00000000 vs. f7150a20.
> >   0: @f7150200  length 80000070 status 0c010070
> >   1: @f71502a0  length 80000070 status 0c010070
> >   2: @f7150340  length 8000005c status 0c01005c
> > 
> > Now they just work, apparently...
> 
> could you please try the patch below? If this doesnt do the trick then i 
> guess we need to revert that change.

I confirm that the latest Fedora kernel 2.6.22.1-41.fc7 (with the removal of [PATCH] genirq: do not mask interrupts by default) has been working on my machine for 3 days.

Atm I'm still stressing the network (2 * 3com cards + 1 onboard nvidia card) to be sure.

Jb




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-06 20:42 Jean-Baptiste Vignaud
  2007-08-06 21:19 ` Chuck Ebbert
  2007-08-06 21:30 ` Al Boldi
  0 siblings, 2 replies; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-06 20:42 UTC (permalink / raw)
  To: mingo
  Cc: cebbert, marcin.slusarz, jarkao2, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

Mmm, bad news: after 4 hours of intensive network stressing, one of the two 3Com cards failed with the latest Fedora kernel.

Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a fifo 8000
Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
Aug  6 22:31:09 loki kernel:   Transmit list 00000000 vs. ffff81007c807700.

Stressing eth2 by copying large files to a Samba share and eth0 by downloading big files from the internet.

Jb

> 
> * Chuck Ebbert <cebbert@redhat.com> wrote:
> 
> > Before, they would print:
> > 
> > eth0: transmit timed out, tx_status 00 status e601.
> >   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> > eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
> >   Flags; bus-master 1, dirty 295757(13) current 295757(13)
> >   Transmit list 00000000 vs. f7150a20.
> >   0: @f7150200  length 80000070 status 0c010070
> >   1: @f71502a0  length 80000070 status 0c010070
> >   2: @f7150340  length 8000005c status 0c01005c
> > 
> > Now they just work, apparently...
> 
> could you please try the patch below? If this doesnt do the trick then i 
> guess we need to revert that change.
> 
> 	Ingo
> 
> ------------>
> (take 2)
> 
> Subject: genirq: fix simple and fasteoi irq handlers
> 
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they happen.
> But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
> more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
> driver's work.
> 
> The main reason of problems here, pointing the broken patch and making
> the first patch which can fix this was done by Marcin Slusarz.
> Additional test patches of Thomas Gleixner and Ingo Molnar tested by
> Marcin Slusarz helped to narrow possible reasons even more. Thanks.
> 
> PS: this patch fixes only one evident error here, but there could be
> more places affected by above-mentioned change in irq handling.
> 
> PS 2:
> After rethinking, IMHO, there are two most probable scenarios here:
> 
> 1. After hw resend there could be a conflict between retriggered
> edge type irq and the next level type one: e.g. if this level type
> irq (io_apic is enabled then) is triggered while retriggered irq is
> serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
> the next such levels are triggered and looping, so probably kind of
> flood in io_apic until this retriggered edge service has ended. 
> 2. There is something wrong with ioapic_retrigger_irq (less probable
> because this should be probably seen with 'normal' edge retriggers,
> but on the other hand, they could be less common).
> 
> So, if there is #1, this fixed patch should work.
> 
> But, since level types don't need this retriggers too much I think
> this "don't mask interrupts by default" idea should be rethinked:
> is there enough gain to risk such hard to diagnose errors?
>   
> So, IMHO, there should be at least possibility to turn this off for
> level types in config (it should be a visible option, so people could
> find & try this before writing for help or changing a network card).
> 
> 
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> ---
> 
> diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> --- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
> @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
>  
>  	spin_lock(&desc->lock);
>  
> -	if (unlikely(desc->status & IRQ_INPROGRESS))
> -		goto out_unlock;
>  	kstat_cpu(cpu).irqs[irq]++;
>  
>  	action = desc->action;
> -	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +						 IRQ_DISABLED)))) {
>  		if (desc->chip->mask)
>  			desc->chip->mask(irq);
>  		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> @@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
>  
>  	spin_lock(&desc->lock);
>  	desc->status &= ~IRQ_INPROGRESS;
> +	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +		desc->chip->unmask(irq);
>  out_unlock:
>  	spin_unlock(&desc->lock);
>  }
> @@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
>  
>  	spin_lock(&desc->lock);
>  
> -	if (unlikely(desc->status & IRQ_INPROGRESS))
> -		goto out;
> -
>  	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
>  	kstat_cpu(cpu).irqs[irq]++;
>  
>  	/*
> -	 * If its disabled or no action available
> +	 * If it's running, disabled or no action available
>  	 * then mask it and get out of here:
>  	 */
>  	action = desc->action;
> -	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +						 IRQ_DISABLED)))) {
>  		desc->status |= IRQ_PENDING;
>  		if (desc->chip->mask)
>  			desc->chip->mask(irq);
> @@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
>  
>  	spin_lock(&desc->lock);
>  	desc->status &= ~IRQ_INPROGRESS;
> +	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +		desc->chip->unmask(irq);
>  out:
>  	desc->chip->eoi(irq);
>  
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 20:42 Jean-Baptiste Vignaud
@ 2007-08-06 21:19 ` Chuck Ebbert
  2007-08-07  7:26   ` Jarek Poplawski
  2007-08-06 21:30 ` Al Boldi
  1 sibling, 1 reply; 68+ messages in thread
From: Chuck Ebbert @ 2007-08-06 21:19 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: mingo, marcin.slusarz, jarkao2, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote:
> Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 3com card failed with the latest fedora kernel.
> 
> Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
> Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
> Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a fifo 8000
> Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
> Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
> Aug  6 22:31:09 loki kernel:   Transmit list 00000000 vs. ffff81007c807700.
> 
> Stressing eth2 by copying large files on a samba on share and eth0 by downloading big files on the internet.

So even the full revert doesn't fix the 3Com driver; it just makes the
failure less likely.

The other patch probably won't be any better -- I'd guess there's some
kind of IRQ handling bug in that driver.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 20:42 Jean-Baptiste Vignaud
  2007-08-06 21:19 ` Chuck Ebbert
@ 2007-08-06 21:30 ` Al Boldi
  1 sibling, 0 replies; 68+ messages in thread
From: Al Boldi @ 2007-08-06 21:30 UTC (permalink / raw)
  To: netdev, linux-net; +Cc: linux-kernel

Jean-Baptiste Vignaud wrote:
> Mmm, bad news, after 4 hours of intensive network stressing, one of the 2
> 3com card failed with the latest fedora kernel.
>
> Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
> Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status
> e601. Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma
> 0000003a fifo 8000 Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but
> not delivered -- IRQ blocked by another device? Aug  6 22:31:09 loki
> kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8) Aug 
> 6 22:31:09 loki kernel:   Transmit list 00000000 vs. ffff81007c807700.
>
> Stressing eth2 by copying large files on a samba on share and eth0 by
> downloading big files on the internet.

Next time you want to stress your network you may want to try this:

  # ping 10.1 -s8 -f -l9

or

  # ping 10.1 -s8 -A > /dev/null

BTW, as I mentioned before, there may be a BIOS IRQ config mismatch before
booting the kernel.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 21:19 ` Chuck Ebbert
@ 2007-08-07  7:26   ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07  7:26 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Jean-Baptiste Vignaud, mingo, marcin.slusarz, tglx, torvalds,
	linux-kernel, shemminger, linux-net, netdev, akpm, alan

On Mon, Aug 06, 2007 at 05:19:03PM -0400, Chuck Ebbert wrote:
> On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote:
> > Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 3com card failed with the latest fedora kernel.
> > 
> > Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
> > Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
> > Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a fifo 8000
> > Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
> > Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
> > Aug  6 22:31:09 loki kernel:   Transmit list 00000000 vs. ffff81007c807700.
> > 
> > Stressing eth2 by copying large files on a samba on share and eth0 by downloading big files on the internet.
> 
> So even the full revert doesn't fix the 3Com driver, it just makes it less
> likely to do that.
> 
> The other patch probably won't be any better -- I'd guess there's some
> kind of IRQ handling bug in that driver.
> 

I don't know how fast these 3com chips are compared to the 8390s
described by Alan, or how irqs are shared on Jean-Baptiste's box,
but I'm surprised they could have worked while sharing interrupts,
without such timeouts, before this change in 2.6.21. It seems some
of those older chips, because of slowness, could have transmit
problems even without irq sharing. So, IMHO, if possible, irq
sharing should never be enabled between two (or more) drivers
that both use disable_irq.
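
The concern about disable_irq on a shared line can be illustrated with
a toy model. It is only a sketch under simplifying assumptions: struct
shared_line and the line_* helpers are hypothetical names, the line is
shared by exactly two devices, and the only kernel behaviour modelled
is that disable calls nest and block delivery for every device on the
line until they are all undone.

```c
#include <stdbool.h>

struct shared_line {
    int depth;         /* nesting count of outstanding disable calls */
    int delivered[2];  /* interrupts delivered, per device on the line */
    int blocked;       /* interrupts that arrived while disabled */
};

/* One driver disables the whole line, not just its own device. */
static void line_disable(struct shared_line *l)
{
    l->depth++;
}

/* Delivery resumes only when every disable has been undone. */
static void line_enable(struct shared_line *l)
{
    if (l->depth > 0)
        l->depth--;
}

/* Device dev (0 or 1) raises the line; returns true if it was delivered. */
static bool line_raise(struct shared_line *l, int dev)
{
    if (l->depth > 0) {
        l->blocked++;
        return false;
    }
    l->delivered[dev]++;
    return true;
}
```

While device 0's driver holds the line disabled, every line_raise() from
device 1 is blocked as well, so a slow critical section in one driver
can look like a transmit timeout in the other.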

These timeout problems were reported a long time ago, but I think
it would be nice if this thread could at least remove the new
problems reported only after 2.6.21, which now seems possible
after Marcin's diagnosis: by reverting the whole 2.6.21 patch
or by the current temporary patch in 2.6.23-rc2's resend.c.
It would be nice if you could try this patch too.

BTW: Jean-Baptiste, could you send or point to your current configs?
I mean at least /proc/interrupts, but with dmesg and .config it would
be even better. (I assume this last report was about the revert patch
mentioned by Chuck, not the one below your message?)

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:03                                 ` Ingo Molnar
  2007-08-06 17:43                                   ` Chuck Ebbert
@ 2007-08-07  7:46                                   ` Marcin Ślusarz
  2007-08-07  8:23                                     ` Jarek Poplawski
  1 sibling, 1 reply; 68+ messages in thread
From: Marcin Ślusarz @ 2007-08-07  7:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/6, Ingo Molnar <mingo@elte.hu>:
> (..)
> please try Jarek's second patch too - there was a missing unmask.
>
>         Ingo
>
> -------------->
> Subject: genirq: fix simple and fasteoi irq handlers
> From: Jarek Poplawski <jarkao2@o2.pl>
>
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they happen.
> But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
> more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
> driver's work.
>
> The main reason of problems here, pointing the broken patch and making
> the first patch which can fix this was done by Marcin Slusarz.
> Additional test patches of Thomas Gleixner and Ingo Molnar tested by
> Marcin Slusarz helped to narrow possible reasons even more. Thanks.
>
> PS: this patch fixes only one evident error here, but there could be
> more places affected by above-mentioned change in irq handling.
>
> PS 2:
> After rethinking, IMHO, there are two most probable scenarios here:
>
> 1. After hw resend there could be a conflict between retriggered
> edge type irq and the next level type one: e.g. if this level type
> irq (io_apic is enabled then) is triggered while retriggered irq is
> serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
> the next such levels are triggered and looping, so probably kind of
> flood in io_apic until this retriggered edge service has ended.
> 2. There is something wrong with ioapic_retrigger_irq (less probable
> because this should be probably seen with 'normal' edge retriggers,
> but on the other hand, they could be less common).
>
> So, if there is #1, this fixed patch should work.
>
> But, since level types don't need this retriggers too much I think
> this "don't mask interrupts by default" idea should be rethinked:
> is there enough gain to risk such hard to diagnose errors?
>
> So, IMHO, there should be at least possibility to turn this off for
> level types in config (it should be a visible option, so people could
> find & try this before writing for help or changing a network card).
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
>
> ---
>
> diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> --- 2.6.23-rc1-/kernel/irq/chip.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.23-rc1/kernel/irq/chip.c        2007-08-05 21:49:46.000000000 +0200
> @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out_unlock;
>         kstat_cpu(cpu).irqs[irq]++;
>
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
>                 desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> @@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>         desc->status &= ~IRQ_INPROGRESS;
> +       if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +               desc->chip->unmask(irq);
>  out_unlock:
>         spin_unlock(&desc->lock);
>  }
> @@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out;
> -
>         desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
>         kstat_cpu(cpu).irqs[irq]++;
>
>         /*
> -        * If its disabled or no action available
> +        * If it's running, disabled or no action available
>          * then mask it and get out of here:
>          */
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 desc->status |= IRQ_PENDING;
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
> @@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>         desc->status &= ~IRQ_INPROGRESS;
> +       if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +               desc->chip->unmask(irq);
>  out:
>         desc->chip->eoi(irq);
>
>
Network card still locks up (tested on 2.6.22.1). I had to upload more
data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
might be a coincidence...

Marcin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-07  8:10 Jean-Baptiste Vignaud
  2007-08-07  9:05 ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-07  8:10 UTC (permalink / raw)
  To: jarkao2
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan


> BTW: Jean-Baptiste, could you send or point to your current configs?
> I mean at least /proc/interrupts, but with dmesg and .config it would
> be even better. (I assume this last report was about the revert patch
> mentioned by Chuck, not the one below your message?)

Sure. 

The last reports are with the 2.6.22.1-41.fc7 kernel, which has this in its changelog:

* Sat Jul 28 2007 Chuck Ebbert <cebbert@redhat.com>
- revert upstream "genirq: do not mask interrupts by default"


* interrupts (I use irqbalance, but the problem was the same without it)

[root@loki ~]# cat /proc/interrupts 
           CPU0       CPU1       
  0:       4487    4910668   IO-APIC-edge      timer
  1:        241         58   IO-APIC-edge      i8042
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          2        139   IO-APIC-edge      i8042
 14:          0          0   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
 16:      72625         96   IO-APIC-fasteoi   eth1
 17:       4667        128   IO-APIC-fasteoi   eth2
 20:       4156      39870   IO-APIC-fasteoi   sata_nv
 21:      34794       9177   IO-APIC-fasteoi   sata_nv
 22:          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
 23:       6005       1565   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
2297:          3     492180   PCI-MSI-edge      eth0
NMI:          0          0 
LOC:    4915345    4915282 
ERR:          0

Problems are with eth1 and eth2 here. I never had any problems with the onboard card (eth0).
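
Since the question in this thread is whether the failing cards share an
interrupt line, a row of the listing above can be checked mechanically.
The following is a minimal sketch, not a general parser:
count_irq_handlers() and skip_field() are hypothetical helpers, and the
code assumes the row layout shown above (a label, one counter per CPU,
a single-word chip type, then a comma-separated handler list).

```c
#include <ctype.h>

/* Skip leading whitespace and then one whitespace-delimited field. */
static const char *skip_field(const char *p)
{
    while (*p && isspace((unsigned char)*p))
        p++;
    while (*p && !isspace((unsigned char)*p))
        p++;
    return p;
}

/*
 * Count the handlers named on one /proc/interrupts row for a machine
 * with ncpus counter columns. Returns 0 for rows that carry no handler
 * list (for example NMI, LOC, ERR); a result above 1 means the line
 * is shared.
 */
static int count_irq_handlers(const char *line, int ncpus)
{
    const char *p = skip_field(line);   /* "23:" style label */
    int i, n;

    for (i = 0; i < ncpus; i++)
        p = skip_field(p);              /* per-CPU counters */
    p = skip_field(p);                  /* chip type, e.g. IO-APIC-fasteoi */

    while (*p && isspace((unsigned char)*p))
        p++;
    if (!*p)
        return 0;

    n = 1;
    for (; *p; p++)
        if (*p == ',')
            n++;
    return n;
}
```

Applied to the rows above with ncpus set to 2, this gives one handler
each for IRQ 16 (eth1) and IRQ 17 (eth2), and two for IRQ 23
(ohci_hcd:usb1, sata_nv), so in this listing the two 3Com cards are not
sharing a line with each other.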

* pci

00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:01.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:06.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
01:07.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
07:00.0 VGA compatible controller: nVidia Corporation NV44 [GeForce 6200 LE] (rev a1)

* dmesg (from a reboot this morning)

Linux version 2.6.22.1-41.fc7 (kojibuilder@xenbuilder1.fedora.redhat.com) (gcc version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Fri Jul 27 18:21:43 EDT 2007
Command line: ro root=/dev/all/root
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fee0000 (usable)
 BIOS-e820: 000000007fee0000 - 000000007fee3000 (ACPI NVS)
 BIOS-e820: 000000007fee3000 - 000000007fef0000 (ACPI data)
 BIOS-e820: 000000007fef0000 - 000000007ff00000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 3200 used
Entering add_active_range(0, 256, 524000) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.4 present.
ACPI: RSDP 000F7620, 0024 (r2 Nvidia)
ACPI: XSDT 7FEE30C0, 0044 (r1 Nvidia ASUSACPI 42302E31 AWRD        0)
ACPI: FACP 7FEEC400, 00F4 (r3 Nvidia ASUSACPI 42302E31 AWRD        0)
ACPI: DSDT 7FEE3240, 9164 (r1 NVIDIA AWRDACPI     1000 MSFT  3000000)
ACPI: FACS 7FEE0000, 0040
ACPI: HPET 7FEEC600, 0038 (r1 Nvidia ASUSACPI 42302E31 AWRD       98)
ACPI: MCFG 7FEEC680, 003C (r1 Nvidia ASUSACPI 42302E31 AWRD        0)
ACPI: APIC 7FEEC540, 007C (r1 Nvidia ASUSACPI 42302E31 AWRD        0)
Scanning NUMA topology in Northbridge 24
No NUMA configuration found
Faking a node at 0000000000000000-000000007fee0000
Entering add_active_range(0, 0, 159) 0 entries of 3200 used
Entering add_active_range(0, 256, 524000) 1 entries of 3200 used
Bootmem setup node 0 0000000000000000-000000007fee0000
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
early_node_map[2] active PFN ranges
    0:        0 ->      159
    0:      256 ->   524000
On node 0 totalpages: 523903
  DMA zone: 56 pages used for memmap
  DMA zone: 1300 pages reserved
  DMA zone: 2643 pages, LIFO batch:0
  DMA32 zone: 7108 pages used for memmap
  DMA32 zone: 512796 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to flat
ACPI: HPET id: 0x10de8201 base: 0xfefff000
Using ACPI (MADT) for SMP configuration information
swsusp: Registered nosave memory region: 000000000009f000 - 00000000000a0000
swsusp: Registered nosave memory region: 00000000000a0000 - 00000000000f0000
swsusp: Registered nosave memory region: 00000000000f0000 - 0000000000100000
Allocating PCI resources starting at 80000000 (gap: 7ff00000:70100000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
PERCPU: Allocating 40968 bytes of per cpu data
Built 1 zonelists.  Total pages: 515439
Kernel command line: ro root=/dev/all/root
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Extended CMOS year: 2000
Marking TSC unstable due to TSCs unsynchronized
time.c: Detected 2009.257 MHz processor.
Console: colour VGA+ 80x25
Checking aperture...
CPU 0: aperture @ 64000000 size 32 MB
Aperture too small (32 MB)
No AGP bridge found
Memory: 2057524k/2096000k available (2362k kernel code, 38088k reserved, 1401k data, 312k init)
SLUB: Genslabs=23, HWalign=64, Order=0-1, MinObjects=4, CPUs=2, Nodes=1
Calibrating delay using timer specific routine.. 4020.98 BogoMIPS (lpj=2010494)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0/0 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
SMP alternatives: switching to UP code
ACPI: Core revision 20070126
Using local APIC timer interrupts.
result 12557855
Detected 12.557 MHz APIC timer.
SMP alternatives: switching to SMP code
Booting processor 1/2 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 4018.58 BogoMIPS (lpj=2009293)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 1/1 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ stepping 02
Brought up 2 CPUs
sizeof(vma)=176 bytes
sizeof(page)=56 bytes
sizeof(inode)=560 bytes
sizeof(dentry)=208 bytes
sizeof(ext3inode)=760 bytes
sizeof(buffer_head)=104 bytes
sizeof(skbuff)=232 bytes
sizeof(task_struct)=2048 bytes
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using MMCONFIG at f0000000 - f3ffffff
PCI: No mmconfig possible on device 00:18
ACPI: Interpreter enabled
ACPI: (supports S0 S1 S3 S4 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
PCI: Transparent bridge - 0000:00:06.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 5 7 9 10 *11 14 15)
ACPI: PCI Interrupt Link [LNK2] (IRQs 5 *7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNK3] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK4] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK5] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK6] (IRQs 5 7 9 *10 11 14 15)
ACPI: PCI Interrupt Link [LNK7] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK8] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LP2P] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LUBA] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LMAC] (IRQs 5 7 9 *10 11 14 15)
ACPI: PCI Interrupt Link [LAZA] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LPMU] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LSMB] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LUB2] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LIDE] (IRQs 5 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LSID] (IRQs 5 7 9 10 *11 14 15)
ACPI: PCI Interrupt Link [LFID] (IRQs *5 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LSA2] (IRQs 5 7 9 *10 11 14 15)
ACPI: PCI Interrupt Link [APC1] (IRQs 16) *0
ACPI: PCI Interrupt Link [APC2] (IRQs 17) *0
ACPI: PCI Interrupt Link [APC3] (IRQs 18) *0, disabled.
ACPI: PCI Interrupt Link [APC4] (IRQs 19) *0, disabled.
ACPI: PCI Interrupt Link [APC5] (IRQs 16) *0, disabled.
ACPI: PCI Interrupt Link [APC6] (IRQs 16) *0
ACPI: PCI Interrupt Link [APC7] (IRQs 16) *0, disabled.
ACPI: PCI Interrupt Link [APC8] (IRQs 16) *0, disabled.
ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22 23) *0
ACPI: PCI Interrupt Link [APMU] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [AAZA] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCS] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCM] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APSI] (IRQs 20 21 22 23) *0
ACPI: PCI Interrupt Link [APSJ] (IRQs 20 21 22 23) *0
ACPI: PCI Interrupt Link [ASA2] (IRQs 20 21 22 23) *0
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp: PnP ACPI: found 12 devices
ACPI: ACPI bus type pnp unregistered
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
hpet0: at MMIO 0xfefff000, IRQs 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
ACPI: RTC can wake from S4
pnp: 00:01: ioport range 0x1000-0x107f has been reserved
pnp: 00:01: ioport range 0x1080-0x10ff has been reserved
pnp: 00:01: ioport range 0x1400-0x147f has been reserved
Time: hpet clocksource has been installed.
pnp: 00:01: ioport range 0x1480-0x14ff has been reserved
pnp: 00:01: ioport range 0x1800-0x187f has been reserved
pnp: 00:01: ioport range 0x1880-0x18ff has been reserved
pnp: 00:0a: iomem range 0xf0000000-0xf3ffffff could not be reserved
pnp: 00:0b: iomem range 0xd1800-0xd3fff has been reserved
pnp: 00:0b: iomem range 0xf0000-0xf7fff could not be reserved
pnp: 00:0b: iomem range 0xf8000-0xfbfff could not be reserved
pnp: 00:0b: iomem range 0xfc000-0xfffff could not be reserved
PCI: Bridge: 0000:00:06.0
  IO window: a000-afff
  MEM window: fde00000-fdefffff
  PREFETCH window: 80000000-800fffff
PCI: Bridge: 0000:00:0a.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0b.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0c.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0d.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0e.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0f.0
  IO window: disabled.
  MEM window: fa000000-fcffffff
  PREFETCH window: e0000000-efffffff
PCI: Setting latency timer of device 0000:00:06.0 to 64
PCI: Setting latency timer of device 0000:00:0a.0 to 64
PCI: Setting latency timer of device 0000:00:0b.0 to 64
PCI: Setting latency timer of device 0000:00:0c.0 to 64
PCI: Setting latency timer of device 0000:00:0d.0 to 64
PCI: Setting latency timer of device 0000:00:0e.0 to 64
PCI: Setting latency timer of device 0000:00:0f.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 10, 6291456 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
checking if image is initramfs... it is
Freeing initrd memory: 3938k freed
audit: initializing netlink socket (disabled)
audit(1186467404.666:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
ksign: Installing public key data
Loading keyring
- Added public key 8321A2A758C22C88
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Boot video device is 0000:07:00.0
PCI: Setting latency timer of device 0000:00:0a.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0a.0:pcie00]
Allocate Port Service[0000:00:0a.0:pcie03]
PCI: Setting latency timer of device 0000:00:0b.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0b.0:pcie00]
Allocate Port Service[0000:00:0b.0:pcie03]
PCI: Setting latency timer of device 0000:00:0c.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0c.0:pcie00]
Allocate Port Service[0000:00:0c.0:pcie03]
PCI: Setting latency timer of device 0000:00:0d.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0d.0:pcie00]
Allocate Port Service[0000:00:0d.0:pcie03]
PCI: Setting latency timer of device 0000:00:0e.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0e.0:pcie00]
Allocate Port Service[0000:00:0e.0:pcie03]
PCI: Setting latency timer of device 0000:00:0f.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0f.0:pcie00]
Allocate Port Service[0000:00:0f.0:pcie03]
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: Fan [FAN] (on)
ACPI: Thermal Zone [THRM] (40 C)
hpet_resources: 0xfefff000 is busy
Generic RTC Driver v1.07
Non-volatile memory driver v1.2
Linux agpgart interface v0.102 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
RAMDISK driver initialized: 16 RAM disks of 16384K size 4096 blocksize
input: Macintosh mouse button emulation as /class/input/input0
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard as /class/input/input1
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
TCP cubic registered
Initializing XFRM netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
powernow-k8: Found 2 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ processors (version 2.00.00)
powernow-k8: MP systems not supported by PSB BIOS structure
powernow-k8: MP systems not supported by PSB BIOS structure
drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
Freeing unused kernel memory: 312k freed
Write protecting the kernel read-only data: 1060k
USB Universal Host Controller Interface driver v3.0
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver
ACPI: PCI Interrupt Link [APCF] enabled at IRQ 23
ACPI: PCI Interrupt 0000:00:02.0[A] -> Link [APCF] -> GSI 23 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:02.0 to 64
ohci_hcd 0000:00:02.0: OHCI Host Controller
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 1
ohci_hcd 0000:00:02.0: irq 23, io mem 0xfe02f000
input: PS2++ Logitech Wheel Mouse as /class/input/input2
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 10 ports detected
ACPI: PCI Interrupt Link [APCL] enabled at IRQ 22
ACPI: PCI Interrupt 0000:00:02.1[B] -> Link [APCL] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:02.1 to 64
ehci_hcd 0000:00:02.1: EHCI Host Controller
ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 2
ehci_hcd 0000:00:02.1: debug port 1
PCI: cache line size of 64 is not supported by device 0000:00:02.1
ehci_hcd 0000:00:02.1: irq 22, io mem 0xfe02e000
ehci_hcd 0000:00:02.1: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 10 ports detected
raid5: automatically using best checksumming function: generic_sse
   generic_sse:  6160.000 MB/sec
raid5: using function: generic_sse (6160.000 MB/sec)
raid6: int64x1   1738 MB/s
raid6: int64x2   2378 MB/s
raid6: int64x4   1812 MB/s
raid6: int64x8   1800 MB/s
raid6: sse2x1    2773 MB/s
raid6: sse2x2    3714 MB/s
raid6: sse2x4    3898 MB/s
raid6: using algorithm sse2x4 (3898 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
SCSI subsystem initialized
libata version 2.21 loaded.
sata_nv 0000:00:05.0: version 3.4
ACPI: PCI Interrupt Link [APSI] enabled at IRQ 21
ACPI: PCI Interrupt 0000:00:05.0[A] -> Link [APSI] -> GSI 21 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:05.0 to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0x00000000000109f0 ctl 0x0000000000010bf2 bmdma 0x000000000001dc00 irq 21
ata2: SATA max UDMA/133 cmd 0x0000000000010970 ctl 0x0000000000010b72 bmdma 0x000000000001dc08 irq 21
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: Maxtor 6V320F0, VA111900, max UDMA/133
ata1.00: 625142448 sectors, multi 1: LBA48 NCQ (depth 0/32)
ata1.00: configured for UDMA/133
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7: Maxtor 6V320F0, VA111900, max UDMA/133
ata2.00: 625142448 sectors, multi 1: LBA48 NCQ (depth 0/32)
ata2.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access     ATA      Maxtor 6V320F0   VA11 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2
sd 0:0:0:0: [sda] Attached SCSI disk
scsi 1:0:0:0: Direct-Access     ATA      Maxtor 6V320F0   VA11 PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 1:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdb: sdb1 sdb2
sd 1:0:0:0: [sdb] Attached SCSI disk
ACPI: PCI Interrupt Link [APSJ] enabled at IRQ 20
ACPI: PCI Interrupt 0000:00:05.1[B] -> Link [APSJ] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:05.1 to 64
scsi2 : sata_nv
scsi3 : sata_nv
ata3: SATA max UDMA/133 cmd 0x00000000000109e0 ctl 0x0000000000010be2 bmdma 0x000000000001c800 irq 20
ata4: SATA max UDMA/133 cmd 0x0000000000010960 ctl 0x0000000000010b62 bmdma 0x000000000001c808 irq 20
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: ATA-7: Maxtor 6V320F0, VA111900, max UDMA/133
ata3.00: 625142448 sectors, multi 1: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/133
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: ATA-7: Maxtor 6V320F0, VA111900, max UDMA/133
ata4.00: 625142448 sectors, multi 1: LBA48 NCQ (depth 0/32)
ata4.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      Maxtor 6V320F0   VA11 PQ: 0 ANSI: 5
sd 2:0:0:0: [sdc] 625142448 512-byte hardware sectors (320073 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 2:0:0:0: [sdc] 625142448 512-byte hardware sectors (320073 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdc: sdc1 sdc2
sd 2:0:0:0: [sdc] Attached SCSI disk
scsi 3:0:0:0: Direct-Access     ATA      Maxtor 6V320F0   VA11 PQ: 0 ANSI: 5
sd 3:0:0:0: [sdd] 625142448 512-byte hardware sectors (320073 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 3:0:0:0: [sdd] 625142448 512-byte hardware sectors (320073 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd: sdd1 sdd2
sd 3:0:0:0: [sdd] Attached SCSI disk
ACPI: PCI Interrupt Link [ASA2] enabled at IRQ 23
ACPI: PCI Interrupt 0000:00:05.2[C] -> Link [ASA2] -> GSI 23 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:05.2 to 64
scsi4 : sata_nv
scsi5 : sata_nv
ata5: SATA max UDMA/133 cmd 0x000000000001c400 ctl 0x000000000001c002 bmdma 0x000000000001b400 irq 23
ata6: SATA max UDMA/133 cmd 0x000000000001bc00 ctl 0x000000000001b802 bmdma 0x000000000001b408 irq 23
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: ATAPI: PLEXTOR DVDR   PX-760A, 1.03, max UDMA/66
ata5.00: configured for UDMA/66
ata6: SATA link down (SStatus 0 SControl 300)
scsi 4:0:0:0: CD-ROM            PLEXTOR  DVDR   PX-760A   1.03 PQ: 0 ANSI: 5
pata_amd 0000:00:04.0: version 0.3.8
PCI: Setting latency timer of device 0000:00:04.0 to 64
scsi6 : pata_amd
scsi7 : pata_amd
ata7: PATA max UDMA/133 cmd 0x00000000000101f0 ctl 0x00000000000103f6 bmdma 0x000000000001f000 irq 14
ata8: PATA max UDMA/133 cmd 0x0000000000010170 ctl 0x0000000000010376 bmdma 0x000000000001f008 irq 15
ata7: port disabled. ignoring.
ata8: port disabled. ignoring.
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
md: md1 stopped.
md: bind<sdb2>
md: bind<sdc2>
md: bind<sdd2>
md: bind<sda2>
raid5: device sda2 operational as raid disk 0
raid5: device sdd2 operational as raid disk 3
raid5: device sdc2 operational as raid disk 2
raid5: device sdb2 operational as raid disk 1
raid5: allocated 4262kB for md1
raid5: raid level 5 set md1 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4
 disk 0, o:1, dev:sda2
 disk 1, o:1, dev:sdb2
 disk 2, o:1, dev:sdc2
 disk 3, o:1, dev:sdd2
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
audit(1186467414.296:2): enforcing=1 old_enforcing=0 auid=4294967295
security:  3 users, 6 roles, 1829 types, 81 bools, 1 sens, 1024 cats
security:  61 classes, 69524 rules
SELinux:  Completing initialization.
SELinux:  Setting up existing superblocks.
SELinux: initialized (dev dm-0, type ext3), uses xattr
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev debugfs, type debugfs), uses genfs_contexts
SELinux: initialized (dev selinuxfs, type selinuxfs), uses genfs_contexts
SELinux: initialized (dev mqueue, type mqueue), uses transition SIDs
SELinux: initialized (dev hugetlbfs, type hugetlbfs), uses genfs_contexts
SELinux: initialized (dev devpts, type devpts), uses transition SIDs
SELinux: initialized (dev inotifyfs, type inotifyfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev futexfs, type futexfs), uses genfs_contexts
SELinux: initialized (dev anon_inodefs, type anon_inodefs), uses genfs_contexts
SELinux: initialized (dev pipefs, type pipefs), uses task SIDs
SELinux: initialized (dev sockfs, type sockfs), uses task SIDs
SELinux: initialized (dev cpuset, type cpuset), uses genfs_contexts
SELinux: initialized (dev proc, type proc), uses genfs_contexts
SELinux: initialized (dev bdev, type bdev), uses genfs_contexts
SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1186467414.533:3): policy loaded auid=4294967295
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: Attached scsi generic sg2 type 0
sd 3:0:0:0: Attached scsi generic sg3 type 0
scsi 4:0:0:0: Attached scsi generic sg4 type 5
sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 4:0:0:0: Attached scsi CD-ROM sr0
i2c-adapter i2c-0: nForce2 SMBus adapter at 0x1c00
i2c-adapter i2c-1: nForce2 SMBus adapter at 0x1c40
forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.60.
ACPI: PCI Interrupt Link [APCH] enabled at IRQ 22
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [APCH] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:08.0 to 64
forcedeth: using HIGHDMA
rtc_cmos 00:05: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one year, y3k
eth0: forcedeth.c: subsystem: 01043:8239 bound to 0000:00:08.0
ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16
ACPI: PCI Interrupt 0000:01:06.0[A] -> Link [APC1] -> GSI 16 (level, low) -> IRQ 16
3c59x: Donald Becker and others.
0000:01:06.0: 3Com PCI 3c905C Tornado at ffffc20000330000.
ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
ACPI: PCI Interrupt 0000:01:07.0[A] -> Link [APC2] -> GSI 17 (level, low) -> IRQ 17
0000:01:07.0: 3Com PCI 3c905C Tornado at ffffc2000037a000.
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
loop: module loaded
floppy0: no floppy controllers found
lp: driver loaded but no devices found
No dock devices found.
input: Power Button (FF) as /class/input/input3
ACPI: Power Button (FF) [PWRF]
input: Power Button (CM) as /class/input/input4
ACPI: Power Button (CM) [PWRB]
md: md0 stopped.
md: bind<sdb1>
md: bind<sdc1>
md: bind<sdd1>
md: bind<sda1>
md: raid1 personality registered for level 1
raid1: raid set md0 active with 4 out of 4 mirrors
EXT3 FS on dm-0, internal journal
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-2, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3 FS on md0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev md0, type ext3), uses xattr
Adding 4194296k swap on /dev/all/swap.  Priority:-1 extents:1 across:4194296k
SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
nf_conntrack version 0.5.0 (8192 buckets, 65536 max)
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
Mobile IPv6
eth1:  setting full-duplex.
eth2:  setting full-duplex.
eth0: no IPv6 routers present
audit(1186467437.499:4): avc:  denied  { search } for  pid=2244 comm="sm-notify" scontext=system_u:system_r:rpcd_t:s0 tcontext=system_u:object_r:sysctl_fs_t:s0 tclass=dir
audit(1186467437.533:5): avc:  denied  { search } for  pid=2243 comm="rpc.statd" scontext=system_u:system_r:rpcd_t:s0 tcontext=system_u:object_r:sysctl_fs_t:s0 tclass=dir
SELinux: initialized (dev rpc_pipefs, type rpc_pipefs), uses genfs_contexts
SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
eth1: no IPv6 routers present
eth2: no IPv6 routers present


* .config

i dont have it, it was the standard fedora one.

i'm not sure that the problem is related to 3com, because i replaced those cards by old card i had in spare :

01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)

and i had the exact same problem.

Those 3com cards were working 24/24 before i went to fedora 7 (and kernel 2.6.21 then).

jb




* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  7:46                                   ` Marcin Ślusarz
@ 2007-08-07  8:23                                     ` Jarek Poplawski
       [not found]                                       ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07  8:23 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> 2007/8/6, Ingo Molnar <mingo@elte.hu>:
> > (..)
> > please try Jarek's second patch too - there was a missing unmask.
> >
> >         Ingo
> >
> > -------------->
> > Subject: genirq: fix simple and fasteoi irq handlers
> > From: Jarek Poplawski <jarkao2@o2.pl>
...
> Network card still locks up (tested on 2.6.22.1). I had to upload more
> data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> might be a coincidence...

Thanks! That's good news after all - otherwise it would be really
strange that this place doesn't hit more people (it seems there is
some safeguard elsewhere for this).

BTW: I hope Thomas' previous patch to resend.c (with Ingo's warning)
had no problems with a similar load?

So, once more, I would suspect the hw retrigger code. Ingo, IMHO, the
patch for testing HARDIRQS_SW_RESEND could be reworked so that
desc->chip->retrigger() is done only for edges and the tasklet only
for levels. BTW, I think the current warning in the "temporary" patch
is too early - we don't know whether ->retrigger() will
take place after it.
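
The split proposed above (hardware retrigger only for edge-triggered lines, the software resend tasklet only for level-triggered ones) can be sketched in plain user-space C. This is only an illustration, not the kernel's check_irq_resend(): struct irq_model, choose_resend() and demo_retrigger_ok() are hypothetical names, and returning RESEND_NONE for an edge line whose retrigger hook is missing or fails is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the real genirq types. */
enum trigger { TRIGGER_EDGE, TRIGGER_LEVEL };
enum resend_path { RESEND_NONE, RESEND_HW_RETRIGGER, RESEND_SW_TASKLET };

struct irq_model {
	enum trigger trig;                  /* how the line is triggered */
	bool pending;                       /* an irq arrived while the line was disabled */
	int (*retrigger)(unsigned int irq); /* hw hook, nonzero on success; may be NULL */
};

/* Demo hook that always reports a successful hardware retrigger. */
int demo_retrigger_ok(unsigned int irq)
{
	(void)irq;
	return 1;
}

/* Pick the resend path when a line is re-enabled. */
enum resend_path choose_resend(const struct irq_model *d, unsigned int irq)
{
	if (!d->pending)
		return RESEND_NONE;

	if (d->trig == TRIGGER_LEVEL)
		return RESEND_SW_TASKLET;   /* levels: software resend only */

	/* edges: hardware retrigger only */
	if (d->retrigger != NULL && d->retrigger(irq))
		return RESEND_HW_RETRIGGER;

	return RESEND_NONE;                 /* sketch assumption: no fallback */
}
```

With a pending edge line and demo_retrigger_ok as the hook, choose_resend() picks RESEND_HW_RETRIGGER; the same hook on a pending level line is never called and the result is RESEND_SW_TASKLET.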

Regards,
Jarek P.

PS: Marcin, if you need a break in this testing let us know!
I think the main idea of this bug is known enough.


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  8:10 Jean-Baptiste Vignaud
@ 2007-08-07  9:05 ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07  9:05 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On Tue, Aug 07, 2007 at 10:10:34AM +0200, Jean-Baptiste Vignaud wrote:
> 
> > BTW: Jean-Babtiste, could you send or point to you current configs?

Oops! I'm very sorry for misspelling!

> > I mean at least proc/interrupts, but with dmesg and .config it would
> > be even better. (I assume this last report was about the revert patch
> > mentioned by Chuck, not the one below your message?)
> 
> Sure.
> 
> Last reports are with the 2.6.22.1-41.fc7 kernel, which has in changelog :
> 
> * Sat Jul 28 2007 Chuck Ebbert <cebbert@redhat.com>
> - revert upstream "genirq: do not mask interrupts by default"
> 
> 
> * interrupts (i use irqbalance, but problem was the same without)

I wonder if you tried without SMP too?

> 
> [root@loki ~]# cat /proc/interrupts
>            CPU0       CPU1
...
>  16:      72625         96   IO-APIC-fasteoi   eth1
>  17:       4667        128   IO-APIC-fasteoi   eth2
>  20:       4156      39870   IO-APIC-fasteoi   sata_nv
>  21:      34794       9177   IO-APIC-fasteoi   sata_nv
>  22:          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
>  23:       6005       1565   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
> 2297:          3     492180   PCI-MSI-edge      eth0
> NMI:          0          0
> LOC:    4915345    4915282
> ERR:          0

So, here it's not about irq sharing...
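
In the listing above eth1 and eth2 each sit alone on IRQ 16 and 17, and only IRQ 23 carries two handlers (ohci_hcd:usb1 and sata_nv). A rough way to check this mechanically is to count the comma-separated handler names on each numbered line of /proc/interrupts. The helper below is a sketch with a hypothetical name, handlers_on_line(); it assumes handler names are separated by commas and that summary rows such as NMI, LOC and ERR start with a letter.

```c
#include <ctype.h>
#include <string.h>

/*
 * Return how many handlers are listed on one /proc/interrupts line.
 * Lines that do not start with an IRQ number (the CPU header, NMI:,
 * LOC:, ERR:) yield 0. A result above 1 means the IRQ is shared.
 */
int handlers_on_line(const char *line)
{
	const char *p;
	int n = 1;

	/* skip the leading padding in front of the IRQ number */
	while (*line == ' ' || *line == '\t')
		line++;

	if (!isdigit((unsigned char)*line))
		return 0;

	/* each comma separates one more handler name */
	for (p = strchr(line, ','); p != NULL; p = strchr(p + 1, ','))
		n++;

	return n;
}
```

Feeding it the lines of /proc/interrupts one at a time (for example via fgets()) and reporting every line with a result above 1 lists the shared IRQs.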

> 
> problems are with eth1 and eth2 here. never had any problems with the onboard (eth0).
...
> 
> * .config
> 
> i dont have it, it was the standard fedora one.
> 
> i'm not sure that the problem is related to 3com, because i replaced those cards by old card i had in spare :
> 
> 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
> 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
> 
> and i had the exact same problem.
> 
> Those 3com cards were working 24/24 before i went to fedora 7 (and kernel 2.6.21 then).

It seems that since 2.6.21 the problems mainly affect 'older' network
chips on x86_64. The reverted patch should matter only for drivers
using disable_irq, but I see forcedeth could use this too, so it's
not clear yet. And BTW, there were other changes around irqs and
pci, so everybody could be hitting something a bit different with
similar timeout logs...

BTW, Jean-Baptiste and Chuck - it seems, unless you have too much
time, there is no use for testing my "genirq: fix simple and fasteoi
irq handlers" patch.

Thanks,
Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-07  9:21 Jean-Baptiste Vignaud
  2007-08-07  9:44 ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-07  9:21 UTC (permalink / raw)
  To: jarkao2
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan


> > * interrupts (i use irqbalance, but problem was the same without)
> 
> I wonder if you tried without SMP too?

No i did not. Do you think that this can be a problem ?
To test with no SMP, do i need to recompile kernel or is there a kernel parameter ?

....

> BTW, Jean-Baptiste and Chuck - it seems, unless you have too much
> time, there is no use for testing my "genirq: fix simple and fasteoi
> irq handlers" patch.

Well i just  tested 2.6.23-rc1 with your patch and copied (using smbclient) big files :

Aug  7 11:11:53 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
Aug  7 11:11:53 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
Aug  7 11:11:53 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
Aug  7 11:11:53 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
Aug  7 11:11:53 loki kernel:   Flags; bus-master 1, dirty 93481(9) current 93481(9)
Aug  7 11:11:53 loki kernel:   Transmit list 00000000 vs. ffff81007be977a0.
Aug  7 11:11:53 loki kernel:   0: @ffff81007be97200  length 8000005f status 0001005f
Aug  7 11:11:53 loki kernel:   1: @ffff81007be972a0  length 8000005f status 0001005f
Aug  7 11:11:53 loki kernel:   2: @ffff81007be97340  length 8000005f status 0001005f
Aug  7 11:11:53 loki kernel:   3: @ffff81007be973e0  length 8000005f status 0001005f
Aug  7 11:11:53 loki kernel:   4: @ffff81007be97480  length 8000003c status 0001003c
Aug  7 11:11:53 loki kernel:   5: @ffff81007be97520  length 8000003c status 0001003c
Aug  7 11:11:53 loki kernel:   6: @ffff81007be975c0  length 8000003c status 0001003c
Aug  7 11:11:53 loki kernel:   7: @ffff81007be97660  length 8000003c status 8001003c
Aug  7 11:11:53 loki kernel:   8: @ffff81007be97700  length 8000003c status 8001003c
Aug  7 11:11:53 loki kernel:   9: @ffff81007be977a0  length 8000002a status 0001002a
Aug  7 11:11:53 loki kernel:   10: @ffff81007be97840  length 8000003a status 0001003a
Aug  7 11:11:53 loki kernel:   11: @ffff81007be978e0  length 8000005f status 0001005f
Aug  7 11:11:53 loki kernel:   12: @ffff81007be97980  length 800000be status 0c0100be
Aug  7 11:11:53 loki kernel:   13: @ffff81007be97a20  length 800000be status 0c0100be
Aug  7 11:11:53 loki kernel:   14: @ffff81007be97ac0  length 8000005f status 0001005f
Aug  7 11:11:53 loki kernel:   15: @ffff81007be97b60  length 8000005f status 0001005f

Thanks;

Jb



* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  9:21 Jean-Baptiste Vignaud
@ 2007-08-07  9:44 ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07  9:44 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On Tue, Aug 07, 2007 at 11:21:07AM +0200, Jean-Baptiste Vignaud wrote:
> 
> > > * interrupts (i use irqbalance, but problem was the same without)
> >
> > I wonder if you tried without SMP too?
> 
> No i did not. Do you think that this can be a problem ?
> To test with no SMP, do i need to recompile kernel or is there a kernel parameter ?

It's always better to exclude any complications if possible.
Yes, there is a kernel parameter for this: nosmp. So, if you
have some time to spare, I think 2.6.23-rc2 with nosmp
could be an interesting option.
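
To confirm that a test boot really ran with nosmp, the running kernel's command line can be read from /proc/cmdline and searched for the parameter as a whole word. The helper below is a minimal sketch; has_boot_param() is a hypothetical name, and it only matches bare flags such as nosmp, not name=value options.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'name' appears as a whole whitespace-separated word
 * in 'cmdline' (e.g. the contents of /proc/cmdline).
 */
bool has_boot_param(const char *cmdline, const char *name)
{
	size_t len = strlen(name);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while (*p != '\0') {
		size_t tok;

		p += strspn(p, " \t\n");    /* skip separators */
		tok = strcspn(p, " \t\n");  /* length of the next word */
		if (tok == len && strncmp(p, name, len) == 0)
			return true;
		p += tok;
	}
	return false;
}
```

Reading one line from /proc/cmdline with fgets() and passing it in with "nosmp" as the name is enough for this check.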

> ....
> 
> > BTW, Jean-Baptiste and Chuck - it seems, unless you have too much
> > time, there is no use for testing my "genirq: fix simple and fasteoi
> > irq handlers" patch.
> 
> Well i just  tested 2.6.23-rc1 with your patch and copied (using smbclient) big files :
> 
> Aug  7 11:11:53 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
> Aug  7 11:11:53 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
> Aug  7 11:11:53 loki kernel:   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> Aug  7 11:11:53 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
> Aug  7 11:11:53 loki kernel:   Flags; bus-master 1, dirty 93481(9) current 93481(9)
> Aug  7 11:11:53 loki kernel:   Transmit list 00000000 vs. ffff81007be977a0.
> Aug  7 11:11:53 loki kernel:   0: @ffff81007be97200  length 8000005f status 0001005f
...

Thanks,
Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]                                       ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
@ 2007-08-07  9:52                                         ` Jarek Poplawski
  2007-08-07 12:13                                           ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07  9:52 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> > > Network card still locks up (tested on 2.6.22.1). I had to upload more
> > > data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> > > might be a coincidence...
> >
> > Thanks! It's a good news after all - it would be really strange why
> > this place doesn't hit more people (it seems there is some safety
> > elsewhere for this).
> >
> > BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
> > (with a warning), had no problems with a similar load?
> I always tested on 500-600 MB "dataset"
> 
> > PS: Marcin, if you need a break in this testing let us know!
> No, i don't need a break. I'll have more time in next weeks.

Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
if Ingo doesn't prepare something for you.

Thanks,
Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 17:43                                   ` Chuck Ebbert
  2007-08-06 19:08                                     ` Ingo Molnar
@ 2007-08-07 10:09                                     ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07 10:09 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Ingo Molnar, Marcin Ślusarz, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Mon, Aug 06, 2007 at 01:43:48PM -0400, Chuck Ebbert wrote:
> On 08/06/2007 03:03 AM, Ingo Molnar wrote:
> > 
> > But, since level types don't need this retriggers too much I think
> > this "don't mask interrupts by default" idea should be rethinked:
> > is there enough gain to risk such hard to diagnose errors?
> >   
> > 
> 
> I reverted those masking changes in Fedora and the baffling problem
> with 3Com 3C905 network adapters went away.
> 
> Before, they would print:
> 
> eth0: transmit timed out, tx_status 00 status e601.
>   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>   Flags; bus-master 1, dirty 295757(13) current 295757(13)
>   Transmit list 00000000 vs. f7150a20.
>   0: @f7150200  length 80000070 status 0c010070
>   1: @f71502a0  length 80000070 status 0c010070
>   2: @f7150340  length 8000005c status 0c01005c
> 
> Now they just work, apparently...
> 
> So why not just revert the change?
> 

Ingo has written about such a possibility. But it would also be good
to know precisely which place is to blame. Since this diagnosing
takes time, I think Chuck is right, and maybe at least the temporary
patch for resend.c, without the warning, should be recommended for
the stable trees (2.6.21 and 2.6.22)?

Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  9:52                                         ` Jarek Poplawski
@ 2007-08-07 12:13                                           ` Jarek Poplawski
  2007-08-07 12:55                                             ` Jarek Poplawski
  2007-08-08 11:09                                             ` Marcin Ślusarz
  0 siblings, 2 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07 12:13 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
> > 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> > > > Network card still locks up (tested on 2.6.22.1). I had to upload more
> > > > data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> > > > might be a coincidence...
> > >
> > > Thanks! It's a good news after all - it would be really strange why
> > > this place doesn't hit more people (it seems there is some safety
> > > elsewhere for this).
> > >
> > > BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
> > > (with a warning), had no problems with a similar load?
> > I always tested on 500-600 MB "dataset"
> > 
> > > PS: Marcin, if you need a break in this testing let us know!
> > No, i don't need a break. I'll have more time in next weeks.
> 
> Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
> if Ingo doesn't prepare something for you.

So, let's try one more idea: a modified version of Ingo's "x86: activate
HARDIRQS_SW_RESEND" patch.
(Don't forget to run make oldconfig before make.)
For testing only.

Cheers,
Jarek P.

PS: alas there was not even time for "compile checking"...

---

diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
--- 2.6.22.1-/arch/i386/Kconfig	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/i386/Kconfig	2007-08-07 13:13:03.000000000 +0200
@@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
--- 2.6.22.1-/arch/x86_64/Kconfig	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/x86_64/Kconfig	2007-08-07 13:13:03.000000000 +0200
@@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
--- 2.6.22.1-/kernel/irq/manage.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/kernel/irq/manage.c	2007-08-07 13:13:03.000000000 +0200
@@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 
diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
--- 2.6.22.1-/kernel/irq/resend.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/kernel/irq/resend.c	2007-08-07 13:57:54.000000000 +0200
@@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
-		if (!desc->chip || !desc->chip->retrigger ||
-					!desc->chip->retrigger(irq)) {
+		if (desc->handle_irq == handle_edge_irq) {
+			if (desc->chip->retrigger)
+				desc->chip->retrigger(irq);
+			return;
+		}
 #ifdef CONFIG_HARDIRQS_SW_RESEND
-			/* Set it pending and activate the softirq: */
-			set_bit(irq, irqs_resend);
-			tasklet_schedule(&resend_tasklet);
+		WARN_ON_ONCE(1);
+		/* Set it pending and activate the softirq: */
+		set_bit(irq, irqs_resend);
+		tasklet_schedule(&resend_tasklet);
 #endif
-		}
 	}
 }
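
For anyone reading the resend.c hunk without the tree at hand, here is a
rough user-space model of the decision it makes. It is only a sketch: the
struct, the enum, the flag values and the model_ names are made up for
illustration and are not kernel code; only the branch structure follows
the hunk above.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real ones are defined in the kernel headers. */
#define IRQ_PENDING 0x1u
#define IRQ_REPLAY  0x2u

enum resend_path { RESEND_NONE, RESEND_HW_RETRIGGER, RESEND_SW_TASKLET };

struct model_irq {
	unsigned int status;  /* IRQ_PENDING / IRQ_REPLAY bits */
	bool edge;            /* stands in for handle_irq == handle_edge_irq */
	bool has_retrigger;   /* stands in for chip->retrigger != NULL */
	bool sw_resend;       /* stands in for CONFIG_HARDIRQS_SW_RESEND */
};

/* Follows the branch structure of the patched check_irq_resend(). */
enum resend_path model_check_irq_resend(struct model_irq *d)
{
	unsigned int status = d->status;

	/* Only an irq that is pending and not already replayed qualifies. */
	if ((status & (IRQ_PENDING | IRQ_REPLAY)) != IRQ_PENDING)
		return RESEND_NONE;

	d->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;

	/* Edge irqs: hardware retrigger if the chip has one, then return. */
	if (d->edge)
		return d->has_retrigger ? RESEND_HW_RETRIGGER : RESEND_NONE;

	/* Everything else: software resend via the tasklet, if compiled in. */
	return d->sw_resend ? RESEND_SW_TASKLET : RESEND_NONE;
}
```

With a non-edge irq (edge == false) this model takes the software path,
which is the place where the patch puts WARN_ON_ONCE(1).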


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:13                                           ` Jarek Poplawski
@ 2007-08-07 12:55                                             ` Jarek Poplawski
  2007-08-08 11:11                                               ` Marcin Ślusarz
  2007-08-08 11:09                                             ` Marcin Ślusarz
  1 sibling, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-07 12:55 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 02:13:39PM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
> > On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
...
> > > No, i don't need a break. I'll have more time in next weeks.
> > 
> > Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
> > if Ingo doesn't prepare something for you.
> 
> So, the let's try this idea yet: modified Ingo's "x86: activate
> HARDIRQS_SW_RESEND" patch.
> (Don't forget about make oldconfig before make.)
> For testing only.
> 
> Cheers,
> Jarek P.
> 
> PS: alas there was not even time for "compile checking"...

And here is one more patch to test the same idea (chip->retrigger()).
Let's try the i386 way! (I hope I will not be arrested for this...)
(It should be tested without any of the previous patches.)

Jarek P.

PS: as above

---

diff -Nurp 2.6.22.1-/arch/x86_64/kernel/io_apic.c 2.6.22.1/arch/x86_64/kernel/io_apic.c
--- 2.6.22.1-/arch/x86_64/kernel/io_apic.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/x86_64/kernel/io_apic.c	2007-08-07 14:37:45.000000000 +0200
@@ -1311,15 +1311,8 @@ static unsigned int startup_ioapic_irq(u
 static int ioapic_retrigger_irq(unsigned int irq)
 {
 	struct irq_cfg *cfg = &irq_cfg[irq];
-	cpumask_t mask;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vector_lock, flags);
-	cpus_clear(mask);
-	cpu_set(first_cpu(cfg->domain), mask);
 
-	send_IPI_mask(mask, cfg->vector);
-	spin_unlock_irqrestore(&vector_lock, flags);
+	send_IPI_self(cfg->vector);
 
 	return 1;
 }
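
To spell out what this hunk changes: the removed code picks the first CPU
in cfg->domain as the IPI target, while the new line sends the IPI to the
CPU that is running the retrigger. A small user-space sketch of just that
choice follows; the uint64_t mask, the -1 return for an empty mask and the
model_ names are assumptions for illustration, not the kernel's cpumask API.

```c
#include <stdint.h>

/* Index of the lowest set bit in a CPU mask, or -1 if the mask is empty. */
int model_first_cpu(uint64_t domain)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (domain & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Removed code: target the first CPU of the irq's domain. */
int model_retrigger_target_old(uint64_t domain, int this_cpu)
{
	(void)this_cpu;  /* the caller's CPU plays no role here */
	return model_first_cpu(domain);
}

/* Added code: always target the CPU that performs the retrigger. */
int model_retrigger_target_self(uint64_t domain, int this_cpu)
{
	(void)domain;    /* the irq's domain plays no role here */
	return this_cpu;
}
```

So if the domain holds only CPU 2 and enable_irq() runs on CPU 1, the old
variant targets CPU 2 and the new one targets CPU 1.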


* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-07 17:16 Jean-Baptiste Vignaud
  2007-08-08  7:21 ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-07 17:16 UTC (permalink / raw)
  To: jarkao2
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

> On Tue, Aug 07, 2007 at 11:21:07AM +0200, Jean-Baptiste Vignaud wrote:
> > 
> > > > * interrupts (i use irqbalance, but problem was the same without)
> > >
> > > I wonder if you tried without SMP too?
> > 
> > No i did not. Do you think that this can be a problem ?
> > To test with no SMP, do i need to recompile kernel or is there a kernel parameter ?
> 
> It's always better to exclude any complications if it's possible.
> Yes, there is the kernel parameter for this: nosmp. So, if you
> have some time to spare I think 2.6.23-rc2 with this nosmp
> could be an interesting option.

So this afternoon I compiled 2.6.23-rc2 with the same options as 2.6.23-rc1 and edited grub.conf to add nosmp, but after the reboot the box did not respond. Back home, I saw that the kernel had failed because it was unable to find the partitions (mdadm failed, then LVM). After a few tests, removing nosmp let the kernel boot correctly. It seems that even the Fedora-provided kernels have the same behavior (well, at least 2.6.22.1-41.fc7).

Jb



* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 17:16 Jean-Baptiste Vignaud
@ 2007-08-08  7:21 ` Jarek Poplawski
  2007-08-08  7:36   ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-08  7:21 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On Tue, Aug 07, 2007 at 07:16:33PM +0200, Jean-Baptiste Vignaud wrote:
...
> So this afternoon i compiled 2.6.23-rc2 with same options as 2.6.23-rc1
> and edited grub.conf to add nosmp but after reboot the box did not
> responded. Back home, i saw that the kernel failed because it was unable
> to find the partitions (mdadm failed, then LVM). After a few tests,
> removing nosmp let the kernel boot correctly. It seems that even the
> fedora provided kernels have the same behavior
> (well at least 2.6.22.1-41.fc7).

Sorry: it seems there is some implementation error, or some modules
don't check CONFIG_SMP enough...

Of course, testing this with SMP should be valuable too.
Only, if you find some problems, keep in mind that SMP is quite
a new and complicated technology, at least relative to such old designs
as the 3c905.

BTW: I didn't notice this yesterday, but your forcedeth uses a new type
of irq handling (MSI), which should explain why it's not affected.

Jean-Baptiste: I'm not sure how much of this testing you can afford?
If you can spare some time for this and your box isn't for
'production', it could be very valuable for diagnosing such a
reproducible bug.

Then, I'd have a few suggestions (you could choose any of them), like:
- trying these last test patches prepared for Marcin, too (but only
with kernels 2.6.21 - 2.6.23-rc1),
- trying to find the last kernel version which works for you:
Marcin has done this successfully using the most professional
way: git bisect (which btw. I did learn yet), but, IMHO, it could be
very useful to try a "poor man's" bisect over older kernels, like this:
start with 2.6.18, i.e. try again the version from the previous Fedora,
but preferably as a "vanilla" kernel (there could be some problems if
something in your configs or hardware has changed); then, if OK,
2.6.20; if OK, 2.6.21-rc1 or -rc2 (there are usually heavy changes
at the beginning of a cycle); then try to jump forward or backward
to around the middle of the remaining range, e.g. -rc4. Each time you
should use the same, current config and remember to run
'make oldconfig' before make.
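
The "poor man's" bisect above is just a binary search over an ordered list
of kernels. A minimal sketch, assuming exactly one good-to-bad transition
in the list; the version list and the works_before() stand-in (in reality:
build, boot and run your copy test) are hypothetical:

```c
#include <stddef.h>

/* Illustrative list only, ordered oldest to newest. */
const char *const candidate_versions[] = {
	"2.6.18", "2.6.19", "2.6.20", "2.6.21-rc1",
	"2.6.21-rc2", "2.6.21-rc4", "2.6.21",
};

/* Returns nonzero if the kernel at this index works. */
typedef int (*works_fn)(size_t index, void *ctx);

/* Stand-in for a real test run: everything before *ctx "works". */
int works_before(size_t index, void *ctx)
{
	return index < *(const size_t *)ctx;
}

/*
 * Requires n >= 2, with index 0 known good and index n-1 known bad.
 * Returns the index of the first bad version, testing about log2(n)
 * kernels instead of all of them.
 */
size_t first_bad_version(size_t n, works_fn works, void *ctx)
{
	size_t good = 0, bad = n - 1;

	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (works(mid, ctx))
			good = mid;  /* culprit is after mid */
		else
			bad = mid;   /* culprit is at or before mid */
	}
	return bad;
}
```

With seven candidates this needs at most three builds to narrow the
problem down to two neighbouring versions.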

In my opinion, this would be very valuable even after a long time,
so there is no need to hurry and do it now. Most importantly:
if nothing has changed in your hardware in the meantime, you
should find 'the culprit' for sure.

But if there are any problems with such testing, don't bother!
It could really be a lot of hard and maybe boring work.

If you would like to read something more about testing (then of
course my suggestions could turn out to be invalid - I'm a very bad
tester myself...) you can try this:
http://www.stardust.webpages.pl/files/handbook/

If you need some additional advice, you can mail me privately
too (but my response could take a few days). Of course, if you find
something interesting I'd be glad to know about it, but let's be
honest - I'm no authority on these drivers, so cc-ing a
maintainer should always be more useful.

Thanks,
Jarek P.

PS: it would be nice if you could fix the line wrapping in your mail
program (or try to wrap the lines manually).


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08  7:21 ` Jarek Poplawski
@ 2007-08-08  7:36   ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-08  7:36 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On Wed, Aug 08, 2007 at 09:21:14AM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 07:16:33PM +0200, Jean-Baptiste Vignaud wrote:
...
> Marcin has done this with successfully using the most professional
> way: git bisect (which btw. I did learn yet), but, IMHO, it could be
...
Let me say this slowly and distinctly: I didn't learn yet! (Shame on me!)
Sorry for these misspelings here and there...

Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-08  8:59 Jean-Baptiste Vignaud
  2007-08-08  9:30 ` Jarek Poplawski
  2007-08-08 12:16 ` Jarek Poplawski
  0 siblings, 2 replies; 68+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-08  8:59 UTC (permalink / raw)
  To: jarkao2
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

> Jean-Baptiste: I'm not sure how much of this testing you can afford?
> If you can spare some time for this and your box isn't for
> 'production' it could be very precious to diagnose such reproducible
> bug.

Well, I can continue testing patches for sure.

> Then, I'd have a few suggestions (you could choose any of them) like:
> - trying these last test patches prepared for Marcin, too (but only
> with kernels 2.6.21 - 2.6.23-rc1),

I patched 2.6.23-rc2 with those patches yesterday evening, and
launched a Samba copy.
Is rc2 OK?

This morning the network is still up :
RX bytes:279853499958 (260.6 GiB)  TX bytes:7416695531 (6.9 GiB)

Still testing.

> If you would like to read something more about testing (then of
> course my suggestions could occur invalid - I'm a very bad tester
> myself...) you can try this:
> http://www.stardust.webpages.pl/files/handbook/

I'll have a look at the document. 

Jb




* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08  8:59 2.6.20->2.6.21 - networking dies after random time Jean-Baptiste Vignaud
@ 2007-08-08  9:30 ` Jarek Poplawski
  2007-08-08 12:16 ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-08  9:30 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote:
> > Jean-Baptiste: I'm not sure how much of this testing you can afford?
> > If you can spare some time for this and your box isn't for
> > 'production' it could be very precious to diagnose such reproducible
> > bug.
> 
> Well i can continue testing patches for sure.

Great!

> 
> > Then, I'd have a few suggestions (you could choose any of them) like:
> > - trying these last test patches prepared for Marcin, too (but only
> > with kernels 2.6.21 - 2.6.23-rc1),
> 
> I'v patched 2.6.23-rc2 with those patches yesterday evening, and
> launched samba copy.
> Is rc2 ok ?

Yes! Mostly... 2.6.23-rc2 has a "temporary" patch applied, which should
work by itself (at least it works for Marcin). So, it's very good news
that it works for you too. But, as a matter of fact, the other patches
(I hope you mean yesterday's two) are probably not exercised very
much there (the last one could do some work, but with other irqs).

So, it would be interesting to try them with e.g. 2.6.23-rc1, but not
together. (A reminder: after applying such a patch, running make
oldconfig, make and so on, and testing, you can revert it with the same
command you used to patch plus the -R option, e.g.:
patch -p1 -R < ../patch1.diff, which saves some time on restoring a
'vanilla' kernel version.)
The aim of these newer patches is to find out exactly why the patch in
-rc2 works...

Cheers,
Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:13                                           ` Jarek Poplawski
  2007-08-07 12:55                                             ` Jarek Poplawski
@ 2007-08-08 11:09                                             ` Marcin Ślusarz
  2007-08-08 11:42                                               ` Jarek Poplawski
  1 sibling, 1 reply; 68+ messages in thread
From: Marcin Ślusarz @ 2007-08-08 11:09 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> So, the let's try this idea yet: modified Ingo's "x86: activate
> HARDIRQS_SW_RESEND" patch.
> (Don't forget about make oldconfig before make.)
> For testing only.
>
> Cheers,
> Jarek P.
>
> PS: alas there was not even time for "compile checking"...
>
> ---
>
> diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  config X86_SMP
>         bool
>         depends on SMP && !X86_VOYAGER
> diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  menu "Power management options"
>
>  source kernel/power/Kconfig
> diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
>                 desc->depth--;
>         }
>         spin_unlock_irqrestore(&desc->lock, flags);
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> +       /*
> +        * Do a bh disable/enable pair to trigger any pending
> +        * irq resend logic:
> +        */
> +       local_bh_disable();
> +       local_bh_enable();
> +#endif
>  }
>  EXPORT_SYMBOL(enable_irq);
>
> diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
>          */
>         desc->chip->enable(irq);
>
> +       /*
> +        * Temporary hack to figure out more about the problem, which
> +        * is causing the ancient network cards to die.
> +        */
> +
>         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
>                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
>
> -               if (!desc->chip || !desc->chip->retrigger ||
> -                                       !desc->chip->retrigger(irq)) {
> +               if (desc->handle_irq == handle_edge_irq) {
> +                       if (desc->chip->retrigger)
> +                               desc->chip->retrigger(irq);
> +                       return;
> +               }
>  #ifdef CONFIG_HARDIRQS_SW_RESEND
> -                       /* Set it pending and activate the softirq: */
> -                       set_bit(irq, irqs_resend);
> -                       tasklet_schedule(&resend_tasklet);
> +               WARN_ON_ONCE(1);
> +               /* Set it pending and activate the softirq: */
> +               set_bit(irq, irqs_resend);
> +               tasklet_schedule(&resend_tasklet);
>  #endif
> -               }
>         }
>  }
>
Works fine with:
WARNING: at kernel/irq/resend.c:79 check_irq_resend()

Call Trace:
 [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
 [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
 [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
 [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
 [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
 [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
 [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
 [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
 [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
 [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
 [<ffffffff8054f5c0>] dst_output+0x0/0x10
 [<ffffffff80552aff>] ip_output+0x19f/0x340
 [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
 [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
 [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
 [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
 [<ffffffff80577735>] inet_sendmsg+0x45/0x80
 [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
 [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
 [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
 [<ffffffff802868e9>] vfs_write+0x159/0x170
 [<ffffffff80286ef0>] sys_write+0x50/0x90
 [<ffffffff802097fe>] system_call+0x7e/0x83


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:55                                             ` Jarek Poplawski
@ 2007-08-08 11:11                                               ` Marcin Ślusarz
  0 siblings, 0 replies; 68+ messages in thread
From: Marcin Ślusarz @ 2007-08-08 11:11 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> And here is one more patch to test the same idea (chip->retrigger()).
> Let's try i386 way! (I hope I will not be arrested for this...)
> (Should be tested without any previous patches.)
>
> Jarek P.
>
> PS: as above
>
> ---
>
> diff -Nurp 2.6.22.1-/arch/x86_64/kernel/io_apic.c 2.6.22.1/arch/x86_64/kernel/io_apic.c
> --- 2.6.22.1-/arch/x86_64/kernel/io_apic.c      2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/x86_64/kernel/io_apic.c       2007-08-07 14:37:45.000000000 +0200
> @@ -1311,15 +1311,8 @@ static unsigned int startup_ioapic_irq(u
>  static int ioapic_retrigger_irq(unsigned int irq)
>  {
>         struct irq_cfg *cfg = &irq_cfg[irq];
> -       cpumask_t mask;
> -       unsigned long flags;
> -
> -       spin_lock_irqsave(&vector_lock, flags);
> -       cpus_clear(mask);
> -       cpu_set(first_cpu(cfg->domain), mask);
>
> -       send_IPI_mask(mask, cfg->vector);
> -       spin_unlock_irqrestore(&vector_lock, flags);
> +       send_IPI_self(cfg->vector);
>
>         return 1;
>  }
>
Network card timed out with this patch.

Marcin


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:09                                             ` Marcin Ślusarz
@ 2007-08-08 11:42                                               ` Jarek Poplawski
  2007-08-08 11:53                                                 ` Jarek Poplawski
  0 siblings, 1 reply; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-08 11:42 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

Read below please:

On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > So, the let's try this idea yet: modified Ingo's "x86: activate
> > HARDIRQS_SW_RESEND" patch.
> > (Don't forget about make oldconfig before make.)
> > For testing only.
> >
> > Cheers,
> > Jarek P.
> >
> > PS: alas there was not even time for "compile checking"...
> >
> > ---
> >
> > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  config X86_SMP
> >         bool
> >         depends on SMP && !X86_VOYAGER
> > diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> > --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> > @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  menu "Power management options"
> >
> >  source kernel/power/Kconfig
> > diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> > --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> > @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
> >                 desc->depth--;
> >         }
> >         spin_unlock_irqrestore(&desc->lock, flags);
> > +#ifdef CONFIG_HARDIRQS_SW_RESEND
> > +       /*
> > +        * Do a bh disable/enable pair to trigger any pending
> > +        * irq resend logic:
> > +        */
> > +       local_bh_disable();
> > +       local_bh_enable();
> > +#endif
> >  }
> >  EXPORT_SYMBOL(enable_irq);
> >
> > diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> > --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> > @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
> >          */
> >         desc->chip->enable(irq);
> >
> > +       /*
> > +        * Temporary hack to figure out more about the problem, which
> > +        * is causing the ancient network cards to die.
> > +        */
> > +
> >         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
> >                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
> >
> > -               if (!desc->chip || !desc->chip->retrigger ||
> > -                                       !desc->chip->retrigger(irq)) {
> > +               if (desc->handle_irq == handle_edge_irq) {
> > +                       if (desc->chip->retrigger)
> > +                               desc->chip->retrigger(irq);
> > +                       return;
> > +               }
> >  #ifdef CONFIG_HARDIRQS_SW_RESEND
> > -                       /* Set it pending and activate the softirq: */
> > -                       set_bit(irq, irqs_resend);
> > -                       tasklet_schedule(&resend_tasklet);
> > +               WARN_ON_ONCE(1);
> > +               /* Set it pending and activate the softirq: */
> > +               set_bit(irq, irqs_resend);
> > +               tasklet_schedule(&resend_tasklet);
> >  #endif
> > -               }
> >         }
> >  }
> >
> Works fine with:

Very nice! It's about time this kernel started to behave...

> WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> 
> Call Trace:
>  [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
>  [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
>  [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
>  [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
>  [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
>  [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
>  [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
>  [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
>  [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
>  [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
>  [<ffffffff8054f5c0>] dst_output+0x0/0x10
>  [<ffffffff80552aff>] ip_output+0x19f/0x340
>  [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
>  [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
>  [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
>  [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
>  [<ffffffff80577735>] inet_sendmsg+0x45/0x80
>  [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
>  [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
>  [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
>  [<ffffffff802868e9>] vfs_write+0x159/0x170
>  [<ffffffff80286ef0>] sys_write+0x50/0x90
>  [<ffffffff802097fe>] system_call+0x7e/0x83
> 

So, it looks like x86_64 io_apic's IPI code was unused for too long...
I hope it's a piece of cake for Ingo now...

Thanks very much, Marcin!

If it's possible for you and Jean-Baptiste, try today's patch
with -rc2, and maybe this one patch once more (with -rc1 or older).

Regards,
Jarek P. 



* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:42                                               ` Jarek Poplawski
@ 2007-08-08 11:53                                                 ` Jarek Poplawski
  0 siblings, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-08 11:53 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
...
> So, it looks like x86_64 io_apic's IPI code was unused too long...

To be fair it's x86_64 lapic's IPI code.

Jarek P.


* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08  8:59 2.6.20->2.6.21 - networking dies after random time Jean-Baptiste Vignaud
  2007-08-08  9:30 ` Jarek Poplawski
@ 2007-08-08 12:16 ` Jarek Poplawski
  1 sibling, 0 replies; 68+ messages in thread
From: Jarek Poplawski @ 2007-08-08 12:16 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: cebbert, mingo, marcin.slusarz, tglx, torvalds, linux-kernel,
	shemminger, linux-net, netdev, akpm, alan

On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote:
...
> > If you would like to read something more about testing (then of
> > course my suggestions could occur invalid - I'm a very bad tester
> > myself...) you can try this:
> > http://www.stardust.webpages.pl/files/handbook/
> 
> I'll have a look at the document.

BTW: this document describes some methods for a kind of 'professional'
testing (so you can save time if you do it very often). But you
shouldn't think all this knowledge or all these tools are necessary: you
can skip many such things and still do very valuable testing with simpler
methods. And there is a lot of simple & good advice there as well.

BTW #2: all this testing of older versions, which I've described,
of course makes sense only if, after the present patches, you still
think the older kernel worked better for you.

Jarek P. 


2007-06-26  8:08             ` Jarek Poplawski
