public inbox for linux-kernel@vger.kernel.org
* Re: eth*: transmit timed out since .27 (was: linux-2.4.27 released)
       [not found] <566B962EB122634D86E6EE29E83DD808182C3236@hdsmsx403.hd.intel.com>
@ 2004-08-16 17:52 ` Len Brown
  2004-08-16 18:44   ` eth*: transmit timed out since .27 Oliver Feiler
  0 siblings, 1 reply; 9+ messages in thread
From: Len Brown @ 2004-08-16 17:52 UTC (permalink / raw)
  To: Oliver Feiler; +Cc: Marcelo Tosatti, Marcelo Tosatti, linux-kernel

Oliver,
I'm glad that turning off "pci=noacpi" fixed your system.
I don't know why the legacy irqrouter didn't work, but
as ACPI works, I'm not going to worry about it;-)

I expect the "acpi=off" experiment would behave the same as
"pci=noacpi", but it looks like in your experiment you
mis-spelled that parameter as apci=off, so instead it was the
same as the default ACPI-enabled case.
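A mistyped parameter such as apci=off does not disable ACPI, so it is easy to miss. A minimal sketch of a check for this, assuming a small hand-picked list of known parameter names (KNOWN_PARAMS and suspicious_params are hypothetical names, not kernel or distribution tooling):

```python
import difflib

# Illustrative subset of boot parameter names; an assumption, not a full list.
KNOWN_PARAMS = ["acpi", "pci", "noapic", "nolapic"]


def suspicious_params(cmdline, known=KNOWN_PARAMS):
    """Return (name, suggestions) pairs for parameters that look like typos."""
    hits = []
    for token in cmdline.split():
        name = token.split("=", 1)[0]
        if name in known:
            continue  # spelled exactly like a known parameter
        close = difflib.get_close_matches(name, known, n=3, cutoff=0.7)
        if close:
            hits.append((name, close))
    return hits


if __name__ == "__main__":
    # Hypothetical command line containing the misspelling from this thread.
    for name, suggestions in suspicious_params("root=/dev/hda1 ro apci=off"):
        print(f"{name!r} is not a known parameter; did you mean one of {suggestions}?")
```

On a running system the string to check could be read from /proc/cmdline.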

Re: lots of interrupts on the same IRQ.
There are boot params to balance out the IRQs in PIC mode,
but what you want to do on this system is enable the IOAPIC
in your kernel config.  The existence of the MADT in your
ACPI tables suggests you may have one.  An IOAPIC will bring
additional interrupt pins to bear, usually allowing
the PCI interrupts to use IRQs > 16 where they may
not have to share so much.

cheers,
-Len




* Re: eth*: transmit timed out since .27
  2004-08-16 17:52 ` eth*: transmit timed out since .27 (was: linux-2.4.27 released) Len Brown
@ 2004-08-16 18:44   ` Oliver Feiler
  2004-08-16 19:08     ` Oliver Feiler
  2004-08-16 19:38     ` Len Brown
  0 siblings, 2 replies; 9+ messages in thread
From: Oliver Feiler @ 2004-08-16 18:44 UTC (permalink / raw)
  To: Len Brown; +Cc: Marcelo Tosatti, Marcelo Tosatti, linux-kernel

Hello Len,

Len Brown wrote:

> Oliver,
> I'm glad that turning off "pci=noacpi" fixed your system.
> I don't know why the legacy irqrouter didn't work, but
> as ACPI works, I'm not going to worry about it;-)

Well, it did work with 2.4.26, but I agree that it's better to get the 
new stuff to work correctly. ;) I just noticed that /proc/interrupts and 
/proc/pci, lspci still disagree on the IRQ of the IDE device.

            CPU0
   0:     112337    IO-APIC-edge  timer
   1:          2    IO-APIC-edge  keyboard
   8:          1    IO-APIC-edge  rtc
   9:          0   IO-APIC-level  acpi
  14:       9296    IO-APIC-edge  ide0
  15:       9078    IO-APIC-edge  ide1
  17:         24   IO-APIC-level  eth1
  18:     125085   IO-APIC-level  eth0
  21:          0   IO-APIC-level  usb-uhci, usb-uhci, usb-uhci
  22:          0   IO-APIC-level  via82cxxx
  23:       2976   IO-APIC-level  eth2
NMI:          0
LOC:     112313
ERR:          0
MIS:         42


vs.

00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C/VT8235 PIPC Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP])
         Subsystem: Unknown device 1849:0571
         Flags: bus master, medium devsel, latency 32, IRQ 255
         I/O ports at fc00 [size=16]
         Capabilities: <available only to root>

This probably has to do with this boot message:
PCI: No IRQ known for interrupt pin A of device 00:11.1

I have found absolutely nothing that explains if this is an error or 
just some sort of debug message one can ignore.

> 
> I expect the "acpi=off" experiment would behave the same as
> "pci=noacpi", but it looks like in your experiment you
> mis-spelled that parameter as apci=off, so instead it was the
> same as the default ACPI-enabled case.

Oh, thanks for noticing. Stupid me.

> 
> Re: lots of interrupts on the same IRQ.
> There are boot params to balance out the IRQs in PIC mode,
> but what you want to do on this system is enable the IOAPIC
> in your kernel config.  The existence of the MADT in your
> ACPI tables suggests you may have one.  An IOAPIC will bring
> additional interrupt pins to bear, usually allowing
> the PCI interrupts to use IRQs > 16 where they may
> not have to share so much.

Ok, I've turned on the IOAPIC and it seems to work perfectly fine. 
Except for that IRQ 255 thing I've noticed no oddities. Thanks for the 
hint. :)

cu
	Oliver



* Re: eth*: transmit timed out since .27
  2004-08-16 18:44   ` eth*: transmit timed out since .27 Oliver Feiler
@ 2004-08-16 19:08     ` Oliver Feiler
  2004-08-16 19:50       ` Len Brown
  2004-08-16 19:38     ` Len Brown
  1 sibling, 1 reply; 9+ messages in thread
From: Oliver Feiler @ 2004-08-16 19:08 UTC (permalink / raw)
  To: Len Brown; +Cc: Marcelo Tosatti, Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

Oliver Feiler wrote:
> 
> 
> Ok, I've turned on the IOAPIC and it seems to work perfectly fine. 
> Except for that IRQ 255 thing I've noticed no oddities. Thanks for the 
> hint. :)

No, not quite. After about 30 minutes of uptime and a moderate load on 
eth0 (100-200KB/s constant data flow) it happened again. :(

Aug 16 21:03:13 spot kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x97, t=36.
Aug 16 21:03:15 spot kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=141.
Aug 16 21:03:23 spot kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=545.
[repeating endlessly]
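
To track how often these timeouts occur, the log lines can be pulled apart mechanically. A minimal sketch, assuming the message format stays exactly as shown above (parse_tx_timeouts is a hypothetical helper; the pattern tolerates the line wrap a mail client may insert before "TSR="):

```python
import re

# Matches the driver message quoted in this thread, wrapped or unwrapped.
TX_TIMEOUT = re.compile(
    r"(\w+): Tx timed out, lost interrupt\?\s+"
    r"TSR=(0x[0-9a-fA-F]+), ISR=(0x[0-9a-fA-F]+), t=(\d+)\."
)

LOG = """\
Aug 16 21:03:13 spot kernel: eth0: Tx timed out, lost interrupt?
TSR=0x3, ISR=0x97, t=36.
Aug 16 21:03:15 spot kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=141.
"""


def parse_tx_timeouts(text):
    """Return one dict per timeout message with the register values as ints."""
    events = []
    for m in TX_TIMEOUT.finditer(text):
        dev, tsr, isr, t = m.groups()
        events.append({"dev": dev, "tsr": int(tsr, 16), "isr": int(isr, 16), "t": int(t)})
    return events


if __name__ == "__main__":
    for event in parse_tx_timeouts(LOG):
        print(event)
```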

I've booted a kernel without APIC and IOAPIC compiled in and it works again.

I'm attaching a dmesg from a boot with IOAPIC enabled. I don't really 
know where to look for the problem here. The interrupt counter for the 
IRQ eth0 is using (a Realtek 8029 chipset) is growing significantly 
after a while. And after a while it seems to get stuck (Tx timed out). 
"ifconfig eth0 down" and "up" again did nothing. Sometimes it seems to 
fix such network problems.

cu
	Oliver


[-- Attachment #2: dmesg-2.4.27-ioapic.gz --]
[-- Type: application/x-gzip, Size: 4878 bytes --]


* Re: eth*: transmit timed out since .27
  2004-08-16 18:44   ` eth*: transmit timed out since .27 Oliver Feiler
  2004-08-16 19:08     ` Oliver Feiler
@ 2004-08-16 19:38     ` Len Brown
  2004-08-16 20:11       ` Maciej W. Rozycki
  1 sibling, 1 reply; 9+ messages in thread
From: Len Brown @ 2004-08-16 19:38 UTC (permalink / raw)
  To: Oliver Feiler; +Cc: Marcelo Tosatti, Marcelo Tosatti, linux-kernel

On Mon, 2004-08-16 at 14:44, Oliver Feiler wrote:

>   14:       9296    IO-APIC-edge  ide0
>   15:       9078    IO-APIC-edge  ide1
>   17:         24   IO-APIC-level  eth1
>   18:     125085   IO-APIC-level  eth0
>   21:          0   IO-APIC-level  usb-uhci, usb-uhci, usb-uhci
>   22:          0   IO-APIC-level  via82cxxx
>   23:       2976   IO-APIC-level  eth2
> NMI:          0
> LOC:     112313
> ERR:          0
> MIS:         42

This is unusual.
MIS is a hardware workaround and should normally be 0.

> 
> 
> vs.
> 
> 00:11.1 IDE interface: VIA Technologies, Inc. 
> VT82C586A/B/VT82C686/A/B/VT823x/A/C/VT8235 PIPC Bus Master IDE (rev
> 06) 
> (prog-if 8a [Master SecP PriP])
>          Subsystem: Unknown device 1849:0571
>          Flags: bus master, medium devsel, latency 32, IRQ 255
>          I/O ports at fc00 [size=16]
>          Capabilities: <available only to root>
> 
> This probably has to do with this boot message:
> PCI: No IRQ known for interrupt pin A of device 00:11.1

> I have found absolutely nothing that explains if this is an error or 
> just some sort of debug message one can ignore.

Yes, ignore it.

This is where that message about 255 came from.
When ACPI failed to find a PCI-routing-table entry
for this device, it looked in PCI config space
and found the 255 you see above.  The only recent
change is that it doesn't try to use an obviously
bogus value.  But in either case, with this device
it is moot as the hardware and the driver are hard-coded.
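
The fallback described above can be sketched roughly as follows. This is an illustration only, not the kernel's code: resolve_irq, the dictionary-based routing table, and the 1-15 plausibility range are assumptions made for the example, and the config-space value 11 in the usage lines is invented.

```python
def resolve_irq(routing_table, device, pin, config_space_irq):
    """Pick an IRQ for (device, pin), or None if nothing credible is known.

    routing_table maps (device, pin) to an IRQ and stands in for the ACPI
    PCI routing table; config_space_irq is the value of the device's PCI
    Interrupt Line register.
    """
    routed = routing_table.get((device, pin))
    if routed is not None:
        return routed
    # Assumption: only legacy PIC numbers 1..15 are treated as plausible.
    # 255 (0xFF) conventionally means "unknown" or "no connection".
    if 1 <= config_space_irq <= 15:
        return config_space_irq
    return None


if __name__ == "__main__":
    table = {("00:0a.0", "A"): 18}
    print(resolve_irq(table, "00:0a.0", "A", 11))   # routed entry wins
    print(resolve_irq(table, "00:11.1", "A", 255))  # bogus value is rejected
```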

-Len




* Re: eth*: transmit timed out since .27
  2004-08-16 19:08     ` Oliver Feiler
@ 2004-08-16 19:50       ` Len Brown
  2004-08-16 23:04         ` Oliver Feiler
  0 siblings, 1 reply; 9+ messages in thread
From: Len Brown @ 2004-08-16 19:50 UTC (permalink / raw)
  To: Oliver Feiler; +Cc: Marcelo Tosatti, Marcelo Tosatti, linux-kernel

On Mon, 2004-08-16 at 15:08, Oliver Feiler wrote:
> Oliver Feiler wrote:
> > 
> > 
> > Ok, I've turned on the IOAPIC and it seems to work perfectly fine. 
> > Except for that IRQ 255 thing I've noticed no oddities. Thanks for
> the 
> > hint. :)
> 
> No, not quite. After about 30 minutes of uptime and a moderate load of
> eth0 (100-200KB/s constant data flow) it happened again. :(
> 
> Aug 16 21:03:13 spot kernel: eth0: Tx timed out, lost interrupt? 
> TSR=0x3, ISR=0x97, t=36.
> Aug 16 21:03:15 spot kernel: eth0: Tx timed out, lost interrupt? 
> TSR=0x3, ISR=0x3, t=141.
> Aug 16 21:03:23 spot kernel: eth0: Tx timed out, lost interrupt? 
> TSR=0x3, ISR=0x3, t=545.
> [repeating endlessly]
> 
> I've booted a kernel without APIC and IOAPIC compiled and it works
> again.
> 
> I'm attaching a dmesg from a boot with IOAPIC enabled. I don't really 
> know where to look for the problem here. The interrupt counter for the
> IRQ eth0 is using (a Realtek 8029 chipset) is growing significantly 
> after a while. And after a while it seems to get stuck (Tx timed out).
> "ifconfig eth0 down" and "up" again did nothing. Sometimes it seems to
> fix such network problems.

You've got 3 ethernet controllers.

eth0: RealTek RTL-8029 found at 0xe800, IRQ 18, 00:00:E8:5C:2D:AA.
eth1: SiS 900 PCI Fast Ethernet at 0xec00, IRQ 17, 00:c0:ca:16:4c:b6.
eth2: VIA VT6102 Rhine-II at 0xd400, 00:0b:6a:2b:48:84, IRQ 23.

And eth0 is failing.
See if you can give its network cable and its IRQ to one of the other
devices and see if the error follows the load and the wires,
or stays with the device.

The quirks for this hardware look totally broken in IOAPIC mode:
PCI: Via IRQ fixup for 00:10.2, from 10 to 5
PCI: Via IRQ fixup for 00:10.1, from 10 to 5
PCI: Via IRQ fixup for 00:10.0, from 11 to 5
I have no idea if they're a nop or not, but you might experiment with
disabling them.  Sure isn't obvious that something called
quirk_via_irqpic() should be running in IOAPIC mode.
I'd try disabling quirk_via_acpi() too.

cheers,
-Len

ps. to exchange IRQs, you'll need to physically exchange the slots
of the cards, easy enough unless eth0 is soldered onto the
motherboard;-)




* Re: eth*: transmit timed out since .27
  2004-08-16 19:38     ` Len Brown
@ 2004-08-16 20:11       ` Maciej W. Rozycki
  0 siblings, 0 replies; 9+ messages in thread
From: Maciej W. Rozycki @ 2004-08-16 20:11 UTC (permalink / raw)
  To: Len Brown; +Cc: Oliver Feiler, Marcelo Tosatti, Marcelo Tosatti, linux-kernel

On Mon, 16 Aug 2004, Len Brown wrote:

> > MIS:         42
> 
> This is unusual.
> MIS is a hardware workaround and should normally be 0.

 Unfortunately these events seem to be triggerable for all systems using
serial APIC interrupt delivery.  All that is needed is a sufficiently high
load on interrupts, even a transient one.  Admittedly the definition of
"sufficient" here is very high, something like at least tens of thousands of
interrupts per second.  E.g. I've been able to observe a few of them on my
system when a UDP NFS client was untarring an archive over a 100Mbps
network -- both the archive and the destination were located in an NFS
mounted filesystem and the size of the untarred data was around 300MB.  
The APIC hardware is rock-solid there -- after many years of operation I
have yet to see a single APIC error.

 One "reliable" way of triggering these events is configuring the PIT
timer interrupt input as level-triggered in the I/O APIC. ;-)  This is
actually how I did run-time testing of this code.
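
To judge whether a system gets anywhere near such a rate, two samples of the per-IRQ counters taken a known interval apart are enough. A minimal sketch, assuming the counters have already been read into dictionaries keyed by IRQ number (interrupt_rates is a hypothetical helper and the numbers in the usage lines are invented):

```python
def interrupt_rates(before, after, seconds):
    """Return interrupts per second for each IRQ present in the later sample."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    return {irq: (count - before.get(irq, 0)) / seconds
            for irq, count in after.items()}


if __name__ == "__main__":
    # Invented counter values, 5 seconds apart.
    first = {17: 24, 18: 100000}
    second = {17: 24, 18: 150000}
    for irq, rate in sorted(interrupt_rates(first, second, 5.0).items()):
        print(f"IRQ {irq}: {rate:.0f} interrupts/s")
```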

  Maciej


* Re: eth*: transmit timed out since .27
  2004-08-16 19:50       ` Len Brown
@ 2004-08-16 23:04         ` Oliver Feiler
  2004-08-16 23:42           ` Maciej W. Rozycki
  2004-08-17  0:29           ` Alan Cox
  0 siblings, 2 replies; 9+ messages in thread
From: Oliver Feiler @ 2004-08-16 23:04 UTC (permalink / raw)
  To: Len Brown; +Cc: Marcelo Tosatti, Marcelo Tosatti, linux-kernel

Hi Len,

Len Brown wrote:
> 
> 
> You've got 3 ethernet controllers.
> 
> eth0: RealTek RTL-8029 found at 0xe800, IRQ 18, 00:00:E8:5C:2D:AA.
> eth1: SiS 900 PCI Fast Ethernet at 0xec00, IRQ 17, 00:c0:ca:16:4c:b6.
> eth2: VIA VT6102 Rhine-II at 0xd400, 00:0b:6a:2b:48:84, IRQ 23.

Correct.

> 
> And eth0 is failing.
> See if you can give its network cable and its IRQ to one of the other
> devices and see if the error follows the load and the wires,
> or stays with the device.

Doing that is a bit problematic. eth0 is a 10mbit NIC, eth1 and eth2 
must be 100mbit unfortunately. I can move around (two of) the NICs in 
the PCI slots however. The box is headless and a bit uncomfortable to 
work with, so I'd like to try software solutions first.

> 
> The quirks for this hardware look totally broken in IOAPIC mode:
> PCI: Via IRQ fixup for 00:10.2, from 10 to 5
> PCI: Via IRQ fixup for 00:10.1, from 10 to 5
> PCI: Via IRQ fixup for 00:10.0, from 11 to 5
> I have no idea if they're a nop or not, but you might experiment with
> disabling them.  Sure isn't obvious that something called
> quirk_via_irqpic() should be running in IOAPIC mode.
> I'd try disabling quirk_via_acpi() too.

Ok, I've removed the quirks from quirks.c, compiled and rebooted. I hope 
I have done it right, I commented out these lines in quirks.c:

//      { PCI_FIXUP_HEADER,     PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C586_3,     quirk_via_acpi },
//      { PCI_FIXUP_HEADER,     PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C686_4,     quirk_via_acpi },
//      { PCI_FIXUP_FINAL,      PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C586_2,     quirk_via_irqpic },
//      { PCI_FIXUP_FINAL,      PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C686_5,     quirk_via_irqpic },
//      { PCI_FIXUP_FINAL,      PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C686_6,     quirk_via_irqpic },

The "Via IRQ fixup for dev:..." are gone from the boot messages. After 
transferring about 250 MB over eth0 the "Tx timed out" error recurred.

/proc/interrupts looked like this:

            CPU0
   0:     191473    IO-APIC-edge  timer
   1:       1244    IO-APIC-edge  keyboard
   8:          1    IO-APIC-edge  rtc
   9:          0   IO-APIC-level  acpi
  14:      33547    IO-APIC-edge  ide0
  15:      23121    IO-APIC-edge  ide1
  17:       5699   IO-APIC-level  eth1
  18:     234589   IO-APIC-level  eth0
  21:          0   IO-APIC-level  usb-uhci, usb-uhci, usb-uhci
  22:          0   IO-APIC-level  via82cxxx
  23:     240873   IO-APIC-level  eth2
NMI:          0
LOC:     191481
ERR:          0
MIS:          8

What exactly is MIS? Something like "interrupt occurred, but I have no 
idea what device caused it"? I don't know much about it, but it's always 
greater than 0 when the problem happens.
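
For watching the MIS counter alongside the per-IRQ counts, the listing above can be parsed with a few lines of code. A minimal sketch, assuming the single-CPU layout shown in this thread (parse_interrupts is a hypothetical helper; multi-CPU listings have one count column per CPU and would need more work):

```python
SAMPLE = """\
           CPU0
  18:     234589   IO-APIC-level  eth0
  21:          0   IO-APIC-level  usb-uhci, usb-uhci, usb-uhci
NMI:          0
MIS:          8
"""


def parse_interrupts(text):
    """Split single-CPU /proc/interrupts text into per-IRQ and summary counts.

    Returns (irqs, summary): irqs maps IRQ number to (count, type, devices),
    summary maps labels such as "MIS" to their count.
    """
    irqs, summary = {}, {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split(None, 2)
        if not sep or not fields:
            continue  # header line such as "CPU0", or an empty line
        label = label.strip()
        count = int(fields[0])
        if label.isdigit():
            kind = fields[1] if len(fields) > 1 else ""
            devices = fields[2].strip() if len(fields) > 2 else ""
            irqs[int(label)] = (count, kind, devices)
        else:
            summary[label] = count
    return irqs, summary


if __name__ == "__main__":
    irqs, summary = parse_interrupts(SAMPLE)
    print("eth0 IRQ 18 count:", irqs[18][0])
    print("MIS:", summary["MIS"])
```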

> 
> cheers,
> -Len
> 
> ps. to exchange IRQs, you'll need to physically exchange the slots
> of the cards, easy enough unless eth0 is soldered onto the
> motherboard;-)

Fortunately only eth2 (the VIA Rhine-II) is soldered onto the board. :)

I'll try reordering the NICs in the PCI slots. The system is used most 
of the time though, so I can't take it apart and test things all the 
time. I wonder if it makes sense to experiment with the IOAPIC further. 
Maybe the hardware is just plain broken? Or might there be a slight 
chance to get this to work the way it's intended to?

Btw, I don't know if I've ever mentioned it, it's an Asrock K7VM4 board. 
lspci output is here if it might be of interest:

kiza@spot:~> lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8378 [KM400] Chipset Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI Bridge
00:09.0 Ethernet controller: Silicon Integrated Systems [SiS] SiS900 PCI Fast Ethernet (rev 02)
00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
00:10.0 USB Controller: VIA Technologies, Inc. VT6202 [USB 2.0 controller] (rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT6202 [USB 2.0 controller] (rev 80)
00:10.2 USB Controller: VIA Technologies, Inc. VT6202 [USB 2.0 controller] (rev 80)
00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C/VT8235 PIPC Bus Master IDE (rev 06)
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
01:00.0 VGA compatible controller: VIA Technologies, Inc. VT8378 [S3 UniChrome] Integrated Video (rev 01)

Thanks for your help with this. :)

Oliver



* Re: eth*: transmit timed out since .27
  2004-08-16 23:04         ` Oliver Feiler
@ 2004-08-16 23:42           ` Maciej W. Rozycki
  2004-08-17  0:29           ` Alan Cox
  1 sibling, 0 replies; 9+ messages in thread
From: Maciej W. Rozycki @ 2004-08-16 23:42 UTC (permalink / raw)
  To: Oliver Feiler; +Cc: Len Brown, Marcelo Tosatti, Marcelo Tosatti, linux-kernel

On Tue, 17 Aug 2004, Oliver Feiler wrote:

> MIS:          8
> 
> What exactly is MIS? Something like "interrupt occurred, but I have no 
> idea what device caused it"? I don't know much about it, but it's always 
>  >0 when the problem happens.

 It's a trigger mode MISmatch.  It only happens for level-triggered
interrupts and the problem is they get recorded as edge-triggered ones in
the receiving local APIC.  The two interrupt trigger modes require the
hardware to perform different actions when the software interrupt handler
concludes and such a mismatch would lead to a lock-up of the affected
line.  Specifically, the local APIC involved sends an End Of Interrupt
(EOI) message to the originating I/O APIC for level-triggered interrupts
and for edge-triggered interrupts nothing is sent.  Fortunately just
before sending the final ACK to the hardware at the conclusion of the
handler we can detect that the trigger mode recorded by the local APIC
disagrees with the setup of the corresponding I/O APIC line and if that
happens we execute an (expensive) unlock action at the I/O APIC so that it
resets its logic for the input as if it received an EOI message from a
local APIC for a level-triggered interrupt.
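
The sequence above can be modelled in a few lines. This is a toy simulation, not kernel code: the class names, the in_service flag standing in for the I/O APIC's per-line state, and the workaround switch are all assumptions made for illustration.

```python
class IoApicLine:
    """One level-triggered I/O APIC input, reduced to a single flag."""

    def __init__(self):
        self.in_service = False  # set on delivery, cleared by an EOI message


class ApicModel:
    def __init__(self, workaround=True):
        self.line = IoApicLine()
        self.workaround = workaround
        self.mis_count = 0

    def handle_interrupt(self, recorded_as_level):
        """Deliver one interrupt and finish its handler.

        recorded_as_level says how the local APIC recorded the trigger mode.
        Returns True if the line can deliver again afterwards.
        """
        self.line.in_service = True
        # ... the device's interrupt handler would run here ...
        if recorded_as_level:
            self.line.in_service = False   # local APIC sends the EOI message
        elif self.workaround:
            self.mis_count += 1            # mismatch detected before the final ACK
            self.line.in_service = False   # stands in for the expensive unlock
        # Without the workaround no EOI is ever sent and the line stays locked.
        return not self.line.in_service


if __name__ == "__main__":
    model = ApicModel()
    print(model.handle_interrupt(recorded_as_level=True), model.mis_count)
    print(model.handle_interrupt(recorded_as_level=False), model.mis_count)
    print(ApicModel(workaround=False).handle_interrupt(recorded_as_level=False))
```

In this model a nonzero mis_count means the mismatch occurred and was recovered from, which matches the description of MIS as a counter for a hardware workaround.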

  Maciej


* Re: eth*: transmit timed out since .27
  2004-08-16 23:04         ` Oliver Feiler
  2004-08-16 23:42           ` Maciej W. Rozycki
@ 2004-08-17  0:29           ` Alan Cox
  1 sibling, 0 replies; 9+ messages in thread
From: Alan Cox @ 2004-08-17  0:29 UTC (permalink / raw)
  To: Oliver Feiler
  Cc: Len Brown, Marcelo Tosatti, Marcelo Tosatti,
	Linux Kernel Mailing List

Looking over the docs the whole ACPI and IOAPIC mode for these boards
seems very different and quite "magic" compared to the PCI mode which is
merely "odd" in a few places. APIC routing bits are stuffed into strange
chipset specific places which implies the quirks probably shouldn't be
applied in acpi mode.


