* The buggy APIC of the Abit BP6
@ 2002-06-12 22:33 Robbert Kouprie
2002-06-13 9:05 ` Helge Hafting
2002-06-14 15:07 ` Raphael Manfredi
0 siblings, 2 replies; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-12 22:33 UTC (permalink / raw)
To: linux-kernel
Hi all,
First of all, I know the Abit BP6 is infamous for its APIC, but I
would like to make sure there's absolutely no solution for this except
disabling the APIC.
I have been experiencing problems for a long time now, which are always
related to the NIC in the box (probably because it is a device that
generates a lot of interrupts). The NIC has changed a couple of times
(from 3com 10 Mbit to Intel eepro100 to 3Com PCI 3c905B Cyclone
100baseTx now), and it's NOT placed in the infamous (I believe 3rd) PCI
slot of the board (mentioned in the manual). Also, /proc/interrupts
shows NO sharing with another device. The running kernel is
2.4.19-pre8-ac5 SMP, though many kernels have preceded it, with the
same results.
The problems appear once in a while (on the order of days/weeks). They
are always preceded by an "unexpected IRQ trap at vector 7d", and then
followed within a minute by chaos in the network driver. I found the
message of the 3com driver to be the clearest one, see the snippet
below. When I boot with "noapic", the problems go away.
Is there a solution that does not require disabling the APIC as a whole,
or is this hardware just too flaky?
Thanks in advance,
- Robbert Kouprie
PS. Please CC me in answers, as I'm not on the list.
Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
Jun 12 23:48:29 radium kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 12 23:48:29 radium kernel: eth0: transmit timed out, tx_status 00 status e681.
Jun 12 23:48:29 radium kernel: diagnostics: net 0cf2 media 8880 dma 0000003a fifo 8800
Jun 12 23:48:29 radium kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Jun 12 23:48:29 radium kernel: Flags; bus-master 1, dirty 16264012(12) current 16264012(12)
Jun 12 23:48:29 radium kernel: Transmit list 00000000 vs. c133e500.
Jun 12 23:48:29 radium kernel: 0: @c133e200 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 1: @c133e240 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 2: @c133e280 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 3: @c133e2c0 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 4: @c133e300 length 8000001e status 0001001e
Jun 12 23:48:29 radium kernel: 5: @c133e340 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 6: @c133e380 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 7: @c133e3c0 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 8: @c133e400 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 9: @c133e440 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 10: @c133e480 length 80000045 status 80010045
Jun 12 23:48:29 radium kernel: 11: @c133e4c0 length 800005cc status 800105cc
Jun 12 23:48:29 radium kernel: 12: @c133e500 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: 13: @c133e540 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: 14: @c133e580 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: 15: @c133e5c0 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: eth0: Resetting the Tx ring pointer.
Jun 12 23:48:29 radium kernel: NETDEV WATCHDOG: eth0: transmit timed out
* Re: The buggy APIC of the Abit BP6
2002-06-12 22:33 Robbert Kouprie
@ 2002-06-13 9:05 ` Helge Hafting
2002-06-13 13:30 ` Robbert Kouprie
2002-06-14 15:07 ` Raphael Manfredi
1 sibling, 1 reply; 21+ messages in thread
From: Helge Hafting @ 2002-06-13 9:05 UTC (permalink / raw)
To: Robbert Kouprie, linux-kernel; +Cc: helge.hafting
Robbert Kouprie wrote:
>
> Hi all,
>
> First of all, I know the Abit BP6 is infamous for its APIC, but I
> would like to make sure there's absolutely no solution for this except
> disabling the APIC.
>
> I have been experiencing problems for a long time now, which are always
> related to the NIC in the box (probably because it is a device that
> generates a lot of interrupts). The NIC has changed a couple of times
> (from 3com 10 Mbit to Intel eepro100 to 3Com PCI 3c905B Cyclone
> 100baseTx now), and it's NOT placed in the infamous (I believe 3rd) PCI
> slot of the board (mentioned in the manual). Also, /proc/interrupts
> shows NO sharing with another device. The running kernel is
> 2.4.19-pre8-ac5 SMP, though many kernels have preceded it, with the
> same results.
>
> The problems appear once in a while (on the order of days/weeks). They
> are always preceded by an "unexpected IRQ trap at vector 7d", and then
> followed within a minute by chaos in the network driver. I found the
> message of the 3com driver to be the clearest one, see the snippet
> below. When I boot with "noapic", the problems go away.
>
> Is there a solution that does not require disabling the APIC as a whole,
> or is this hardware just too flaky?
>
> Thanks in advance,
> - Robbert Kouprie
>
> PS. Please CC me in answers, as I'm not on the list.
>
> Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
> Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
It _can_ be solved - rebooting cures it, so assuming the problem
is autodetectable it _can_ be solved by doing whatever it is
a reboot (or driver reload) does to the APIC.
My guess is that the APIC setup for that IRQ has to be reprogrammed.
You could do that as a quirk for the BP6.
The first question is whether there is a reliable way to detect this
condition. "No interrupts from a device" could simply mean that
it isn't used much at the time. You get an unexpected IRQ trap - does
the problem always manifest itself this way?
The second question is if all the PCI card drivers out there
survive a lost interrupt handled outside the driver.
If not, you have to close+reopen the device, and that involves
userspace.
A network card will need reinitialization, a disk controller
remounting...
Helge Hafting
* RE: The buggy APIC of the Abit BP6
2002-06-13 9:05 ` Helge Hafting
@ 2002-06-13 13:30 ` Robbert Kouprie
2002-06-14 10:54 ` Helge Hafting
0 siblings, 1 reply; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-13 13:30 UTC (permalink / raw)
To: 'Helge Hafting'; +Cc: linux-kernel
Helge Hafting wrote:
> > Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
> > Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
>
> It _can_ be solved - rebooting cures it, so assuming the problem
> is autodetectable it _can_ be solved by doing whatever it is
> a reboot (or driver reload) does to the APIC.
True.
> My guess is that the APIC setup for that IRQ has to be reprogrammed.
> You could do that as a quirk for the BP6.
> The first question is whether there is a reliable way to detect this
> condition. "No interrupts from a device" could simply mean that
> it isn't used much at the time. You get an unexpected IRQ trap - does
> the problem always manifest itself this way?
Yes, I always get the "unexpected IRQ trap at vector 7d" message. This
is the same message even with different NICs (though they were placed in
the same PCI slot). About 30-120 seconds after this message (depending
on some driver timeout value I guess) the NETDEV watchdog kicks in with
an "eth0: transmit timed out".
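The delay between the trap message and the watchdog report follows from the check that the net scheduler runs on a timer; its condition is visible in the sch_generic.c hunk of Raphael Manfredi's patch in this thread. A minimal user-space sketch of that condition, where struct fake_netdev and tx_timed_out() are hypothetical stand-ins and not kernel names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few net_device fields the check reads. */
struct fake_netdev {
    unsigned long trans_start;    /* tick at which the last transmit started */
    unsigned long watchdog_timeo; /* driver-chosen timeout, in ticks */
    bool queue_stopped;           /* true while the driver has stopped its queue */
};

/*
 * Mirrors the shape of the watchdog test: the queue must be stopped and
 * more than watchdog_timeo ticks must have passed since the last transmit.
 * The unsigned subtraction stays correct when the tick counter wraps.
 */
bool tx_timed_out(const struct fake_netdev *dev, unsigned long now)
{
    assert(dev != NULL);
    return dev->queue_stopped &&
           (now - dev->trans_start) > dev->watchdog_timeo;
}
```

With trans_start = 100 and watchdog_timeo = 50, the check first fires at tick 151. A driver with a longer timeout, or a timer that happens to run late, would explain why the gap varies.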
> The second question is if all the PCI card drivers out there
> survive a lost interrupt handled outside the driver.
> If not, you have to close+reopen the device, and that involves
> userspace.
> A network card will need reinitialization, a disk controller
> remounting...
That could indeed be a problem. But this will become clear pretty soon
once this APIC reprogramming workaround is actually implemented in the
kernel. Then I will be able to test that. Any idea what this workaround
in the kernel would look like?
Thanks for the help,
- Robbert Kouprie
* Re: The buggy APIC of the Abit BP6
2002-06-13 13:30 ` Robbert Kouprie
@ 2002-06-14 10:54 ` Helge Hafting
2002-06-18 15:30 ` Robbert Kouprie
0 siblings, 1 reply; 21+ messages in thread
From: Helge Hafting @ 2002-06-14 10:54 UTC (permalink / raw)
To: Robbert Kouprie, linux-kernel
Robbert Kouprie wrote:
> That could indeed be a problem. But this will become clear pretty soon
> once this APIC reprogramming workaround is actually implemented in the
> kernel. Then I will be able to test that. Any idea what this workaround
> in the kernel would look like?
Not much. Take a look at what happens in the kernel
when a PCI device driver allocates an IRQ, and what happens
when it releases it.
What you have to do is probably to release the (broken) IRQ
without disturbing the driver's internal data. Then
claim it again immediately on behalf of the driver. You
have now treated the APIC the same way a close/open does.
No interrupt from that device should happen in the middle
of this - but you should be fine as the IRQ supposedly is dead.
And this is something you'll have to do wherever the error
is detected, i.e. near the code that prints that message.
Helge Hafting
* Re: The buggy APIC of the Abit BP6
2002-06-12 22:33 Robbert Kouprie
2002-06-13 9:05 ` Helge Hafting
@ 2002-06-14 15:07 ` Raphael Manfredi
1 sibling, 0 replies; 21+ messages in thread
From: Raphael Manfredi @ 2002-06-14 15:07 UTC (permalink / raw)
To: linux-kernel
Quoting Robbert Kouprie <robbert@radium.jvb.tudelft.nl> from ml.linux.kernel:
:I have been experiencing problems for a long time now, which are always
:related to the NIC in the box (probably because it is a device that
:generates a lot of interrupts). The NIC has changed a couple of times
:(from 3com 10 Mbit to Intel eepro100 to 3Com PCI 3c905B Cyclone
:100baseTx now), and it's NOT placed in the infamous (I believe 3rd) PCI
:slot of the board (mentioned in the manual). Also, /proc/interrupts
:shows NO sharing with another device. The running kernel is
:2.4.19-pre8-ac5 SMP, though many kernels have preceded it, with the
:same results.
Here's my own solution for it, in an old article. I've been running
with this patch since then, and transmit timeouts have never been a problem.
I run 2.4.18-pre7 nowadays, and the patch below applied without problem.
Raphael
------------------------------------------------------------------------
From: Raphael Manfredi <Raphael_Manfredi@pobox.com>
To: linux-kernel@vger.kernel.org
Subject: [PATCH] Recover from lockups after "eth0: transmit timed out"
Date: Thu, 08 Nov 2001 19:42:05 +0100
Message-ID: <30043.1005244925@nice.ram.loc>
This is my second take at the fix. I've tested it on my ABIT BP6 with
linux 2.4.12-ac3, but it should apply fine on more recent versions.
I've verified that I indeed recovered from a timeout situation where
I had to reboot before.
The fix assumes that the "NETDEV WATCHDOG" will only run when there is
an APIC, so it's OK to call apic routines. If this assumption is wrong,
then could someone more knowledgeable than me protect the call correctly
so we don't address missing hardware?
This fix is driver-independent, contrary to my first fix. It's also
shorter, as it re-uses existing macros in io_apic.c instead of expanding
them.
Please apply, and if rejected, let me know why.
Raphael
--- linux-2.4.12-ac3/arch/i386/kernel/io_apic.c.orig Mon Oct 29 19:34:42 2001
+++ linux-2.4.12-ac3/arch/i386/kernel/io_apic.c Sun Nov 4 15:53:05 2001
@@ -1616,3 +1616,35 @@
 	check_timer();
 	print_IO_APIC();
 }
+
+/*
+ * The purpose of this routine is to recover from hopeless situations,
+ * where the IO-APIC level interrupt no longer happens, despite the use
+ * of end_level_ioapic_irq().
+ *
+ * This happens mainly with Ethernet cards under heavy network traffic,
+ * on boxes with streams of APIC errors. The visible symptom is a message:
+ *
+ * NETDEV WATCHDOG: eth0: transmit timed out
+ *
+ * At this point, a driver-specific TX timeout routine is called. Upon
+ * return, the watchdog calls:
+ *
+ * kick_IO_APIC_irq(dev->irq)
+ *
+ * to re-enable the interrupt source, or the machine may be stuck without
+ * network, until rebooted.
+ *
+ * Idea was suggested by Manfred Spraul, implemented by Raphael Manfredi.
+ */
+void kick_IO_APIC_irq(int irq)
+{
+	printk(KERN_CRIT "Kicking IO-APIC IRQ %d:\n", irq);
+
+	spin_lock(&ioapic_lock);
+	__mask_and_edge_IO_APIC_irq(irq);
+	udelay(10);
+	__unmask_and_level_IO_APIC_irq(irq);
+	spin_unlock(&ioapic_lock);
+}
+
--- linux-2.4.12-ac3/net/sched/sch_generic.c.orig Sun Nov 4 15:47:10 2001
+++ linux-2.4.12-ac3/net/sched/sch_generic.c Sun Nov 4 15:51:14 2001
@@ -153,6 +153,7 @@
 			    (jiffies - dev->trans_start) > dev->watchdog_timeo) {
 				printk(KERN_INFO "NETDEV WATCHDOG: %s: transmit timed out\n", dev->name);
 				dev->tx_timeout(dev);
+				kick_IO_APIC_irq(dev->irq);	/* Added by RAM */
 			}
 			if (!mod_timer(&dev->watchdog_timer, jiffies + dev->watchdog_timeo))
 				dev_hold(dev);
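As a rough illustration of what the two helpers called by kick_IO_APIC_irq() do, the sketch below applies the same kind of bit changes to a plain 32-bit value standing in for the low word of an IO-APIC redirection entry. It assumes the 82093AA layout (bit 16 masks the interrupt, bit 15 selects level trigger). The macro and function names are made up for this example; the real helpers go through the IO-APIC register window under ioapic_lock.

```c
#include <assert.h>
#include <stdint.h>

#define REDIR_MASKED 0x00010000u /* bit 16: interrupt masked (assumed layout) */
#define REDIR_LEVEL  0x00008000u /* bit 15: level-triggered (assumed layout) */

/* Mask the entry and switch it to edge trigger. */
uint32_t redir_mask_and_edge(uint32_t reg)
{
    return (reg & ~REDIR_LEVEL) | REDIR_MASKED;
}

/* Unmask the entry and switch it back to level trigger. */
uint32_t redir_unmask_and_level(uint32_t reg)
{
    return (reg & ~REDIR_MASKED) | REDIR_LEVEL;
}

/* The "kick": park the entry as masked/edge, then restore unmasked/level. */
uint32_t redir_kick(uint32_t reg)
{
    uint32_t parked = redir_mask_and_edge(reg);

    assert((parked & REDIR_MASKED) && !(parked & REDIR_LEVEL));
    return redir_unmask_and_level(parked);
}
```

All other bits of the entry, such as the vector in the low byte, pass through unchanged. Whatever effect the sequence has on the stuck interrupt comes from the hardware seeing the trigger mode toggle, which this sketch does not model.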
* Re: The buggy APIC of the Abit BP6
@ 2002-06-14 16:49 Robbert Kouprie
2002-06-14 18:41 ` Raphael Manfredi
0 siblings, 1 reply; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-14 16:49 UTC (permalink / raw)
To: 'Raphael Manfredi'; +Cc: linux-kernel
Raphael Manfredi wrote:
> Here's my own solution for it, in an old article. I've been running
> with this patch since then, and transmit timeouts have never been a
> problem.
>
> I run 2.4.18-pre7 nowadays, and the patch below applied without
> problem.
Thanks very much! This looks very promising. I just patched
2.4.19pre10-ac2 with it and booted it up on my BP6. I will report back
any failure or success of APIC kicking ;)
BTW, did you get any explanation why this wasn't applied in -ac or main
kernel?
- Robbert Kouprie
* Re: The buggy APIC of the Abit BP6
2002-06-14 16:49 The buggy APIC of the Abit BP6 Robbert Kouprie
@ 2002-06-14 18:41 ` Raphael Manfredi
2002-06-18 7:33 ` Helge Hafting
0 siblings, 1 reply; 21+ messages in thread
From: Raphael Manfredi @ 2002-06-14 18:41 UTC (permalink / raw)
To: linux-kernel
Quoting Robbert Kouprie <robbert@radium.jvb.tudelft.nl> from ml.linux.kernel:
:BTW, did you get any explanation why this wasn't applied in -ac or main
:kernel?
None.
But I know that this patch is dirty because it attacks a hardware-dependent
layer from a rather generic one. This may be why it's rejected. And it
may also be completely APIC-BP6 specific.
I also know that it works for me. ;-)
Raphael
* Re: The buggy APIC of the Abit BP6
2002-06-14 18:41 ` Raphael Manfredi
@ 2002-06-18 7:33 ` Helge Hafting
0 siblings, 0 replies; 21+ messages in thread
From: Helge Hafting @ 2002-06-18 7:33 UTC (permalink / raw)
To: Raphael Manfredi, linux-kernel
Raphael Manfredi wrote:
>
> Quoting Robbert Kouprie <robbert@radium.jvb.tudelft.nl> from ml.linux.kernel:
> :BTW, did you get any explanation why this wasn't applied in -ac or main
> :kernel?
>
> None.
>
> But I know that this patch is dirty because it attacks a hardware-dependent
> layer from a rather generic one. This may be why it's rejected. And it
> may also be completely APIC-BP6 specific.
>
> I also know that it works for me. ;-)
I'll try it. Have you considered resubmitting the patch,
hidden behind a CONFIG_BROKEN_APIC? That'll keep the code
clean for those with better hardware.
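One way such a guard could look is sketched here as plain user-space C. CONFIG_BROKEN_APIC is only the name proposed in this message and not an existing kernel option, on_tx_timeout() is a hypothetical stand-in for the watchdog path, and the kick returns an int purely so the sketch can be observed (the function in the patch returns void).

```c
#include <assert.h>
#include <stdio.h>

/*
 * CONFIG_BROKEN_APIC is the option name proposed in this thread; it is not
 * an existing kernel option. Build with -DCONFIG_BROKEN_APIC to enable it.
 */
#ifdef CONFIG_BROKEN_APIC
/* Stand-in for the real kick; reports that a kick happened. */
static int kick_IO_APIC_irq(int irq)
{
    printf("Kicking IO-APIC IRQ %d\n", irq);
    return 1;
}
#else
/* Option off: the call compiles down to a no-op for everyone else. */
static inline int kick_IO_APIC_irq(int irq)
{
    (void)irq;
    return 0;
}
#endif

/* Hypothetical stand-in for the watchdog's timeout path. */
int on_tx_timeout(int irq)
{
    assert(irq >= 0);
    return kick_IO_APIC_irq(irq);
}
```

Keeping the #ifdef around the definition rather than around each call site leaves the generic caller free of conditionals, which may address part of the "dirty" concern raised earlier.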
Helge Hafting
* Re: The buggy APIC of the Abit BP6
@ 2002-06-18 9:53 Robbert Kouprie
0 siblings, 0 replies; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-18 9:53 UTC (permalink / raw)
To: 'Helge Hafting'; +Cc: 'Raphael Manfredi', linux-kernel
Helge Hafting wrote:
> I'll try it. Have you considered resubmitting the patch,
> hidden behind a CONFIG_BROKEN_APIC? That'll keep the code
> clean for those with better hardware.
We might as well move the kick_IO_APIC_irq call to the
arch/i386/kernel/irq.c:ack_none function then, surrounded by proper
#ifdefs. ack_none is the function that does the printk("unexpected
IRQ trap at vector %02x\n", irq), which I see every time the bug
triggers.
And looking at the comment of the end_level_ioapic_irq function in
io_apic.c, is there a possibility to replace the kick_IO_APIC_irq call
entirely with an end_level_ioapic_irq call? I see lots of similarities
between these two functions.
I didn't test this yet, as I'm still running on Raphael's patch, waiting
for the bug to trigger. (Anyone got a reliable way of triggering it?)
Regards,
- Robbert Kouprie
* RE: The buggy APIC of the Abit BP6
2002-06-18 15:30 ` Robbert Kouprie
@ 2002-06-18 15:17 ` Zwane Mwaikambo
2002-06-19 12:47 ` Maciej W. Rozycki
1 sibling, 0 replies; 21+ messages in thread
From: Zwane Mwaikambo @ 2002-06-18 15:17 UTC (permalink / raw)
To: Robbert Kouprie
Cc: 'Raphael Manfredi', 'Helge Hafting', linux-kernel
On Tue, 18 Jun 2002, Robbert Kouprie wrote:
> Raphael Manfredi wrote:
>
> I just triggered the bug using a couple of simultaneous "ping -f -s 10
> host" commands, and the patched kernel indeed recovers from the bug with
> a "kernel: Kicking IO-APIC IRQ 17:" message :)
> Now if only we could call the recovery mechanism from the point where
> the "unexpected IRQ trap at vector" message gets printed (in
> arch/i386/kernel/irq.c:ack_none), then we would have a lot more generic
> code for all kinds of devices. If we then surround it by an #ifdef
> CONFIG_BROKEN_APIC like Helge suggested, there's more chance this patch
> gets accepted.
>
> Problem now is, in the ack_none function we only know about the
> (illegal) vector we are getting, and not about the interrupt we need to
> reset. Could there be some kind of link between these, so that
> kick_IO_APIC_irq can be called from there?
Interesting, I haven't come across this problem before, but it sounds like
the vector isn't getting delivered when the interrupt gets asserted and
only gets triggered later followed by an EOI... or something. Then again,
it's probably been beaten around a couple of times by now, so I'm probably
not adding anything new.
arch/i386/kernel/io_apic.c:irq_vector seems to be what you're looking for.
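A reverse lookup over such a table could be sketched as below, assuming irq_vector is an array indexed by IRQ number that holds the vector assigned to each IRQ, with 0 for unassigned entries. irq_for_vector() is a hypothetical helper written as plain user-space C, not an existing kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scan a per-IRQ vector table for the IRQ that owns `vector`.
 * Returns the IRQ number, or -1 if no assigned entry matches.
 * Entries equal to 0 are treated as unassigned (an assumption).
 */
int irq_for_vector(const int *irq_vector, size_t nr_irqs, int vector)
{
    assert(irq_vector != NULL);

    if (vector == 0)
        return -1;

    for (size_t irq = 0; irq < nr_irqs; irq++) {
        if (irq_vector[irq] == vector)
            return (int)irq;
    }
    return -1;
}
```

With an invented table of { 0x31, 0, 0x39, 0x41 }, vector 0x39 maps back to IRQ 2, while 0x7d matches nothing and yields -1. If the vector seen in ack_none is itself corrupted, as the "unexpected" message suggests, the lookup may simply fail to name the IRQ that needs kicking.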
Good luck,
Zwane Mwaikambo
--
http://function.linuxpower.ca
* RE: The buggy APIC of the Abit BP6
2002-06-14 10:54 ` Helge Hafting
@ 2002-06-18 15:30 ` Robbert Kouprie
2002-06-18 15:17 ` Zwane Mwaikambo
2002-06-19 12:47 ` Maciej W. Rozycki
0 siblings, 2 replies; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-18 15:30 UTC (permalink / raw)
To: 'Raphael Manfredi'; +Cc: 'Helge Hafting', linux-kernel
Raphael Manfredi wrote:
> I also know that it works for me. ;-)
I just triggered the bug using a couple of simultaneous "ping -f -s 10
host" commands, and the patched kernel indeed recovers from the bug with
a "kernel: Kicking IO-APIC IRQ 17:" message :)
Now if only we could call the recovery mechanism from the point where
the "unexpected IRQ trap at vector" message gets printed (in
arch/i386/kernel/irq.c:ack_none), then we would have a lot more generic
code for all kinds of devices. If we then surround it by an #ifdef
CONFIG_BROKEN_APIC like Helge suggested, there's more chance this patch
gets accepted.
Problem now is, in the ack_none function we only know about the
(illegal) vector we are getting, and not about the interrupt we need to
reset. Could there be some kind of link between these, so that
kick_IO_APIC_irq can be called from there?
- Robbert
* RE: The buggy APIC of the Abit BP6
2002-06-18 15:30 ` Robbert Kouprie
2002-06-18 15:17 ` Zwane Mwaikambo
@ 2002-06-19 12:47 ` Maciej W. Rozycki
2002-06-19 13:23 ` Robbert Kouprie
2002-06-20 12:21 ` Helge Hafting
1 sibling, 2 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2002-06-19 12:47 UTC (permalink / raw)
To: Robbert Kouprie
Cc: 'Raphael Manfredi', 'Helge Hafting', linux-kernel
On Tue, 18 Jun 2002, Robbert Kouprie wrote:
> Problem now is, in the ack_none function we only know about the
> (illegal) vector we are getting, and not about the interrupt we need to
> reset. Could there be some kind of link between these, so that
> kick_IO_APIC_irq can be called from there?
You get an invalid vector delivered due to massive transmission errors on
the inter-APIC bus. The errors are a serious hardware problem that cannot
and should not be fixed in software.
I'm told getting a better PSU may help, though.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +
* RE: The buggy APIC of the Abit BP6
2002-06-19 12:47 ` Maciej W. Rozycki
@ 2002-06-19 13:23 ` Robbert Kouprie
2002-06-19 14:03 ` Keith Owens
2002-06-19 14:22 ` Maciej W. Rozycki
2002-06-20 12:21 ` Helge Hafting
1 sibling, 2 replies; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-19 13:23 UTC (permalink / raw)
To: 'Maciej W. Rozycki'; +Cc: linux-kernel
> The errors are a serious hardware
> problem that cannot
> and should not be fixed in software.
I know the hardware sucks bad, but what's wrong with trying to work
around the problem, provided no one else is bugged by the workaround?
> I'm told getting a better PSU may help, though.
The box already has an (overkill) 300W PSU, yet I'm still seeing
problems.
- Robbert Kouprie
* Re: The buggy APIC of the Abit BP6
2002-06-19 13:23 ` Robbert Kouprie
@ 2002-06-19 14:03 ` Keith Owens
2002-06-19 14:35 ` Maciej W. Rozycki
2002-06-20 1:50 ` Robbert Kouprie
2002-06-19 14:22 ` Maciej W. Rozycki
1 sibling, 2 replies; 21+ messages in thread
From: Keith Owens @ 2002-06-19 14:03 UTC (permalink / raw)
To: Robbert Kouprie; +Cc: linux-kernel
On Wed, 19 Jun 2002 15:23:13 +0200,
"Robbert Kouprie" <robbert@radium.jvb.tudelft.nl> wrote:
>I know the hardware sucks bad, but what's wrong with trying to work
>around the problem, provided no one else is bugged by the workaround?
You do not have the data required to (a) detect the problem and (b)
recover even if you could detect the problem. The APIC bus has a
single bit checksum, the APIC hardware detects single bit errors and
does a retransmission. It _cannot_ detect double bit errors, the bad
data is accepted and processed with undefined side effects.
What you see in the logs for a BP6 are error messages for single bit
errors that were recovered by the hardware. You will never see
messages for double bit errors, just unexplained oops and/or machine
hangs.
Yes, I have a BP6 :(.
* RE: The buggy APIC of the Abit BP6
2002-06-19 13:23 ` Robbert Kouprie
2002-06-19 14:03 ` Keith Owens
@ 2002-06-19 14:22 ` Maciej W. Rozycki
1 sibling, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2002-06-19 14:22 UTC (permalink / raw)
To: Robbert Kouprie; +Cc: linux-kernel
On Wed, 19 Jun 2002, Robbert Kouprie wrote:
> I know the hardware sucks bad, but what's wrong with trying to work
> around the problem, provided no one else is bugged by the workaround?
The reliability of the hardware is next to null. You are not able to
recover from that. You may succeed in recovering from a subset of malformed
APIC messages, especially as losing an interrupt is often negligible, but
sooner or later you'll get hit by a corrupted IPI message, such as a TLB
flush, and the system will either crash or produce wrong results. Note
that APIC hardware already performs consistency checks on messages
exchanged, which are capable of filtering out damaged ones. If the hardware
fails to do that, the message must have been seriously damaged.
Similarly you wouldn't like to work around occasional corruptions on your
host data bus, would you?
> The box already has an (overkill) 300W PSU, yet I'm still seeing
> problems.
Too bad.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +
* Re: The buggy APIC of the Abit BP6
2002-06-19 14:03 ` Keith Owens
@ 2002-06-19 14:35 ` Maciej W. Rozycki
2002-06-20 1:50 ` Robbert Kouprie
1 sibling, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2002-06-19 14:35 UTC (permalink / raw)
To: Keith Owens; +Cc: Robbert Kouprie, linux-kernel
On Thu, 20 Jun 2002, Keith Owens wrote:
> You do not have the data required to (a) detect the problem and (b)
> recover even if you could detect the problem. The APIC bus has a
> single bit checksum, the APIC hardware detects single bit errors and
> does a retransmission. It _cannot_ detect double bit errors, the bad
> data is accepted and processed with undefined side effects.
Thanks to the way the checksum is calculated (a two-bit cumulative sum),
about 75% of double-bit errors are detected as well.
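To make the idea of a two-bit cumulative sum concrete, here is a toy model in user-space C. It assumes the message is a sequence of 2-bit symbols and that the checksum is their sum modulo 4; the real bus framing has more fields and is not reproduced, and checksum2() and corruption_detected() are names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of 2-bit symbols modulo 4; each array element carries one symbol
 * in its two low bits (assumed model, not the exact bus protocol). */
unsigned checksum2(const uint8_t *sym, size_t n)
{
    unsigned sum = 0;

    assert(sym != NULL);
    for (size_t i = 0; i < n; i++)
        sum = (sum + (sym[i] & 3u)) & 3u;
    return sum;
}

/* Returns 1 when the received symbols no longer match the sender's
 * checksum, i.e. when the receiver would notice the corruption. */
int corruption_detected(const uint8_t *sent, const uint8_t *received, size_t n)
{
    return checksum2(sent, n) != checksum2(received, n);
}
```

In this model any single flipped bit changes the sum by 1 or 2 and is always caught. Two flips inside one symbol are caught as well, but flipping the high bit of two different symbols changes the sum by a multiple of 4 and slips through, which is the kind of error that ends up accepted as valid data.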
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +
* RE: The buggy APIC of the Abit BP6
2002-06-19 14:03 ` Keith Owens
2002-06-19 14:35 ` Maciej W. Rozycki
@ 2002-06-20 1:50 ` Robbert Kouprie
1 sibling, 0 replies; 21+ messages in thread
From: Robbert Kouprie @ 2002-06-20 1:50 UTC (permalink / raw)
To: 'Keith Owens', 'Maciej W. Rozycki'; +Cc: linux-kernel
Keith Owens wrote:
> You do not have the data required to (a) detect the problem and (b)
> recover even if you could detect the problem.
Maciej W. Rozycki wrote:
> The reliability of the hardware is next to null. You are not able to
> recover from that.
Okay, you guys convinced me that some hardware can suck *really* bad. I
think I'm just going to stop my effort on this, stay with Raphael
Manfredi's hack to avoid most of the hangs on my BP6 for now, and get a
new board ASAP.
Thanks for all the help,
- Robbert Kouprie
* Re: The buggy APIC of the Abit BP6
2002-06-19 12:47 ` Maciej W. Rozycki
2002-06-19 13:23 ` Robbert Kouprie
@ 2002-06-20 12:21 ` Helge Hafting
2002-06-20 13:10 ` Maciej W. Rozycki
2002-06-20 22:29 ` Kevin Krieser
1 sibling, 2 replies; 21+ messages in thread
From: Helge Hafting @ 2002-06-20 12:21 UTC (permalink / raw)
To: Maciej W. Rozycki, linux-kernel
"Maciej W. Rozycki" wrote:
>
> On Tue, 18 Jun 2002, Robbert Kouprie wrote:
>
> > Problem now is, in the ack_none function we only know about the
> > (illegal) vector we are getting, and not about the interrupt we need to
> > reset. Could there be some kind of link between these, so that
> > kick_IO_APIC_irq can be called from there?
>
> You get an invalid vector delivered due to massive transmission errors at
> the inter-APIC bus. The errors are a serious hardware problem that cannot
> and should not be fixed in software.
Yes, the hardware is at fault. I don't have money for
other hardware though, so working around it seems a good idea.
We could simplify the IDE driver a lot by dropping support for
all the broken controllers too. Or tell
people to not use DMA on them.
Of course such an option should default to OFF, and
perhaps live under "dangerous." It can keep the
BP6 going much longer, which is good enough
for a home machine.
Failing due to a stuck NIC after one week seems worse
than crashing due to a scrambled IPI after some months.
There are more interrupts than IPIs.
This sort of fix doesn't really make things worse; the
theoretical scrambled IPI will happen without it too.
The safe solution is NOAPIC, this fix simply makes it work
for a longer time using the bad apic.
>
> I'm told getting a better PSU may help, though.
Unfortunately not. I got a nice PSU when I ordered the BP6,
thinking that power was the only issue. (It was the only
cheap dual solution at the time.)
Helge Hafting
* Re: The buggy APIC of the Abit BP6
2002-06-20 12:21 ` Helge Hafting
@ 2002-06-20 13:10 ` Maciej W. Rozycki
2002-06-21 13:02 ` Helge Hafting
2002-06-20 22:29 ` Kevin Krieser
1 sibling, 1 reply; 21+ messages in thread
From: Maciej W. Rozycki @ 2002-06-20 13:10 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel
On Thu, 20 Jun 2002, Helge Hafting wrote:
> Yes, the hardware is at fault. I don't have money for
> other hardware though, so working around it seems a good idea.
What's the problem with using a privately patched kernel then? I do that
all the time for various stuff.
> We could simplify the IDE driver a lot by dropping support for
> all the broken controllers too. Or tell
> people to not use DMA on them.
It depends on how intrusive and reliable the workarounds are. If merely
slowing down or using PIO is sufficient, then they may be OK to include.
> The safe solution is NOAPIC, this fix simply makes it work
> for a longer time using the bad apic.
Well, consider it *the* workaround, then.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +
* RE: The buggy APIC of the Abit BP6
2002-06-20 12:21 ` Helge Hafting
2002-06-20 13:10 ` Maciej W. Rozycki
@ 2002-06-20 22:29 ` Kevin Krieser
1 sibling, 0 replies; 21+ messages in thread
From: Kevin Krieser @ 2002-06-20 22:29 UTC (permalink / raw)
To: Helge Hafting, Maciej W. Rozycki, linux-kernel
Obviously, in my case, if I had known about the problems it would have
later, I wouldn't have bought it. But at the time, 2 433 Celerons were
cheaper than a Pentium III 600 system.
At least, with the additional fan I've added, and the "noapic" option, it is
pretty reliable. Up for weeks at a time before I reboot. Of course, when I
had my IBM hard drives on the HPT366 board, it was more likely to crash from
a DMA error than the apic problems.
In fact, the last problems I had were SCSI related, fixed by adding a second
SCSI card for some external devices. Not motherboard related.
-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Helge Hafting
Sent: Thursday, June 20, 2002 7:21 AM
To: Maciej W. Rozycki; linux-kernel@vger.kernel.org
Subject: Re: The buggy APIC of the Abit BP6
> I'm told getting a better PSU may help, though.
Unfortunately not. I got a nice PSU when I ordered the BP6,
thinking that power was the only issue. (It was the only
cheap dual solution at the time.)
* Re: The buggy APIC of the Abit BP6
2002-06-20 13:10 ` Maciej W. Rozycki
@ 2002-06-21 13:02 ` Helge Hafting
0 siblings, 0 replies; 21+ messages in thread
From: Helge Hafting @ 2002-06-21 13:02 UTC (permalink / raw)
To: Maciej W. Rozycki, linux-kernel
"Maciej W. Rozycki" wrote:
>
> On Thu, 20 Jun 2002, Helge Hafting wrote:
>
> > Yes, the hardware is at fault. I don't have money for
> > other hardware though, so working around it seems a good idea.
>
> What's the problem with using a privately patched kernel then? I do that
> all the time for various stuff.
Nothing wrong with that. I may be wrong, but I have the impression
that the BP6 was quite popular for a while. It was the only
cheap SMP board for a while (excluding those who re-soldered
their CPUs to enable SMP in other boards)
and it took some time before people realized that a good
PSU and extra cooling weren't enough.
>
> > We could simplify the IDE driver a lot by dropping support for
> > all the broken controllers too. Or tell
> > people to not use DMA on them.
>
> It depends on how intrusive and reliable the workarounds are. If merely
> slowing down or using PIO is sufficient, then they may be OK to include.
>
My impression is that the IDE driver contains workarounds for many
broken chipsets. DMA is usually
off by default but can be turned on by those who know it
won't hurt in their case.
I think a similar approach is OK for the BP6 fix - put it in
because there are a bunch of these boards, default it OFF because
many more don't need it.
Helge Hafting