* Network hang with 2.4.1-pre9 and 3c59x
@ 2001-01-23 17:40 John Roll
2001-01-24 0:40 ` Andrew Morton
0 siblings, 1 reply; 4+ messages in thread
From: John Roll @ 2001-01-23 17:40 UTC
To: linux-kernel
Hi,
I read about some problems with my ethernet card (3c59x) but it was rumored
that they were fixed in 2.4.1-pre8. I have 6 IDE drives raided together and
was stress testing the disk IO. Suddenly there was no network!
[root@image log]# uname -a
Linux image.harvard.edu 2.4.1-pre9 #1 SMP Mon Jan 22 12:59:32 EST 2001 i686 unknown
From the log:
NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, tx_status 00 status e681.
diagnostics: net 0cd8 media 8880 dma 0000003a.
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Flags; bus-master 1, full 0; dirty 18114(2) current 18114(2).
Transmit list 00000000 vs. c14ab220.
0: @c14ab200 length 8000002a status 8001002a
1: @c14ab210 length 8000002a status 8001002a
2: @c14ab220 length 800000fe status 000100fe
3: @c14ab230 length 800005ea status 000105ea
4: @c14ab240 length 800000fe status 000100fe
5: @c14ab250 length 800000fe status 000100fe
6: @c14ab260 length 800005ea status 000105ea
7: @c14ab270 length 800000fe status 000100fe
8: @c14ab280 length 800000fe status 000100fe
9: @c14ab290 length 800000fe status 000100fe
10: @c14ab2a0 length 800000fe status 000100fe
11: @c14ab2b0 length 8000002a status 0001002a
12: @c14ab2c0 length 8000002a status 0001002a
13: @c14ab2d0 length 8000002a status 0001002a
14: @c14ab2e0 length 8000002a status 0001002a
15: @c14ab2f0 length 8000002a status 0001002a
... several more message blocks like this until I reboot ....
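For anyone decoding the dump above: the number in parentheses after `dirty' and `current' looks like the counter reduced modulo the ring size, and the address after "vs." matches the descriptor in that slot. A minimal sketch of that arithmetic, assuming a 16-entry ring and a 16-byte descriptor stride (both inferred from the listing, not taken from the driver source). The helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions inferred from the log above, not from 3c59x.c itself. */
#define RING_ENTRIES 16u  /* the dump lists slots 0..15 */
#define DESC_STRIDE  16u  /* consecutive slots are 0x10 bytes apart */

/* Hypothetical helper: map a free-running counter to a ring slot. */
unsigned ring_slot(unsigned counter)
{
    return counter % RING_ENTRIES;
}

/* Hypothetical helper: address of a slot, given the address of slot 0. */
uint32_t desc_addr(uint32_t ring_base, unsigned slot)
{
    return ring_base + slot * DESC_STRIDE;
}

/* Sanity check against the values printed in the log. */
void check_log_values(void)
{
    assert(ring_slot(18114) == 2);
    assert(desc_addr(0xc14ab200u, ring_slot(18114)) == 0xc14ab220u);
}
```

With the logged values, the counter 18114 lands on slot 2, and slot 2 sits at c14ab220, the address on the "Transmit list" line. Slot 2 is also the first descriptor in the dump whose status word does not have its top bit set.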
Here is the boot message showing my ethernet card:
3c59x.c:LK1.1.12 06 Jan 2000 Donald Becker and others. http://www.scyld.com/network/vortex.html $Revision: 1.102.2.46 $
See Documentation/networking/vortex.txt
eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0xe400, 00:01:02:c4:ae:cb, IRQ 16
product code 'CG' rev 00.12 date 08-29-00
8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
MII transceiver found at address 24, status 786d.
Enabling bus-master transmits and whole-frame receives.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: Network hang with 2.4.1-pre9 and 3c59x
2001-01-23 17:40 Network hang with 2.4.1-pre9 and 3c59x John Roll
@ 2001-01-24 0:40 ` Andrew Morton
2001-01-24 12:35 ` Maciej W. Rozycki
0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2001-01-24 0:40 UTC
To: John Roll; +Cc: linux-kernel
John Roll wrote:
>
> Hi,
>
> I read about some problems with my ethernet card (3c59x) but it was rumored
> that they were fixed in 2.4.1-pre8. I have 6 IDE drives raided together and
> was stress testing the disk IO. Suddenly there was no network!
>
>
> ...
> Linux image.harvard.edu 2.4.1-pre9 #1 SMP Mon Jan 22 12:59:32 EST 2001 i686 unknown
>
> ...
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
This is due to a lost APIC interrupt acknowledgement. A workaround
is to boot with the `noapic' LILO option.
This long-standing and very nasty problem was discussed extensively
a week or two ago. Suspicions were cast at the disable_irq() function
but I'm not sure anything 100% conclusive was arrived at.
I guess I'll have to find a way to make disable_irq() go away,
see if that helps.
* Re: Network hang with 2.4.1-pre9 and 3c59x
2001-01-24 0:40 ` Andrew Morton
@ 2001-01-24 12:35 ` Maciej W. Rozycki
2001-01-25 0:24 ` Andrew Morton
0 siblings, 1 reply; 4+ messages in thread
From: Maciej W. Rozycki @ 2001-01-24 12:35 UTC
To: Andrew Morton; +Cc: John Roll, linux-kernel
On Wed, 24 Jan 2001, Andrew Morton wrote:
> This is due to a lost APIC interrupt acknowledgement. A workaround
> is to boot with the `noapic' LILO option.
>
> This long-standing and very nasty problem was discussed extensively
> a week or two ago. Suspicions were cast at the disable_irq() function
> but I'm not sure anything 100% conclusive was arrived at.
Not sure if that is 100% conclusive but I decided to develop an APIC
lockup recovery procedure. Fortunately chips provide us enough
information we may deal with the problem with moderate pain.
> I guess I'll have to find a way to make disable_irq() go away,
> see if that helps.
Please don't. This would be hiding problems under a carpet.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +
* Re: Network hang with 2.4.1-pre9 and 3c59x
2001-01-24 12:35 ` Maciej W. Rozycki
@ 2001-01-25 0:24 ` Andrew Morton
0 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2001-01-25 0:24 UTC
To: Maciej W. Rozycki; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]
"Maciej W. Rozycki" wrote:
>
> On Wed, 24 Jan 2001, Andrew Morton wrote:
>
> > This is due to a lost APIC interrupt acknowledgement. A workaround
> > is to boot with the `noapic' LILO option.
> >
> > This long-standing and very nasty problem was discussed extensively
> > a week or two ago. Suspicions were cast at the disable_irq() function
> > but I'm not sure anything 100% conclusive was arrived at.
>
> Not sure if that is 100% conclusive but I decided to develop an APIC
> lockup recovery procedure. Fortunately chips provide us enough
> information we may deal with the problem with moderate pain.
Cool.
> > I guess I'll have to find a way to make disable_irq() go away,
> > see if that helps.
>
> Please don't. This would be hiding problems under a carpet.
Whether it's fixed properly, or kludged in the APIC code or kludged
in the drivers, it needs to be fixed. I've spent nine months
methodically picking away at the 3com driver so it's now very
reliable, and this interrupt problem is the major failure mode.
In fact, the only failure mode, apart from the usual dodgy
ethernet switch negotiation blah.
So I've started to poke at this problem as well. I'd be glad to stop :)
Attached are two patches:
irq-whacker.patch:
This is a patch against the 3com driver which simply calls
disable_irq()/enable_irq() at 100kHz. Enable it with the
`whacker=1' module parm. With this thread running, the
APIC dies within about one second as soon as you start
sending 100baseT traffic through the interface. So it's
nice and reproducible. This testing setup should translate
easily into any PCI netdriver.
manfred.patch:
Manfred's edge+level trigger hack. This fixes the problem!
It slows down disable_irq()/enable_irq() a bit, but that
doesn't seem an issue. A proper fix would be nice, but
this puppy works.
Manfred's ALT+SYSRQ+Q trick also fixes the problem.
Enabling processor focus simply makes interrupts
stop altogether. Haven't looked into this yet.
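As a rough check on the 100kHz figure for irq-whacker: the loop in the attached patch busy-waits 5us after disable_irq() and another 5us after enable_irq(), so one full cycle takes at least 10us. A minimal sketch of that arithmetic (the function name is hypothetical, and the result is only an upper bound, since it ignores time spent in schedule() and in the IRQ calls themselves):

```c
#include <assert.h>

/* Upper bound on disable/enable cycles per second for a loop that
 * busy-waits delay_a_us and then delay_b_us microseconds per cycle.
 * Returns 0 when both delays are zero, to avoid dividing by zero. */
unsigned long max_toggle_rate_hz(unsigned delay_a_us, unsigned delay_b_us)
{
    unsigned long cycle_us = (unsigned long)delay_a_us + delay_b_us;

    if (cycle_us == 0)
        return 0;
    return 1000000UL / cycle_us;
}

/* Sanity check with the two udelay(5) calls from the patch. */
void check_whacker_rate(void)
{
    assert(max_toggle_rate_hz(5, 5) == 100000UL);
}
```

So the two udelay(5) calls cap the loop at 100000 cycles per second. The real rate will be somewhat lower.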
-
[-- Attachment #2: manfred.patch --]
[-- Type: text/plain, Size: 5200 bytes --]
Message-ID: <3A5F6F07.88564D5B@colorfullife.com>
Date: Fri, 12 Jan 2001 21:54:31 +0100
From: Manfred Spraul <manfred@colorfullife.com>
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.2.16-22 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: mingo@elte.hu
CC: Frank de Lange <frank@unternet.org>,
Linus Torvalds <torvalds@transmeta.com>, dwmw2@infradead.org,
linux-kernel@vger.kernel.org, Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?
In-Reply-To: <Pine.LNX.4.30.0101122136180.2772-100000@e2>
Content-Type: multipart/mixed;
boundary="------------55919F484DB2C7B38B8C5162"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
This is a multi-part message in MIME format.
--------------55919F484DB2C7B38B8C5162
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Ingo Molnar wrote:
>
>
> okay - i just wanted to hear a definitive word from you that this fixes
> your problem, because this is what we'll have to do as a final solution.
> (barring any other solution.)
>
Ingo, is that possible?
The current fix is "disable_irq_nosync() and enable_irq() cause
deadlocks with level triggered ioapic irqs, do not use them" - I'm sure
ne2k-pci isn't the only driver that uses these functions.
I have found one combination that doesn't hang with the unpatched
8390.c, but network throughput is down to 1/2. I hope that's due to the
debugging changes.
I'll restart now from a fresh 2.4.0 tree:
Changes:
1) enable focus cpu.
2) apply the attached patch.
I'm not sure if it's a real fix or if it just hides the problem: my
sysrq patch has shown that clearing and setting the "level trigger" bit
in the io apic reanimates the IO APIC.
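To make the bit-twiddling concrete, here is a minimal standalone sketch of what the attached patch does to the low word of a redirection entry. The trigger-mode masks (0x00008000 and 0xffff7fff) are the ones used in the patch. The mask-bit value 0x00010000 is an assumption based on the 82093AA register layout and does not appear in the quoted hunks. The function names are hypothetical, and nothing here touches real hardware:

```c
#include <assert.h>
#include <stdint.h>

#define RTE_TRIGGER_LEVEL 0x00008000u /* bit 15: value taken from the patch */
#define RTE_MASKED        0x00010000u /* bit 16: assumed, not in the quoted hunks */

/* Mirrors the order in mask_level_IO_APIC_irq(): mask the entry first,
 * then switch it to edge trigger. */
uint32_t rte_mask_level(uint32_t lo)
{
    lo |= RTE_MASKED;
    lo &= ~RTE_TRIGGER_LEVEL;   /* same effect as &= 0xffff7fff */
    return lo;
}

/* Mirrors the order in unmask_level_IO_APIC_irq(): switch back to
 * level trigger first, then unmask the entry. */
uint32_t rte_unmask_level(uint32_t lo)
{
    lo |= RTE_TRIGGER_LEVEL;
    lo &= ~RTE_MASKED;
    return lo;
}

/* Round trip on an arbitrary example value with bit 15 set. */
void check_round_trip(void)
{
    uint32_t lo = 0x0000a031u;

    assert(rte_mask_level(lo) == 0x00012031u);
    assert(rte_unmask_level(rte_mask_level(lo)) == lo);
}
```

The point of the ordering is that the entry is only ever in level mode while it is unmasked, so every disable/enable cycle clears and then sets the "level trigger" bit.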
--
Manfred
--------------55919F484DB2C7B38B8C5162
Content-Type: text/plain; charset=us-ascii;
name="patch-io"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="patch-io"
--- build-2.4/arch/i386/kernel/io_apic.c.orig Fri Jan 12 20:17:36 2001
+++ build-2.4/arch/i386/kernel/io_apic.c Fri Jan 12 21:26:31 2001
@@ -134,6 +134,30 @@
spin_unlock_irqrestore(&ioapic_lock, flags);
}
+DO_ACTION( __trigger_level, 0, |= 0x00008000, io_apic_sync(entry->apic))/* trigger = level */
+DO_ACTION( __trigger_edge, 0, &= 0xffff7fff, ) /* trigger = edge */
+
+
+static void unmask_level_IO_APIC_irq (unsigned int irq)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ioapic_lock, flags);
+ __trigger_level_IO_APIC_irq(irq);
+ __unmask_IO_APIC_irq(irq);
+ spin_unlock_irqrestore(&ioapic_lock, flags);
+}
+
+static void mask_level_IO_APIC_irq (unsigned int irq)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ioapic_lock, flags);
+ __mask_IO_APIC_irq(irq);
+ __trigger_edge_IO_APIC_irq(irq);
+ spin_unlock_irqrestore(&ioapic_lock, flags);
+}
+
static void unmask_IO_APIC_irq (unsigned int irq)
{
unsigned long flags;
@@ -143,6 +167,7 @@
spin_unlock_irqrestore(&ioapic_lock, flags);
}
+
void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
{
struct IO_APIC_route_entry entry;
@@ -1181,14 +1206,14 @@
*/
static unsigned int startup_level_ioapic_irq (unsigned int irq)
{
- unmask_IO_APIC_irq(irq);
+ unmask_level_IO_APIC_irq(irq);
return 0; /* don't check for pending */
}
-#define shutdown_level_ioapic_irq mask_IO_APIC_irq
-#define enable_level_ioapic_irq unmask_IO_APIC_irq
-#define disable_level_ioapic_irq mask_IO_APIC_irq
+#define shutdown_level_ioapic_irq mask_level_IO_APIC_irq
+#define enable_level_ioapic_irq unmask_level_IO_APIC_irq
+#define disable_level_ioapic_irq mask_level_IO_APIC_irq
static void end_level_ioapic_irq (unsigned int i)
{
--------------55919F484DB2C7B38B8C5162--
[-- Attachment #3: irq-whacker.patch --]
[-- Type: text/plain, Size: 2000 bytes --]
Index: drivers/net/3c59x.c
===================================================================
RCS file: /opt/cvs/lk/drivers/net/3c59x.c,v
retrieving revision 1.21
diff -u -u -r1.21 3c59x.c
--- drivers/net/3c59x.c 2001/01/23 08:25:53 1.21
+++ drivers/net/3c59x.c 2001/01/25 00:23:11
@@ -220,6 +220,8 @@
static char version[] __devinitdata =
"3c59x.c:LK1.1.13 14 Jan 2001 Donald Becker and others. http://www.scyld.com/network/vortex.html\n";
+static int whacker = 0;
+
MODULE_AUTHOR("Donald Becker <becker@scyld.com>");
MODULE_DESCRIPTION("3Com 3c59x/3c90x/3c575 series Vortex/Boomerang/Cyclone driver");
MODULE_PARM(debug, "i");
@@ -232,6 +234,7 @@
MODULE_PARM(compaq_irq, "i");
MODULE_PARM(compaq_device_id, "i");
MODULE_PARM(watchdog, "i");
+MODULE_PARM(whacker, "i");
/* Operational parameter that usually are not changed. */
@@ -775,6 +778,41 @@
static int vortex_cards_found;
+static volatile int run_thread, thread_running;
+
+static int kthread(void *arg)
+{
+ struct net_device *dev = arg;
+
+ printk("kthread running\n");
+ thread_running = 1;
+ while (run_thread) {
+ disable_irq(dev->irq);
+ udelay(5);
+ schedule();
+ enable_irq(dev->irq);
+ udelay(5);
+ }
+ printk("kthread stops\n");
+ thread_running = 0;
+ return 0;
+}
+
+static void start_irq_whacker(struct net_device *dev)
+{
+ run_thread = 1;
+ thread_running = 0;
+ if (whacker)
+ kernel_thread(kthread, dev, 0);
+}
+
+static void stop_irq_whacker(struct net_device *dev)
+{
+ run_thread = 0;
+ while (thread_running)
+ ;
+}
+
static void vortex_suspend (struct pci_dev *pdev)
{
struct net_device *dev = pdev->driver_data;
@@ -1421,6 +1459,7 @@
if (vp->cb_fn_base) /* The PCMCIA people are idiots. */
writel(0x8000, vp->cb_fn_base + 4);
netif_start_queue (dev);
+ start_irq_whacker(dev);
}
static int
@@ -2298,6 +2337,8 @@
{
struct vortex_private *vp = (struct vortex_private *)dev->priv;
long ioaddr = dev->base_addr;
+
+ stop_irq_whacker(dev);
netif_stop_queue (dev);