Re: IRQ problems on IBM 850

* Re: IRQ problems on IBM 850
       [not found] <Pine.HPX.4.10.9910252356120.21757-100000@gra-ux1.iram.es>
@ 1999-11-04  5:33 ` Hollis R Blanchard
  1999-11-04 13:27   ` Gabriel Paubert
  0 siblings, 1 reply; 7+ messages in thread
From: Hollis R Blanchard @ 1999-11-04  5:33 UTC
  To: Gabriel Paubert
  Cc: Cort Dougan, David Monro, linuxppc-workstation, linuxppc-dev


On Tue, 26 Oct 1999, Gabriel Paubert wrote:
> 
> On Mon, 25 Oct 1999, Cort Dougan wrote:
> 
> > I remember this problem some time ago.  The failure mode seems to be we get
> > 2 interrupts on the cascaded 8259 (irq > 7), we handle one and ack/eoi that
> > whole thing so we lose one of the interrupts.  On the 830's (I'm not sure
> > about the 850) the IDE and pcnet are on the cascaded 8259 controller so I
> > was able to easily reproduce this problem when trying to fix it by
> > generating a lot of network and disk traffic (such as copying something
> > to/from the network).
> > 
> > I had a loop in the 8259 code for some time to go through until it didn't
> > find any more interrupts pending.
> 
> In i8259.c, yes but a loop reading the IRR is not enough since it does not
> reset the edge detect logic. There is anyway one contradiction between
> what is implemented in the kernel and what my 8259 doc claims in the order
> of the EOI to the slave and master. Since on my machine I don't use the
> slave, I can't test it. 
> 
> cat /proc/interrupts
>            CPU0       
>   1:          8   i8259         keyboard
>   2:          0   i8259         82c59 secondary cascade
>   4:        680   i8259         serial
>  16:          0   OpenPIC       82c59 cascade
>  18:    1727293   OpenPIC       DC21140 (eth0)
>  19:     152520   OpenPIC       ncr53c8xx
>  21:   55648729   OpenPIC       vme (Universe)
> BAD:          0

An update:

I found at least a typo in arch/ppc/kernel/i8259.c. At about line 56:
                outb(0x20,0xA0);        /* Non-specific EOI */
                outb(0x20,0x20);        /* Non-specific EOI to cascade */ 

What I'm reading says that 0xA0 is the cascade, and 0x20 is the master. So the
order was the opposite of what the comments said. I would think you would want
to ack the master first and then the slave, but I changed this and it made no
difference. :(
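
For reference, a minimal sketch of the EOI sequence for a cascaded pair,
assuming the usual PC-style wiring (master command port 0x20, slave command
port 0xA0). It follows the commonly documented convention of acking the slave
first and then the master; that ordering is an assumption here, not a verified
fix for the 850. pic_send_eoi() and log_outb() are hypothetical names, and
log_outb() only records the port so the sketch runs in user space.

```c
#include <stddef.h>
#include <stdint.h>

#define PIC_MASTER_CMD 0x20  /* master 8259 command port */
#define PIC_SLAVE_CMD  0xA0  /* slave (cascaded) 8259 command port */
#define PIC_EOI        0x20  /* OCW2: non-specific EOI */

/* Stand-in for the kernel's outb(): remembers which ports were written. */
static uint16_t port_log[8];
static size_t port_log_len;

static void log_outb(uint8_t value, uint16_t port)
{
    (void)value;
    if (port_log_len < sizeof(port_log) / sizeof(port_log[0]))
        port_log[port_log_len++] = port;
}

/* Hypothetical helper: acknowledge irq (0..15) on a cascaded pair. */
static void pic_send_eoi(unsigned int irq)
{
    if (irq >= 8)
        log_outb(PIC_EOI, PIC_SLAVE_CMD);  /* slave first (assumed order) */
    log_outb(PIC_EOI, PIC_MASTER_CMD);     /* then the master */
}
```

An interrupt on irq 13 produces two writes (0xA0, then 0x20); an interrupt on
irq 4 produces only the write to 0x20.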

My /proc/interrupts reads:
1:      576     i8259   keyboard
2:        0     i8259   82c59 secondary cascade
5:        1     i8259   Crystal audio controller
13:  287499     i8259   ide0
15:   16720     i8259   PCnet/PCI II 79C970A
BAD:      1

I don't know enough to interpret it. I do have a SCSI card attached right now
but no drive (and not compiled into the kernel) - is that the "BAD"?

I also don't know what to make of the lines that look like
        outb(0xFF, 0x21); /* Mask all */
and
        outb(cached_A1, 0x21);
They're to mask and unmask the interrupts? What is accomplished with
cached_A1? We keep the same interrupt from being re-entered? (I don't know the
order the functions in this file are called, nor how Linux interrupt handlers
work in general.)
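
A minimal sketch of the shadow-mask idea these lines appear to rely on,
assuming cached_21 is a software copy of the master IMR (port 0x21) and
cached_A1 a copy of the slave IMR (port 0xA1). Keeping a copy lets one bit be
changed without reading the register back. The sketch_* and sim_* names are
hypothetical and the port writes are simulated, so this is not the code in
i8259.c.

```c
#include <stdint.h>

static uint8_t cached_21 = 0xFF;  /* shadow of master IMR, all masked */
static uint8_t cached_A1 = 0xFF;  /* shadow of slave IMR, all masked */

/* Simulated IMR ports; real code would use outb() on 0x21 and 0xA1. */
static uint8_t sim_port_21, sim_port_A1;

static void sim_outb(uint8_t value, uint16_t port)
{
    if (port == 0x21)
        sim_port_21 = value;
    else if (port == 0xA1)
        sim_port_A1 = value;
}

/* Set the IMR bit for irq (0..15): the line stops raising interrupts. */
static void sketch_mask_irq(unsigned int irq)
{
    if (irq >= 8) {
        cached_A1 |= (uint8_t)(1u << (irq - 8));
        sim_outb(cached_A1, 0xA1);
    } else {
        cached_21 |= (uint8_t)(1u << irq);
        sim_outb(cached_21, 0x21);
    }
}

/* Clear the IMR bit for irq (0..15): the line may interrupt again. */
static void sketch_unmask_irq(unsigned int irq)
{
    if (irq >= 8) {
        cached_A1 &= (uint8_t)~(1u << (irq - 8));
        sim_outb(cached_A1, 0xA1);
    } else {
        cached_21 &= (uint8_t)~(1u << irq);
        sim_outb(cached_21, 0x21);
    }
}
```

Because both functions read and modify the cached bytes, two callers racing on
them could write a stale mask, which is the kind of problem a lock around
these updates is meant to prevent.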

Also, Cort, in your original comment... I don't see how acking one interrupt
causes another to be lost - it's only the ISR register that's supposed to be
cleared, not the IRR. If there are still things in the IRR, another interrupt
is supposed to be generated and the ISR set appropriately by the controller,
no?

-Hollis


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

* Re: IRQ problems on IBM 850
  1999-11-04  5:33 ` IRQ problems on IBM 850 Hollis R Blanchard
@ 1999-11-04 13:27   ` Gabriel Paubert
  1999-11-04 20:32     ` Hollis R Blanchard
  0 siblings, 1 reply; 7+ messages in thread
From: Gabriel Paubert @ 1999-11-04 13:27 UTC
  To: Hollis R Blanchard
  Cc: Cort Dougan, David Monro, linuxppc-workstation, linuxppc-dev


On Thu, 4 Nov 1999, Hollis R Blanchard wrote:

> An update:
> 
> I found at least a typo in arch/ppc/kernel/i8259.c. At about line 56:
>                 outb(0x20,0xA0);        /* Non-specific EOI */
>                 outb(0x20,0x20);        /* Non-specific EOI to cascade */ 
> 
> What I'm reading says that 0xA0 is the cascade, and 0x20 is the master. So the
> order was the opposite of what the comments said. I would think you would want
> to ack the master first and then the slave, but I changed this and it made no
> difference. :(

Not completely surprising. What if you apply the following patch ? 

It might not solve your problem, but anyway playing with cached_A1 and
cached_21 while an interrupt may be coming is not healthy, to say the
least. 

--- irq.c.orig	Thu Nov  4 12:51:20 1999
+++ irq.c	Thu Nov  4 14:10:41 1999
@@ -50,7 +50,6 @@
 #include <asm/io.h>
 #include <asm/pgtable.h>
 #include <asm/irq.h>
-#include <asm/bitops.h>
 #include <asm/gg2.h>
 #include <asm/cache.h>
 #include <asm/prom.h>
@@ -61,6 +60,7 @@
 
 #include "local_irq.h"
 
+spinlock_t irq_controller_lock;
 extern volatile unsigned long ipi_count;
 void enable_irq(unsigned int irq_nr);
 void disable_irq(unsigned int irq_nr);
@@ -193,18 +193,27 @@
 /* XXX should implement irq disable depth like on intel */
 void disable_irq_nosync(unsigned int irq_nr)
 {
+	unsigned long flags;
+	spin_lock_irqsave(&irq_controller_lock, flags);
 	mask_irq(irq_nr);
+	spin_unlock_irqrestore(&irq_controller_lock, flags);
 }
 
 void disable_irq(unsigned int irq_nr)
 {
+	unsigned long flags;
+	spin_lock_irqsave(&irq_controller_lock, flags);
 	mask_irq(irq_nr);
+	spin_unlock_irqrestore(&irq_controller_lock, flags);
 	synchronize_irq();
 }
 
 void enable_irq(unsigned int irq_nr)
 {
+	unsigned long flags;
+	spin_lock_irqsave(&irq_controller_lock, flags);
 	unmask_irq(irq_nr);
+	spin_unlock_irqrestore(&irq_controller_lock, flags);
 }
 
 int get_irq_list(char *buf)
@@ -242,6 +251,23 @@
 	len += sprintf(buf+len, "IPI: %10lu\n", ipi_count);
 #endif		
 	len += sprintf(buf+len, "BAD: %10u\n", ppc_spurious_interrupts);
+#if 1
+	do {
+		int imr, isr, irr;
+		spin_lock_irq(&irq_controller_lock);
+		imr = (inb(0xA1)<<8) | inb(0x21);
+		outb(0xb,0x20);
+		outb(0xb,0xa0);
+		isr = (inb(0xA0)<<8) | inb(0x20);
+		outb(0xa,0x20);
+		outb(0xa,0xa0);
+		irr = (inb(0xA0)<<8) | inb(0x20);
+		spin_unlock_irq(&irq_controller_lock);
+		len += sprintf(buf+len, "8259 IMR/ISR/IRR = %04x/%04x/%04x\n",
+			       imr, isr, irr);
+		
+	} while(0);
+#endif	
 	return len;
 }
 
@@ -256,6 +282,10 @@
 	int cpu = smp_processor_id();
 	
 	mask_and_ack_irq(irq);
+	/* See comments in do_IRQ, we don't want another processor to touch
+	 * the masks between acquiring the vector and mask_and_ack.
+	 */
+	spin_unlock(&irq_controller_lock);
 	status = 0;
 	action = irq_desc[irq].action;
 	kstat.irqs[cpu][irq]++;
@@ -267,10 +297,16 @@
 			action->handler(irq, action->dev_id, regs);
 			action = action->next;
 		} while ( action );
-		__cli();
+		spin_lock_irq(&irq_controller_lock);
+		/* Still wrong on SMP, the interrupt might have been
+		 * disabled by another processor while it was active.
+		 * Flags like on i386 are necessary...
+		 */
 		unmask_irq(irq);
+		spin_unlock(&irq_controller_lock);
 	} else {
 		ppc_spurious_interrupts++;
+		/* Is it necessary ? */
 		disable_irq( irq );
 	}
 }
@@ -280,6 +316,11 @@
 	int cpu = smp_processor_id();
 
         hardirq_enter(cpu);
+	/* Ugly, the irq controller has to be locked here and unlocked
+	 * in dispatch_handler it seems. I can't see any better solution
+	 * barring a complete rewrite of interrupt handling :-(
+	 */
+	spin_lock(&irq_controller_lock);
         ppc_md.do_IRQ(regs, cpu, isfake);
         hardirq_exit(cpu);
 }

> My /proc/interrupts reads:
> 1:      576     i8259   keyboard
> 2:        0     i8259   82c59 secondary cascade
> 5:        1     i8259   Crystal audio controller
> 13:  287499     i8259   ide0
> 15:   16720     i8259   PCnet/PCI II 79C970A
> BAD:      1

Now mine reads: 
           CPU0       
  1:          2   i8259         keyboard
  2:          0   i8259         82c59 secondary cascade
  4:          5   i8259         serial
 16:          0   OpenPIC       82c59 cascade
 18:       2754   OpenPIC       DC21140 (eth0)
 19:       1930   OpenPIC       ncr53c8xx
BAD:          1
8259 IMR/ISR/IRR = ffe9/0000/0040

BTW: if you apply only the part which modifies the /proc output you might
be able to know the state of the interrupt controller when the system
locks up. This would be interesting and probably better in a first step,
since the rest of the patch might clash with your source.  In my case the
floppy interrupt is requested, but as long as it's masked...
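
To read the new line, each 16-bit word holds the slave register in the high
byte and the master register in the low byte, so bit n corresponds to irq n.
A small sketch of a decoder under that assumption; format_irq_bits() is a
hypothetical user-space helper and not part of the patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write the irq numbers whose bits are set in word into buf, separated
 * by spaces. Returns the number of characters written (excluding NUL).
 */
static int format_irq_bits(unsigned int word, char *buf, size_t size)
{
    size_t len = 0;
    unsigned int irq;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    for (irq = 0; irq < 16; irq++) {
        int n;

        if (!(word & (1u << irq)))
            continue;
        n = snprintf(buf + len, size - len, len ? " %u" : "%u", irq);
        if (n < 0 || (size_t)n >= size - len) {
            buf[len] = '\0';  /* out of room: drop the partial entry */
            break;
        }
        len += (size_t)n;
    }
    return (int)len;
}
```

For the dump above, the IRR word 0x0040 decodes to "6" and the IMR word
0xffe9 decodes to "0 3 5 6 7 8 9 10 11 12 13 14 15", leaving irqs 1, 2 and 4
unmasked.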


> 
> I don't know enough to interpret it. I do have a SCSI card attached right now
> but no drive (and not compiled into the kernel) - is that the "BAD"?
> 
> I also don't know what to make of the lines that look like
>         outb(0xFF, 0x21); /* Mask all */
> and
>         outb(cached_A1, 0x21);
> They're to mask and unmask the interrupts? What is accomplished with
> cached_A1? We keep the same interrupt from being re-entered? (I don't know the
> order the functions in this file are called, nor how Linux interrupt handlers
> work in general.)

I'm in the process of rewriting a lot of i8259.c, but it might take some
time until it stabilizes...

> 
> Also, Cort, in your original comment... I don't see how acking one interrupt
> causes another to be lost - it's only the ISR register that's supposed to be
> cleared, not the IRR. If there are still things in the IRR, another interrupt
> is supposed to be generated and the ISR set appropriately by the controller,
> no?

The 8259 is a fragile beast, plus the fact that the current
enable/disable_irq code is completely unprotected. My patch is not
sufficient but might be a step in the right direction (hopefully enough
on UP). 

	Gabriel.


* Re: IRQ problems on IBM 850
  1999-11-04 13:27   ` Gabriel Paubert
@ 1999-11-04 20:32     ` Hollis R Blanchard
  1999-11-05  8:42       ` Gabriel Paubert
  0 siblings, 1 reply; 7+ messages in thread
From: Hollis R Blanchard @ 1999-11-04 20:32 UTC
  To: Gabriel Paubert
  Cc: Cort Dougan, David Monro, linuxppc-workstation, linuxppc-dev


On Thu, 4 Nov 1999, Gabriel Paubert wrote:
> 
> It might not solve your problem, but anyway playing with cached_A1 and
> cached_21 while an interrupt may be coming is not healthy, to say the
> least. 
> 
> --- irq.c.orig	Thu Nov  4 12:51:20 1999
> +++ irq.c	Thu Nov  4 14:10:41 1999

[snip patch]

> > My /proc/interrupts reads:
> > 1:      576     i8259   keyboard
> > 2:        0     i8259   82c59 secondary cascade
> > 5:        1     i8259   Crystal audio controller
> > 13:  287499     i8259   ide0
> > 15:   16720     i8259   PCnet/PCI II 79C970A
> > BAD:      1
> 
> Now mine reads: 
>            CPU0       
>   1:          2   i8259         keyboard
>   2:          0   i8259         82c59 secondary cascade
>   4:          5   i8259         serial
>  16:          0   OpenPIC       82c59 cascade
>  18:       2754   OpenPIC       DC21140 (eth0)
>  19:       1930   OpenPIC       ncr53c8xx
> BAD:          1
> 8259 IMR/ISR/IRR = ffe9/0000/0040
> 
> BTW: if you apply only the part which modifies the /proc output you might
> be able to know the state of the interrupt controller when the system
> locks up. This would be interesting and probably better in a first step,
> since the rest of the patch might clash with your source.  In my case the
> floppy interrupt is requested, but as long as it's masked...

[...]

> The 8259 is a fragile beast, plus the fact that the current
> enable/disable_irq code is completely unprotected. My patch is not
> sufficient but might be a step in the right direction (hopefully enough
> on UP). 

The spinlock patch doesn't appear to have any effect on the situation. Still
lots of "hda - lost interrupt"'s.

A normal (no disk activity) /proc/interrupts:
1:     1321     i8259   keyboard
2:        0     i8259   82c59 secondary cascade
5:        1     i8259   Crystal audio controller
13:   82733     i8259   ide0
15:   30749     i8259   PCnet/PCI II 79C970A
BAD:      1
8259 IMR/ISR/IRR = 5f99/0000/0001

When "lost interrupts" are occurring (the keyboard does still function, Gabriel,
my mistake), the last line consistently looks like:

8259 IMR/ISR/IRR = 5f99/0000/a001

The 'a' indicates that interrupts have come in on irq's 13 & 15, which would
be ide0 and the ethernet controller. So it seems Cort's memory is correct. (In
your output, is irq 6 '0x0040' the floppy drive you mention?)
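
As a cross-check on the arithmetic, a short sketch that combines the three
words from the new /proc line: a request counts as "pending and deliverable"
if its bit is set in the IRR and clear in both the IMR and the ISR.
pending_unmasked() is a hypothetical helper name, and the result only
describes the instant at which the registers were read.

```c
#include <stdint.h>

/* Requests latched in the IRR that are neither masked nor in service. */
static uint16_t pending_unmasked(uint16_t imr, uint16_t isr, uint16_t irr)
{
    return (uint16_t)(irr & (uint16_t)~imr & (uint16_t)~isr);
}
```

For the lost-interrupt dump (5f99/0000/a001) this gives 0xa000, i.e. bits 13
and 15; for the idle dump (5f99/0000/0001) it gives 0, because bit 0 is set
in the IMR word.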

-Hollis



* Re: IRQ problems on IBM 850
  1999-11-04 20:32     ` Hollis R Blanchard
@ 1999-11-05  8:42       ` Gabriel Paubert
  1999-12-08 17:33         ` David Monro
  0 siblings, 1 reply; 7+ messages in thread
From: Gabriel Paubert @ 1999-11-05  8:42 UTC
  To: Hollis R Blanchard
  Cc: Cort Dougan, David Monro, linuxppc-workstation, linuxppc-dev

On Thu, 4 Nov 1999, Hollis R Blanchard wrote:

> The spinlock patch doesn't appear to have any effect on the situation. Still
> lots of "hda - lost interrupt"'s.

Now I might start to understand why.

> 
> A normal (no disk activity) /proc/interrupts:
> 1:     1321     i8259   keyboard
> 2:        0     i8259   82c59 secondary cascade
> 5:        1     i8259   Crystal audio controller
> 13:   82733     i8259   ide0
> 15:   30749     i8259   PCnet/PCI II 79C970A
> BAD:      1
> 8259 IMR/ISR/IRR = 5f99/0000/0001
> 
> When "lost interrupts" are occurring (the keyboard does still function, Gabriel,
> my mistake), the last line consistently looks like:
> 
> 8259 IMR/ISR/IRR = 5f99/0000/a001
> 
> The 'a' indicates that interrupts have come in on irq's 13 & 15, which would
> be ide0 and the ethernet controller. So it seems Cort's memory is correct. (In
> your output, is irq 6 '0x0040' the floppy drive you mention?)

Using interrupt 13 is strange, to say the least. It was reserved for FPU
errors on x86 processors. In your case obviously the interrupts are
expected (since you have a timeout) but have stayed masked for some
reason. We have to find where this happens. Some code paths might have an
unbalanced enable/disable_irq but I suspect that it will be hard to find.

Perhaps we should add some debugging code (again bloating
/proc/interrupts :-)) giving the address of the caller of the last
{en,dis}able_irq for every interrupt: just add 2 fields to each interrupt
struct ({dis,en}abled_by), set to __builtin_return_address(0) in
{dis,en}able_irq. Then print those addresses in /proc/interrupts. Can you
write such a patch or should I do it myself ?
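
A rough user-space sketch of the idea, assuming a GCC-compatible compiler for
__builtin_return_address(0). The struct, the array and the sketch_* functions
are hypothetical stand-ins; a real patch would put the two fields in the
kernel's per-interrupt descriptor and do the actual masking as well.

```c
#include <stddef.h>

#define SKETCH_NR_IRQS 16

/* Hypothetical per-irq debug record: who last disabled/enabled the line. */
struct irq_debug_info {
    void *disabled_by;
    void *enabled_by;
};

static struct irq_debug_info irq_dbg[SKETCH_NR_IRQS];

/* noinline so the recorded address is that of the real call site. */
__attribute__((noinline))
static void sketch_disable_irq(unsigned int irq)
{
    if (irq < SKETCH_NR_IRQS)
        irq_dbg[irq].disabled_by = __builtin_return_address(0);
    /* the real function would mask the irq here */
}

__attribute__((noinline))
static void sketch_enable_irq(unsigned int irq)
{
    if (irq < SKETCH_NR_IRQS)
        irq_dbg[irq].enabled_by = __builtin_return_address(0);
    /* the real function would unmask the irq here */
}
```

The two pointers can then be printed with "%p" next to each interrupt and
looked up in System.map to find an unbalanced disable.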

And yes, irq 6 is the floppy drive, although I've no floppy connected. But
the controller is here. 

	Gabriel.


* Re: IRQ problems on IBM 850
  1999-11-05  8:42       ` Gabriel Paubert
@ 1999-12-08 17:33         ` David Monro
  1999-12-09  1:33           ` Hollis R Blanchard
  1999-12-09  9:14           ` Gabriel Paubert
  0 siblings, 2 replies; 7+ messages in thread
From: David Monro @ 1999-12-08 17:33 UTC
  To: linuxppc-workstation, linuxppc-dev

Sorry to reply so late, but I went on holiday.

Gabriel Paubert wrote:
> 
> On Thu, 4 Nov 1999, Hollis R Blanchard wrote:
> 
[..]
> 
> Using interrupt 13 is strange, to say the least. It was reserved for FPU
> errors on x86 processors. In your case obviously the interrupts are
> expected (since you have a timeout) but have stayed masked for some
> reason. We have to find where this happens. Some code paths might have an
> unbalanced enable/disable_irq but I suspect that it will be hard to find.
> 

Umm. Possible data point which may help here - I cannot cause my machine
to do anything silly unless I hit two interrupt sources at the same
time. I can compile kernels on the IDE disk (irq 13) till the cows come
home if I don't have the ethernet enabled and don't play with the mouse
too much. If I enable the ethernet (irq 15 I think), or play with the
mouse a lot (irq 12), sooner or later I die. My (very uneducated) guess
is that it has something to do with getting two interrupts in a very
short space of time. I guess I should try thrashing one of the lower 8
interrupts (serial mouse I guess would do it) and see if that can cause
problems, or whether it is restricted to the second controller.

[..]

Cheers,

        David


* Re: IRQ problems on IBM 850
  1999-12-08 17:33         ` David Monro
@ 1999-12-09  1:33           ` Hollis R Blanchard
  1999-12-09  9:14           ` Gabriel Paubert
  1 sibling, 0 replies; 7+ messages in thread
From: Hollis R Blanchard @ 1999-12-09  1:33 UTC
  To: David Monro; +Cc: linuxppc-workstation, linuxppc-dev


On Wed, 8 Dec 1999, David Monro wrote:
> 
> Gabriel Paubert wrote:
> > 
> > Using interrupt 13 is strange, to say the least. It was reserved for FPU
> > errors on x86 processors. In your case obviously the interrupts are
> > expected (since you have a timeout) but have stayed masked for some
> > reason. We have to find where this happens. Some code paths might have an
> > unbalanced enable/disable_irq but I suspect that it will be hard to find.
> 
> Umm. Possible data point which may help here - I cannot cause my machine
> to do anything silly unless I hit two interrupt sources at the same
> time. I can compile kernels on the IDE disk (irq 13) till the cows come
> home if I don't have the ethernet enabled and don't play with the mouse
> too much. If I enable the ethernet (irq 15 I think), or play with the
> mouse a lot (irq 12), sooner or later I die. My (very uneducated) guess
> is that it has something to do with getting two interrupts in a very
> short space of time. I guess I should try thrashing one of the lower 8
> interrupts (serial mouse I guess would do it) and see if that can cause
> problems, or whether it is restricted to the second controller.

The problem definitely occurs when two interrupts come in simultaneously on
the cascaded controller. It shouldn't happen with one cascaded and one
non-cascaded interrupt, though I guess it couldn't hurt to verify...

I don't think anyone's gotten anywhere towards a fix though. I certainly
haven't had time to look into it, and neither has anyone else I've been in
contact with...

-Hollis



* Re: IRQ problems on IBM 850
  1999-12-08 17:33         ` David Monro
  1999-12-09  1:33           ` Hollis R Blanchard
@ 1999-12-09  9:14           ` Gabriel Paubert
  1 sibling, 0 replies; 7+ messages in thread
From: Gabriel Paubert @ 1999-12-09  9:14 UTC
  To: David Monro; +Cc: linuxppc-workstation, linuxppc-dev


On Wed, 8 Dec 1999, David Monro wrote:

> 
> Sorry to reply so late, but I went on holiday.
> 
> Gabriel Paubert wrote:
> > 
> > On Thu, 4 Nov 1999, Hollis R Blanchard wrote:
> > 
> [..]
> > 
> > Using interrupt 13 is strange, to say the least. It was reserved for FPU
> > errors on x86 processors. In your case obviously the interrupts are
> > expected (since you have a timeout) but have stayed masked for some
> > reason. We have to find where this happens. Some code paths might have an
> > unbalanced enable/disable_irq but I suspect that it will be hard to find.
> > 
> 
> Umm. Possible data point which may help here - I cannot cause my machine
> to do anything silly unless I hit two interrupt sources at the same
> time. I can compile kernels on the IDE disk (irq 13) till the cows come
> home if I don't have the ethernet enabled and don't play with the mouse
> too much. If I enable the ethernet (irq 15 I think), or play with the
> mouse a lot (irq 12), sooner or later I die. My (very uneducated) guess
> is that it has something to do with getting two interrupts in a very
> short space of time. I guess I should try thrashing one of the lower 8
> interrupts (serial mouse I guess would do it) and see if that can cause
> problems, or whether it is restricted to the second controller.

Ok, you are definitely onto something. The problem is that on my machines I
only ever use one of the high (8-15) interrupts: the mouse. All the others
are routed through the OpenPIC...

Now we have to find where and how these interrupts become masked; this is
likely a stupid blunder.

Could you replace your arch/ppc/kernel/{irq,i8259}.c with the ones included
in the attached tarball ?

It adds a line in /proc/interrupts to check the state of the 8259
registers: 
[root@vcorr1 linux]# cat /proc/interrupts 
           CPU0       
  1:       4201   i8259         keyboard
  2:          0   i8259         82c59 secondary cascade
  4:        198   i8259         serial
 12:         48   i8259         PS/2 Mouse
 16:          0   OpenPIC       82c59 cascade
 18:    4705191   OpenPIC       DC21140 (eth0)
 19:     160631   OpenPIC       ncr53c8xx
BAD:          1
8259 IMR/ISR/IRR = efe9/0000/0040
                   ^^^^
that one is interesting to see if some interrupts become masked
forever. 

	Gabriel.

[-- Attachment #2: Tarball to get more /proc info about i8259 state. --]
[-- Type: APPLICATION/octet-stream, Size: 6569 bytes --]
