Posssible bug in kernel/irq/handle.c

All of lore.kernel.org
 help / color / mirror / Atom feed

* Posssible bug in kernel/irq/handle.c
@ 2005-11-09  0:38 Benjamin Herrenschmidt
  2005-11-09  1:04 ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin Herrenschmidt @ 2005-11-09  0:38 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar; +Cc: Paul Mackerras, Linux Kernel list

Hi Ingo, Linus !

We were going through __do_IRQ (trying to figure out if we can use it
instead of keeping our own in ppc64) while we found that bit that seems
bogus to me:

	/*
	 * If the IRQ is disabled for whatever reason, we cannot
	 * use the action we have.
	 */
	action = NULL;
	if (likely(!(status & (IRQ_DISABLED | IRQ_INPROGRESS)))) {
		action = desc->action;
		status &= ~IRQ_PENDING; /* we commit to handling */
		status |= IRQ_INPROGRESS; /* we are handling it */
	}
	desc->status = status;

	/*
	 * If there is no IRQ handler or it was disabled, exit early.
	 * Since we set PENDING, if another processor is handling
	 * a different instance of this same irq, the other processor
	 * will take care of it.
	 */
	if (unlikely(!action))
		goto out;

Now, look at what's going on if there is no action, that is desc->action
is NULL. In that case, the code will go out, leaving the IRQ marked
IN_PROGRESS, call the end() handler and go out without ever calling
note_interrupt().

That means that

 1) The interrupt will be stuck IN_PROGRESS. I don't see how IN_PROGRESS
can ever be cleared afterward

 2) We won't go through the code in note_interrupt() that protects us
against a stuck interrupt, so if the interrupt is indeed stuck, we'll
just lockup the processor taking the same IRQ for ever (and not being
able to handle it, even if an action magically gets registered, due to
1)

I think we need to differentiate the case DISABLED | IN_PROGRESS from
the case no action.

Something like (just typing in the mailer) :

	if (unlikely(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
		desc->status = status;
		goto out;
	}
	action = desc->action;
	status &= ~IRQ_PENDING; /* we commit to handling */
	status |= IRQ_INPROGRESS; /* we are handling it */
	desc->status = status;

Then, test for action != NULL in the for (;;) loop before calling
handle_IRQ_event() and set action_ret to IRQ_NONE by default. (I think
we should get to the for loop and not avoid it, in case the PENDING bit
happens to be set by another CPU in the meantime).

Did I miss something ? If you think that's ok, I'll produce a patch.

In addition, I'd like an arch hook in note_interrupt() in order to pass
unhandled interrupts to the firmware but that's a different matter (still
let me know if you have any objection)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Posssible bug in kernel/irq/handle.c
  2005-11-09  0:38 Posssible bug in kernel/irq/handle.c Benjamin Herrenschmidt
@ 2005-11-09  1:04 ` Linus Torvalds
  2005-11-09  1:50   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2005-11-09  1:04 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Ingo Molnar, Paul Mackerras, Linux Kernel list

On Wed, 9 Nov 2005, Benjamin Herrenschmidt wrote:
> 
> Now, look at what's going on if there is no action, that is desc->action
> is NULL. In that case, the code will go out, leaving the IRQ marked
> IN_PROGRESS, call the end() handler and go out without ever calling
> note_interrupt().

Not a bug afaik.

> That means that
> 
>  1) The interrupt will be stuck IN_PROGRESS. I don't see how IN_PROGRESS
> can ever be cleared afterward

If desc->action is NULL, the flags are supposed to be cleared when we get 
an action. See kernel/irq/manage.c: setup_irq(), and in particular the 
case where we had no handler before (ie the "!shared" case).

>  2) We won't go through the code in note_interrupt() that protects us
> against a stuck interrupt, so if the interrupt is indeed stuck, we'll
> just lockup the processor taking the same IRQ for ever (and not being
> able to handle it, even if an action magically gets registered, due to
> 1)

If the irq is stuck, you have serious problems with your interrupt 
controller. By definition, the irq should be disabled since there are no 
handlers for that interrupt.

So I think the code is correct. It has certainly worked for years on x86 
(and it got serious debugging, since we had some rather nasty and subtle 
issues with edge-triggered APIC interrupts that just get lost if they are 
disabled at the controller).

			Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Posssible bug in kernel/irq/handle.c
  2005-11-09  1:04 ` Linus Torvalds
@ 2005-11-09  1:50   ` Benjamin Herrenschmidt
  2005-11-09  2:07     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Benjamin Herrenschmidt @ 2005-11-09  1:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Paul Mackerras, Linux Kernel list

On Tue, 2005-11-08 at 17:04 -0800, Linus Torvalds wrote:

> >  1) The interrupt will be stuck IN_PROGRESS. I don't see how IN_PROGRESS
> > can ever be cleared afterward
> 
> If desc->action is NULL, the flags are supposed to be cleared when we get 
> an action. See kernel/irq/manage.c: setup_irq(), and in particular the 
> case where we had no handler before (ie the "!shared" case).

Yes, but in the meantime, the CPU will be stuck taking the same irq for
ever without ever going through note_interrupt().

> >  2) We won't go through the code in note_interrupt() that protects us
> > against a stuck interrupt, so if the interrupt is indeed stuck, we'll
> > just lockup the processor taking the same IRQ for ever (and not being
> > able to handle it, even if an action magically gets registered, due to
> > 1)
> 
> If the irq is stuck, you have serious problems with your interrupt 
> controller. By definition, the irq should be disabled since there are no 
> handlers for that interrupt.

Hrm... yah, it should... still, I find that a bit fragile.

> So I think the code is correct. It has certainly worked for years on x86 
> (and it got serious debugging, since we had some rather nasty and subtle 
> issues with edge-triggered APIC interrupts that just get lost if they are 
> disabled at the controller).

Well, we have a "funny" case with some pSeries... the firmware may
enable interrupts behind our back, and expects us to call a firmware
"try to handle that interrupt" kind of call when we get one we don't
handle. That is, either all the handlers returned IRQ_NONE or there is
no action. I'm not sure how to do that with the current code without
having our own __do_IRQ() which I'd rather avoid...

Ben.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Posssible bug in kernel/irq/handle.c
  2005-11-09  1:50   ` Benjamin Herrenschmidt
@ 2005-11-09  2:07     ` Linus Torvalds
  2005-11-09  2:16       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2005-11-09  2:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Ingo Molnar, Paul Mackerras, Linux Kernel list

On Wed, 9 Nov 2005, Benjamin Herrenschmidt wrote:
> 
> Hrm... yah, it should... still, I find that a bit fragile.
> 
> > So I think the code is correct. It has certainly worked for years on x86 
> > (and it got serious debugging, since we had some rather nasty and subtle 
> > issues with edge-triggered APIC interrupts that just get lost if they are 
> > disabled at the controller).
> 
> Well, we have a "funny" case with some pSeries... the firmware may
> enable interrupts behind our back, and expects us to call a firmware
> "try to handle that interrupt" kind of call when we get one we don't
> handle. That is, either all the handlers returned IRQ_NONE or there is
> no action. I'm not sure how to do that with the current code without
> having our own __do_IRQ() which I'd rather avoid...

Well, I don't think it would be _wrong_ to change the IRQ_INPROGRESS 
handling to work so that it's usabel for PPC.

Some of it may actually be purely historical: I have this dim memory that 
we may have used IRQ_INPROGRESS for the irq autodetection originally (it 
now uses the IRQ_WAITING flag for that).

So I was really only arguing for it not being actively buggy the way it is 
now - which is not to say that we can't change it to be friendlier to 
whatever your needs are.

Be vewy vewy caweful when changing that code, though. If you end up with a 
patch, please try to give it some nice stress-testing (both on ppc and 
x86), and then post it for comments, ok? Maybe the arch mailing list and 
Ingo (who else has touched that logic?)

		Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Posssible bug in kernel/irq/handle.c
  2005-11-09  2:07     ` Linus Torvalds
@ 2005-11-09  2:16       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 5+ messages in thread
From: Benjamin Herrenschmidt @ 2005-11-09  2:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Paul Mackerras, Linux Kernel list

> Be vewy vewy caweful when changing that code, though. If you end up with a 
> patch, please try to give it some nice stress-testing (both on ppc and 
> x86), and then post it for comments, ok? Maybe the arch mailing list and 
> Ingo (who else has touched that logic?)

Ok, I'll try to avoid touching that code. In a perfect world, we should
have proper handlers for those firmware interrupts anyway, it's just
that the "spec" says we should call the firmware for any interrupt we
don't handle...

I suppose it should be enough for us to test for desc->action before
calling __do_IRQ() and eventually do the firmware trick then, since I
doubt that if it matters at all, it will happen on shared interrupts...

Ben.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2005-11-09  2:18 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-11-09  0:38 Posssible bug in kernel/irq/handle.c Benjamin Herrenschmidt
2005-11-09  1:04 ` Linus Torvalds
2005-11-09  1:50   ` Benjamin Herrenschmidt
2005-11-09  2:07     ` Linus Torvalds
2005-11-09  2:16       ` Benjamin Herrenschmidt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.