linux-can.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
@ 2013-12-23 19:25 Austin Schuh
  2014-01-06 13:32 ` Oliver Hartkopp
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2013-12-23 19:25 UTC (permalink / raw)
  To: Thomas Gleixner, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, Oliver Hartkopp, linux-can

Hi Thomas,

Did anything happen with your patch to note_interrupt, originally
posted on May 8th of 2013?  (https://lkml.org/lkml/2013/3/7/222)

I am seeing an issue on a machine right now running a
config-preempt-rt kernel and a SJA1000 CAN card from PEAK.  It works
for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
message.  I posted on the Linux CAN mailing list, and Oliver Hartkopp
was able to reproduce the issue only on a realtime kernel.  A function
trace ending when the IRQ was disabled shows that note_interrupt is
being called regularly from the IRQ handler threads, and one of the
threads is doing work (and therefore calling note_interrupt with
IRQ_HANDLED).

Oliver Hartkopp and I ran tests over the weekend on numerous machines
and verified that the patch that you proposed fixes the problem.  We
think that the race condition that Till reported is causing the
problem here.

In reply to the comment about using the upper bit of
threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
that may still be an over-optimization, the code should still work.
All comparisons are done with the bit set, which just makes it a 31
bit counter.  It will take 8 more days for the counter to overflow on
my machine, so I won't know for certain until then.

My only concern is that there may still be a small race condition with
this new code.  If the interrupt handler thread is running at a
realtime priority, but lower than another task, it may not get run
until a large number of IRQs get triggered, and then process them
quickly.  With your new handler code, this would be counted as one
single handled interrupt.  With the current constants, this is only a
problem if more than 1000 calls to the handler happen between IRQs.  I
starved my card's irq threads by running 4 tasks at a higher realtime
priority than the handler threads, and saw the number of unhandled
IRQs jump from 1/100000 to 3/100000, so that problem may not show up
in practice.

Austin Schuh

Tested-by: Austin Schuh <austin@peloton-tech.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2013-12-23 19:25 [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs Austin Schuh
@ 2014-01-06 13:32 ` Oliver Hartkopp
  2014-04-07 18:38   ` Austin Schuh
  0 siblings, 1 reply; 9+ messages in thread
From: Oliver Hartkopp @ 2014-01-06 13:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Austin Schuh, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde,
	linux-can

Hi Thomas,

I just wanted to add my

Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>

In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
disappeared with the discussed patch with the -rt kernel.

The system was running at full CAN bus load over the weekend more than 72
hours of operation without problems:
 
           CPU0       CPU1       CPU2       CPU3       
  0:         40          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          0   IO-APIC-edge      i8042
  8:          0          0          1          0   IO-APIC-edge      rtc0
  9:         42         45         45         42   IO-APIC-fasteoi   acpi
 16:          9          8          8          8   IO-APIC-fasteoi   ahci, ehci_hcd:usb1, can4, can5, can6, can7
 17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, can10, can11, can9
 18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, can13, can14, can15
 19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19
(..)

Before the having the patch, it lasted 1 minutes to 1.5 hours (usually ~3
minutes) until the irq was killed due to the spurious detection using Linux
3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).

I also tested the patch on different latest 3.13-rc5+ (non-rt) kernels for two
weeks now without problems.

If you want me to test an improved version (as Austin suggested below) please
send a patch.

Best regards,
Oliver

On 23.12.2013 20:25, Austin Schuh wrote:
> Hi Thomas,
> 
> Did anything happen with your patch to note_interrupt, originally
> posted on May 8th of 2013?  (https://lkml.org/lkml/2013/3/7/222)
> 
> I am seeing an issue on a machine right now running a
> config-preempt-rt kernel and a SJA1000 CAN card from PEAK.  It works
> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
> message.  I posted on the Linux CAN mailing list, and Oliver Hartkopp
> was able to reproduce the issue only on a realtime kernel.  A function
> trace ending when the IRQ was disabled shows that note_interrupt is
> being called regularly from the IRQ handler threads, and one of the
> threads is doing work (and therefore calling note_interrupt with
> IRQ_HANDLED).
> 
> Oliver Hartkopp and I ran tests over the weekend on numerous machines
> and verified that the patch that you proposed fixes the problem.  We
> think that the race condition that Till reported is causing the
> problem here.
> 
> In reply to the comment about using the upper bit of
> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
> that may still be an over-optimization, the code should still work.
> All comparisons are done with the bit set, which just makes it a 31
> bit counter.  It will take 8 more days for the counter to overflow on
> my machine, so I won't know for certain until then.
> 
> My only concern is that there may still be a small race condition with
> this new code.  If the interrupt handler thread is running at a
> realtime priority, but lower than another task, it may not get run
> until a large number of IRQs get triggered, and then process them
> quickly.  With your new handler code, this would be counted as one
> single handled interrupt.  With the current constants, this is only a
> problem if more than 1000 calls to the handler happen between IRQs.  I
> starved my card's irq threads by running 4 tasks at a higher realtime
> priority than the handler threads, and saw the number of unhandled
> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
> in practice.
> 
> Austin Schuh
> 
> Tested-by: Austin Schuh <austin@peloton-tech.com>
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-01-06 13:32 ` Oliver Hartkopp
@ 2014-04-07 18:38   ` Austin Schuh
  2014-04-07 18:41     ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-07 18:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

Hi Thomas,

Did anything come of this patch?  Both Oliver and I have found that it
fixes real problems.  I have multiple machines which have been running
with the patch since December with no ill effects.

Thanks,
  Austin

On Mon, Jan 6, 2014 at 5:32 AM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> Hi Thomas,
>
> I just wanted to add my
>
> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
>
> In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
> disappeared with the discussed patch with the -rt kernel.
>
> The system was running at full CAN bus load over the weekend more than 72
> hours of operation without problems:
>
>            CPU0       CPU1       CPU2       CPU3
>   0:         40          0          0          0   IO-APIC-edge      timer
>   1:          1          0          0          0   IO-APIC-edge      i8042
>   8:          0          0          1          0   IO-APIC-edge      rtc0
>   9:         42         45         45         42   IO-APIC-fasteoi   acpi
>  16:          9          8          8          8   IO-APIC-fasteoi   ahci, ehci_hcd:usb1, can4, can5, can6, can7
>  17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, can10, can11, can9
>  18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, can13, can14, can15
>  19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19
> (..)
>
> Before the having the patch, it lasted 1 minutes to 1.5 hours (usually ~3
> minutes) until the irq was killed due to the spurious detection using Linux
> 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).
>
> I also tested the patch on different latest 3.13-rc5+ (non-rt) kernels for two
> weeks now without problems.
>
> If you want me to test an improved version (as Austin suggested below) please
> send a patch.
>
> Best regards,
> Oliver
>
> On 23.12.2013 20:25, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything happen with your patch to note_interrupt, originally
>> posted on May 8th of 2013?  (https://lkml.org/lkml/2013/3/7/222)
>>
>> I am seeing an issue on a machine right now running a
>> config-preempt-rt kernel and a SJA1000 CAN card from PEAK.  It works
>> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
>> message.  I posted on the Linux CAN mailing list, and Oliver Hartkopp
>> was able to reproduce the issue only on a realtime kernel.  A function
>> trace ending when the IRQ was disabled shows that note_interrupt is
>> being called regularly from the IRQ handler threads, and one of the
>> threads is doing work (and therefore calling note_interrupt with
>> IRQ_HANDLED).
>>
>> Oliver Hartkopp and I ran tests over the weekend on numerous machines
>> and verified that the patch that you proposed fixes the problem.  We
>> think that the race condition that Till reported is causing the
>> problem here.
>>
>> In reply to the comment about using the upper bit of
>> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
>> that may still be an over-optimization, the code should still work.
>> All comparisons are done with the bit set, which just makes it a 31
>> bit counter.  It will take 8 more days for the counter to overflow on
>> my machine, so I won't know for certain until then.
>>
>> My only concern is that there may still be a small race condition with
>> this new code.  If the interrupt handler thread is running at a
>> realtime priority, but lower than another task, it may not get run
>> until a large number of IRQs get triggered, and then process them
>> quickly.  With your new handler code, this would be counted as one
>> single handled interrupt.  With the current constants, this is only a
>> problem if more than 1000 calls to the handler happen between IRQs.  I
>> starved my card's irq threads by running 4 tasks at a higher realtime
>> priority than the handler threads, and saw the number of unhandled
>> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
>> in practice.
>>
>> Austin Schuh
>>
>> Tested-by: Austin Schuh <austin@peloton-tech.com>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 18:38   ` Austin Schuh
@ 2014-04-07 18:41     ` Thomas Gleixner
  2014-04-07 20:05       ` Austin Schuh
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2014-04-07 18:41 UTC (permalink / raw)
  To: Austin Schuh
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

On Mon, 7 Apr 2014, Austin Schuh wrote:

> Hi Thomas,
> 
> Did anything come of this patch?  Both Oliver and I have found that it
> fixes real problems.  I have multiple machines which have been running
> with the patch since December with no ill effects.

No, sorry. It fell through the cracks. Care to resend ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 18:41     ` Thomas Gleixner
@ 2014-04-07 20:05       ` Austin Schuh
  2014-04-07 20:07         ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-07 20:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>
>> Hi Thomas,
>>
>> Did anything come of this patch?  Both Oliver and I have found that it
>> fixes real problems.  I have multiple machines which have been running
>> with the patch since December with no ill effects.
>
> No, sorry. It fell through the cracks. Care to resend ?
>
> Thanks,
>
>         tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

You originally sent the patch out.  I could send your patch out back
to you, but that feels a bit weird ;)

Austin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 20:05       ` Austin Schuh
@ 2014-04-07 20:07         ` Thomas Gleixner
  2014-04-07 20:08           ` Austin Schuh
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2014-04-07 20:07 UTC (permalink / raw)
  To: Austin Schuh
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

On Mon, 7 Apr 2014, Austin Schuh wrote:

> On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 7 Apr 2014, Austin Schuh wrote:
> >
> >> Hi Thomas,
> >>
> >> Did anything come of this patch?  Both Oliver and I have found that it
> >> fixes real problems.  I have multiple machines which have been running
> >> with the patch since December with no ill effects.
> >
> > No, sorry. It fell through the cracks. Care to resend ?
> >
> > Thanks,
> >
> >         tglx
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-can" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> You originally sent the patch out.  I could send your patch out back
> to you, but that feels a bit weird ;)

Wheee. Let me dig in my archives ....

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 20:07         ` Thomas Gleixner
@ 2014-04-07 20:08           ` Austin Schuh
  2014-04-28 20:20             ` Austin Schuh
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-07 20:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> You originally sent the patch out.  I could send your patch out back
>> to you, but that feels a bit weird ;)
>
> Wheee. Let me dig in my archives ....

https://lkml.org/lkml/2013/3/7/222 in case that helps.

Austin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 20:08           ` Austin Schuh
@ 2014-04-28 20:20             ` Austin Schuh
  2014-04-28 20:44               ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-28 20:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Mon, 7 Apr 2014, Austin Schuh wrote:
>>> You originally sent the patch out.  I could send your patch out back
>>> to you, but that feels a bit weird ;)
>>
>> Wheee. Let me dig in my archives ....
>
> https://lkml.org/lkml/2013/3/7/222 in case that helps.

Did you find the patch?  I didn't see anything go by (but I'm not on
the main mailing list and didn't find anything with a quick Google
search.)  It would be nice to not need to run a custom kernel to keep
my machine running.  I have what is probably a year split between 2
machines of runtime with the patch applied, and I haven't seen any
problems with it.

Austin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-28 20:20             ` Austin Schuh
@ 2014-04-28 20:44               ` Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2014-04-28 20:44 UTC (permalink / raw)
  To: Austin Schuh
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
	Marc Kleine-Budde, linux-can, linux-kernel

On Mon, 28 Apr 2014, Austin Schuh wrote:
> On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> > On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Mon, 7 Apr 2014, Austin Schuh wrote:
> >>> You originally sent the patch out.  I could send your patch out back
> >>> to you, but that feels a bit weird ;)
> >>
> >> Wheee. Let me dig in my archives ....
> >
> > https://lkml.org/lkml/2013/3/7/222 in case that helps.
> 
> Did you find the patch?  I didn't see anything go by (but I'm not on
> the main mailing list and didn't find anything with a quick Google
> search.)  It would be nice to not need to run a custom kernel to keep
> my machine running.  I have what is probably a year split between 2
> machines of runtime with the patch applied, and I haven't seen any
> problems with it.

It's on my list, but Easter and traveling did not really help :)

There are a few issues with the patch I need to think through, but
I'll get to it in the next few days.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-04-28 20:43 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-23 19:25 [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs Austin Schuh
2014-01-06 13:32 ` Oliver Hartkopp
2014-04-07 18:38   ` Austin Schuh
2014-04-07 18:41     ` Thomas Gleixner
2014-04-07 20:05       ` Austin Schuh
2014-04-07 20:07         ` Thomas Gleixner
2014-04-07 20:08           ` Austin Schuh
2014-04-28 20:20             ` Austin Schuh
2014-04-28 20:44               ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).