* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
@ 2013-12-23 19:25 Austin Schuh
2014-01-06 13:32 ` Oliver Hartkopp
0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2013-12-23 19:25 UTC (permalink / raw)
To: Thomas Gleixner, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, Oliver Hartkopp, linux-can
Hi Thomas,
Did anything happen with your patch to note_interrupt, originally
posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
I am seeing an issue on a machine right now running a
config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
was able to reproduce the issue only on a realtime kernel. A function
trace ending when the IRQ was disabled shows that note_interrupt is
being called regularly from the IRQ handler threads, and one of the
threads is doing work (and therefore calling note_interrupt with
IRQ_HANDLED).
Oliver Hartkopp and I ran tests over the weekend on numerous machines
and verified that the patch that you proposed fixes the problem. We
think that the race condition that Till reported is causing the
problem here.
In reply to the comment about using the upper bit of
threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
that may still be an over-optimization, the code should still work.
All comparisons are done with the bit set, which just makes it a 31
bit counter. It will take 8 more days for the counter to overflow on
my machine, so I won't know for certain until then.
My only concern is that there may still be a small race condition with
this new code. If the interrupt handler thread is running at a
realtime priority, but lower than another task, it may not get run
until a large number of IRQs get triggered, and then process them
quickly. With your new handler code, this would be counted as one
single handled interrupt. With the current constants, this is only a
problem if more than 1000 calls to the handler happen between IRQs. I
starved my card's irq threads by running 4 tasks at a higher realtime
priority than the handler threads, and saw the number of unhandled
IRQs jump from 1/100000 to 3/100000, so that problem may not show up
in practice.
Austin Schuh
Tested-by: Austin Schuh <austin@peloton-tech.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2013-12-23 19:25 [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs Austin Schuh
@ 2014-01-06 13:32 ` Oliver Hartkopp
2014-04-07 18:38 ` Austin Schuh
0 siblings, 1 reply; 9+ messages in thread
From: Oliver Hartkopp @ 2014-01-06 13:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Austin Schuh, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde,
linux-can
Hi Thomas,
I just wanted to add my
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
disappeared with the discussed patch with the -rt kernel.
The system was running at full CAN bus load over the weekend more than 72
hours of operation without problems:
CPU0 CPU1 CPU2 CPU3
0: 40 0 0 0 IO-APIC-edge timer
1: 1 0 0 0 IO-APIC-edge i8042
8: 0 0 1 0 IO-APIC-edge rtc0
9: 42 45 45 42 IO-APIC-fasteoi acpi
16: 9 8 8 8 IO-APIC-fasteoi ahci, ehci_hcd:usb1, can4, can5, can6, can7
17: 441468642 443275488 443609061 441436145 IO-APIC-fasteoi can8, can10, can11, can9
18: 441975412 438811422 437317802 441209092 IO-APIC-fasteoi can12, can13, can14, can15
19: 427310388 428661677 429813687 428095739 IO-APIC-fasteoi can0, can1, can2, can3, can16, can17, can18, can19
(..)
Before the having the patch, it lasted 1 minutes to 1.5 hours (usually ~3
minutes) until the irq was killed due to the spurious detection using Linux
3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).
I also tested the patch on different latest 3.13-rc5+ (non-rt) kernels for two
weeks now without problems.
If you want me to test an improved version (as Austin suggested below) please
send a patch.
Best regards,
Oliver
On 23.12.2013 20:25, Austin Schuh wrote:
> Hi Thomas,
>
> Did anything happen with your patch to note_interrupt, originally
> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>
> I am seeing an issue on a machine right now running a
> config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
> message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
> was able to reproduce the issue only on a realtime kernel. A function
> trace ending when the IRQ was disabled shows that note_interrupt is
> being called regularly from the IRQ handler threads, and one of the
> threads is doing work (and therefore calling note_interrupt with
> IRQ_HANDLED).
>
> Oliver Hartkopp and I ran tests over the weekend on numerous machines
> and verified that the patch that you proposed fixes the problem. We
> think that the race condition that Till reported is causing the
> problem here.
>
> In reply to the comment about using the upper bit of
> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
> that may still be an over-optimization, the code should still work.
> All comparisons are done with the bit set, which just makes it a 31
> bit counter. It will take 8 more days for the counter to overflow on
> my machine, so I won't know for certain until then.
>
> My only concern is that there may still be a small race condition with
> this new code. If the interrupt handler thread is running at a
> realtime priority, but lower than another task, it may not get run
> until a large number of IRQs get triggered, and then process them
> quickly. With your new handler code, this would be counted as one
> single handled interrupt. With the current constants, this is only a
> problem if more than 1000 calls to the handler happen between IRQs. I
> starved my card's irq threads by running 4 tasks at a higher realtime
> priority than the handler threads, and saw the number of unhandled
> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
> in practice.
>
> Austin Schuh
>
> Tested-by: Austin Schuh <austin@peloton-tech.com>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-01-06 13:32 ` Oliver Hartkopp
@ 2014-04-07 18:38 ` Austin Schuh
2014-04-07 18:41 ` Thomas Gleixner
0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-07 18:38 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
Hi Thomas,
Did anything come of this patch? Both Oliver and I have found that it
fixes real problems. I have multiple machines which have been running
with the patch since December with no ill effects.
Thanks,
Austin
On Mon, Jan 6, 2014 at 5:32 AM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> Hi Thomas,
>
> I just wanted to add my
>
> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
>
> In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
> disappeared with the discussed patch with the -rt kernel.
>
> The system was running at full CAN bus load over the weekend more than 72
> hours of operation without problems:
>
> CPU0 CPU1 CPU2 CPU3
> 0: 40 0 0 0 IO-APIC-edge timer
> 1: 1 0 0 0 IO-APIC-edge i8042
> 8: 0 0 1 0 IO-APIC-edge rtc0
> 9: 42 45 45 42 IO-APIC-fasteoi acpi
> 16: 9 8 8 8 IO-APIC-fasteoi ahci, ehci_hcd:usb1, can4, can5, can6, can7
> 17: 441468642 443275488 443609061 441436145 IO-APIC-fasteoi can8, can10, can11, can9
> 18: 441975412 438811422 437317802 441209092 IO-APIC-fasteoi can12, can13, can14, can15
> 19: 427310388 428661677 429813687 428095739 IO-APIC-fasteoi can0, can1, can2, can3, can16, can17, can18, can19
> (..)
>
> Before the having the patch, it lasted 1 minutes to 1.5 hours (usually ~3
> minutes) until the irq was killed due to the spurious detection using Linux
> 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).
>
> I also tested the patch on different latest 3.13-rc5+ (non-rt) kernels for two
> weeks now without problems.
>
> If you want me to test an improved version (as Austin suggested below) please
> send a patch.
>
> Best regards,
> Oliver
>
> On 23.12.2013 20:25, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything happen with your patch to note_interrupt, originally
>> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>>
>> I am seeing an issue on a machine right now running a
>> config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
>> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
>> message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
>> was able to reproduce the issue only on a realtime kernel. A function
>> trace ending when the IRQ was disabled shows that note_interrupt is
>> being called regularly from the IRQ handler threads, and one of the
>> threads is doing work (and therefore calling note_interrupt with
>> IRQ_HANDLED).
>>
>> Oliver Hartkopp and I ran tests over the weekend on numerous machines
>> and verified that the patch that you proposed fixes the problem. We
>> think that the race condition that Till reported is causing the
>> problem here.
>>
>> In reply to the comment about using the upper bit of
>> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
>> that may still be an over-optimization, the code should still work.
>> All comparisons are done with the bit set, which just makes it a 31
>> bit counter. It will take 8 more days for the counter to overflow on
>> my machine, so I won't know for certain until then.
>>
>> My only concern is that there may still be a small race condition with
>> this new code. If the interrupt handler thread is running at a
>> realtime priority, but lower than another task, it may not get run
>> until a large number of IRQs get triggered, and then process them
>> quickly. With your new handler code, this would be counted as one
>> single handled interrupt. With the current constants, this is only a
>> problem if more than 1000 calls to the handler happen between IRQs. I
>> starved my card's irq threads by running 4 tasks at a higher realtime
>> priority than the handler threads, and saw the number of unhandled
>> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
>> in practice.
>>
>> Austin Schuh
>>
>> Tested-by: Austin Schuh <austin@peloton-tech.com>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-04-07 18:38 ` Austin Schuh
@ 2014-04-07 18:41 ` Thomas Gleixner
2014-04-07 20:05 ` Austin Schuh
0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2014-04-07 18:41 UTC (permalink / raw)
To: Austin Schuh
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
On Mon, 7 Apr 2014, Austin Schuh wrote:
> Hi Thomas,
>
> Did anything come of this patch? Both Oliver and I have found that it
> fixes real problems. I have multiple machines which have been running
> with the patch since December with no ill effects.
No, sorry. It fell through the cracks. Care to resend ?
Thanks,
tglx
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-04-07 18:41 ` Thomas Gleixner
@ 2014-04-07 20:05 ` Austin Schuh
2014-04-07 20:07 ` Thomas Gleixner
0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-07 20:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>
>> Hi Thomas,
>>
>> Did anything come of this patch? Both Oliver and I have found that it
>> fixes real problems. I have multiple machines which have been running
>> with the patch since December with no ill effects.
>
> No, sorry. It fell through the cracks. Care to resend ?
>
> Thanks,
>
> tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-can" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
You originally sent the patch out. I could send your patch out back
to you, but that feels a bit weird ;)
Austin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-04-07 20:05 ` Austin Schuh
@ 2014-04-07 20:07 ` Thomas Gleixner
2014-04-07 20:08 ` Austin Schuh
0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2014-04-07 20:07 UTC (permalink / raw)
To: Austin Schuh
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
On Mon, 7 Apr 2014, Austin Schuh wrote:
> On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 7 Apr 2014, Austin Schuh wrote:
> >
> >> Hi Thomas,
> >>
> >> Did anything come of this patch? Both Oliver and I have found that it
> >> fixes real problems. I have multiple machines which have been running
> >> with the patch since December with no ill effects.
> >
> > No, sorry. It fell through the cracks. Care to resend ?
> >
> > Thanks,
> >
> > tglx
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-can" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> You originally sent the patch out. I could send your patch out back
> to you, but that feels a bit weird ;)
Wheee. Let me dig in my archives ....
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-04-07 20:07 ` Thomas Gleixner
@ 2014-04-07 20:08 ` Austin Schuh
2014-04-28 20:20 ` Austin Schuh
0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-07 20:08 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> You originally sent the patch out. I could send your patch out back
>> to you, but that feels a bit weird ;)
>
> Wheee. Let me dig in my archives ....
https://lkml.org/lkml/2013/3/7/222 in case that helps.
Austin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-04-07 20:08 ` Austin Schuh
@ 2014-04-28 20:20 ` Austin Schuh
2014-04-28 20:44 ` Thomas Gleixner
0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2014-04-28 20:20 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Mon, 7 Apr 2014, Austin Schuh wrote:
>>> You originally sent the patch out. I could send your patch out back
>>> to you, but that feels a bit weird ;)
>>
>> Wheee. Let me dig in my archives ....
>
> https://lkml.org/lkml/2013/3/7/222 in case that helps.
Did you find the patch? I didn't see anything go by (but I'm not on
the main mailing list and didn't find anything with a quick Google
search.) It would be nice to not need to run a custom kernel to keep
my machine running. I have what is probably a year split between 2
machines of runtime with the patch applied, and I haven't seen any
problems with it.
Austin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
2014-04-28 20:20 ` Austin Schuh
@ 2014-04-28 20:44 ` Thomas Gleixner
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2014-04-28 20:44 UTC (permalink / raw)
To: Austin Schuh
Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa,
Marc Kleine-Budde, linux-can, linux-kernel
On Mon, 28 Apr 2014, Austin Schuh wrote:
> On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> > On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Mon, 7 Apr 2014, Austin Schuh wrote:
> >>> You originally sent the patch out. I could send your patch out back
> >>> to you, but that feels a bit weird ;)
> >>
> >> Wheee. Let me dig in my archives ....
> >
> > https://lkml.org/lkml/2013/3/7/222 in case that helps.
>
> Did you find the patch? I didn't see anything go by (but I'm not on
> the main mailing list and didn't find anything with a quick Google
> search.) It would be nice to not need to run a custom kernel to keep
> my machine running. I have what is probably a year split between 2
> machines of runtime with the patch applied, and I haven't seen any
> problems with it.
It's on my list, but Easter and traveling did not really help :)
There are a few issues with the patch I need to think through, but
I'll get to it in the next few days.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-04-28 20:43 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-23 19:25 [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs Austin Schuh
2014-01-06 13:32 ` Oliver Hartkopp
2014-04-07 18:38 ` Austin Schuh
2014-04-07 18:41 ` Thomas Gleixner
2014-04-07 20:05 ` Austin Schuh
2014-04-07 20:07 ` Thomas Gleixner
2014-04-07 20:08 ` Austin Schuh
2014-04-28 20:20 ` Austin Schuh
2014-04-28 20:44 ` Thomas Gleixner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).