* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
@ 2013-12-23 19:25 Austin Schuh
  2014-01-06 13:32 ` Oliver Hartkopp
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Schuh @ 2013-12-23 19:25 UTC
  To: Thomas Gleixner, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, Oliver Hartkopp, linux-can

Hi Thomas,

Did anything happen with your patch to note_interrupt, originally
posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)

I am seeing an issue on a machine right now running a
config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
was able to reproduce the issue only on a realtime kernel. A function
trace ending when the IRQ was disabled shows that note_interrupt is
being called regularly from the IRQ handler threads, and one of the
threads is doing work (and therefore calling note_interrupt with
IRQ_HANDLED).

Oliver Hartkopp and I ran tests over the weekend on numerous machines
and verified that the patch that you proposed fixes the problem. We
think that the race condition that Till reported is causing the
problem here.

In reply to the comment about using the upper bit of
threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
that may still be an over-optimization, the code should still work.
All comparisons are done with the bit set, which just makes it a 31
bit counter. It will take 8 more days for the counter to overflow on
my machine, so I won't know for certain until then.

My only concern is that there may still be a small race condition with
this new code. If the interrupt handler thread is running at a
realtime priority, but lower than another task, it may not get run
until a large number of IRQs get triggered, and then process them
quickly. With your new handler code, this would be counted as one
single handled interrupt. With the current constants, this is only a
problem if more than 1000 calls to the handler happen between IRQs. I
starved my card's irq threads by running 4 tasks at a higher realtime
priority than the handler threads, and saw the number of unhandled
IRQs jump from 1/100000 to 3/100000, so that problem may not show up
in practice.

Austin Schuh

Tested-by: Austin Schuh <austin@peloton-tech.com>

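The 31-bit counter argument and the starvation concern in the message above can be made concrete with a small userspace model in C. This is only a sketch under stated assumptions, not the patch under discussion: `struct irq_model`, `thread_handled()`, `note_wake_thread()` and `starved_burst_demo()` are hypothetical names invented for illustration. Only `threads_handled_last` and `SPURIOUS_DEFERRED` are named in the thread, and the value 0x80000000 is assumed from the phrase "upper bit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: the "upper bit" of a 32-bit word. */
#define SPURIOUS_DEFERRED 0x80000000u

/* Hypothetical stand-in for per-IRQ bookkeeping; not the kernel's struct. */
struct irq_model {
    uint32_t threads_handled;      /* bumped by the handler thread per handled IRQ */
    uint32_t threads_handled_last; /* snapshot from the hard-IRQ path, flag in bit 31 */
    uint32_t irqs_unhandled;       /* what a spurious-IRQ detector would count */
};

/* Handler thread: it did real work for one interrupt. */
static void thread_handled(struct irq_model *m)
{
    assert(m);
    m->threads_handled++;
}

/*
 * Hard-IRQ path when the primary handler only wakes the thread.
 * Returns true if the previous interrupt is judged handled, or if
 * there is no previous interrupt to judge yet.
 */
static bool note_wake_thread(struct irq_model *m)
{
    uint32_t handled;

    assert(m);
    if (!(m->threads_handled_last & SPURIOUS_DEFERRED)) {
        /* First deferred interrupt: only arm the flag. */
        m->threads_handled_last |= SPURIOUS_DEFERRED;
        return true;
    }

    /* Compare with the flag bit forced on, so only 31 bits carry count. */
    handled = m->threads_handled | SPURIOUS_DEFERRED;
    if (handled != m->threads_handled_last) {
        m->threads_handled_last = handled;
        return true;
    }

    m->irqs_unhandled++;
    return false;
}

/*
 * Replays the starvation case: three IRQs arrive while the thread cannot
 * run, then the thread handles all of them in one burst.
 * Returns the number of IRQs that were counted as unhandled.
 */
static uint32_t starved_burst_demo(void)
{
    struct irq_model m = {0};
    int i;

    note_wake_thread(&m);        /* first IRQ, nothing to judge */
    thread_handled(&m);
    note_wake_thread(&m);        /* previous IRQ judged handled */
    for (i = 0; i < 3; i++)
        note_wake_thread(&m);    /* thread starved: each one looks unhandled */
    for (i = 0; i < 3; i++)
        thread_handled(&m);      /* thread catches up in one burst */
    note_wake_thread(&m);        /* a single "handled" verdict for the burst */
    return m.irqs_unhandled;
}
```

In this model the burst of three handled interrupts produces one "handled" verdict while three earlier checks were counted as unhandled, which is the small miscount described above. Because the comparison always has bit 31 set on both sides, a wrap of the low 31 bits still registers as a change unless the count has advanced by an exact multiple of 2^31 between two checks.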
* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2013-12-23 19:25 [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs Austin Schuh
@ 2014-01-06 13:32 ` Oliver Hartkopp
  2014-04-07 18:38   ` Austin Schuh
From: Oliver Hartkopp @ 2014-01-06 13:32 UTC
  To: Thomas Gleixner
  Cc: Austin Schuh, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can

Hi Thomas,

I just wanted to add my

Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>

In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
disappeared with the discussed patch with the -rt kernel.

The system was running at full CAN bus load over the weekend for more
than 72 hours of operation without problems:

           CPU0       CPU1       CPU2       CPU3
  0:         40          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          0   IO-APIC-edge      i8042
  8:          0          0          1          0   IO-APIC-edge      rtc0
  9:         42         45         45         42   IO-APIC-fasteoi   acpi
 16:          9          8          8          8   IO-APIC-fasteoi   ahci, ehci_hcd:usb1, can4, can5, can6, can7
 17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, can10, can11, can9
 18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, can13, can14, can15
 19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19
(..)

Before having the patch, it lasted 1 minute to 1.5 hours (usually ~3
minutes) until the irq was killed due to the spurious detection using Linux
3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).

I also tested the patch on various recent 3.13-rc5+ (non-rt) kernels for two
weeks now without problems.

If you want me to test an improved version (as Austin suggested below) please
send a patch.

Best regards,
Oliver

On 23.12.2013 20:25, Austin Schuh wrote:
> Hi Thomas,
>
> Did anything happen with your patch to note_interrupt, originally
> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>
> I am seeing an issue on a machine right now running a
> config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
> message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
> was able to reproduce the issue only on a realtime kernel. A function
> trace ending when the IRQ was disabled shows that note_interrupt is
> being called regularly from the IRQ handler threads, and one of the
> threads is doing work (and therefore calling note_interrupt with
> IRQ_HANDLED).
>
> Oliver Hartkopp and I ran tests over the weekend on numerous machines
> and verified that the patch that you proposed fixes the problem. We
> think that the race condition that Till reported is causing the
> problem here.
>
> In reply to the comment about using the upper bit of
> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
> that may still be an over-optimization, the code should still work.
> All comparisons are done with the bit set, which just makes it a 31
> bit counter. It will take 8 more days for the counter to overflow on
> my machine, so I won't know for certain until then.
>
> My only concern is that there may still be a small race condition with
> this new code. If the interrupt handler thread is running at a
> realtime priority, but lower than another task, it may not get run
> until a large number of IRQs get triggered, and then process them
> quickly. With your new handler code, this would be counted as one
> single handled interrupt. With the current constants, this is only a
> problem if more than 1000 calls to the handler happen between IRQs. I
> starved my card's irq threads by running 4 tasks at a higher realtime
> priority than the handler threads, and saw the number of unhandled
> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
> in practice.
>
> Austin Schuh
>
> Tested-by: Austin Schuh <austin@peloton-tech.com>
>

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-01-06 13:32 ` Oliver Hartkopp
@ 2014-04-07 18:38   ` Austin Schuh
  2014-04-07 18:41     ` Thomas Gleixner
From: Austin Schuh @ 2014-04-07 18:38 UTC
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

Hi Thomas,

Did anything come of this patch? Both Oliver and I have found that it
fixes real problems. I have multiple machines which have been running
with the patch since December with no ill effects.

Thanks,
Austin

On Mon, Jan 6, 2014 at 5:32 AM, Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> Hi Thomas,
>
> I just wanted to add my
>
> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
>
> In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
> disappeared with the discussed patch with the -rt kernel.
>
> The system was running at full CAN bus load over the weekend for more
> than 72 hours of operation without problems:
>
>            CPU0       CPU1       CPU2       CPU3
>   0:         40          0          0          0   IO-APIC-edge      timer
>   1:          1          0          0          0   IO-APIC-edge      i8042
>   8:          0          0          1          0   IO-APIC-edge      rtc0
>   9:         42         45         45         42   IO-APIC-fasteoi   acpi
>  16:          9          8          8          8   IO-APIC-fasteoi   ahci, ehci_hcd:usb1, can4, can5, can6, can7
>  17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, can10, can11, can9
>  18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, can13, can14, can15
>  19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19
> (..)
>
> Before having the patch, it lasted 1 minute to 1.5 hours (usually ~3
> minutes) until the irq was killed due to the spurious detection using Linux
> 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).
>
> I also tested the patch on various recent 3.13-rc5+ (non-rt) kernels for two
> weeks now without problems.
>
> If you want me to test an improved version (as Austin suggested below) please
> send a patch.
>
> Best regards,
> Oliver
>
> On 23.12.2013 20:25, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything happen with your patch to note_interrupt, originally
>> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>>
>> I am seeing an issue on a machine right now running a
>> config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
>> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
>> message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
>> was able to reproduce the issue only on a realtime kernel. A function
>> trace ending when the IRQ was disabled shows that note_interrupt is
>> being called regularly from the IRQ handler threads, and one of the
>> threads is doing work (and therefore calling note_interrupt with
>> IRQ_HANDLED).
>>
>> Oliver Hartkopp and I ran tests over the weekend on numerous machines
>> and verified that the patch that you proposed fixes the problem. We
>> think that the race condition that Till reported is causing the
>> problem here.
>>
>> In reply to the comment about using the upper bit of
>> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
>> that may still be an over-optimization, the code should still work.
>> All comparisons are done with the bit set, which just makes it a 31
>> bit counter. It will take 8 more days for the counter to overflow on
>> my machine, so I won't know for certain until then.
>>
>> My only concern is that there may still be a small race condition with
>> this new code. If the interrupt handler thread is running at a
>> realtime priority, but lower than another task, it may not get run
>> until a large number of IRQs get triggered, and then process them
>> quickly. With your new handler code, this would be counted as one
>> single handled interrupt. With the current constants, this is only a
>> problem if more than 1000 calls to the handler happen between IRQs. I
>> starved my card's irq threads by running 4 tasks at a higher realtime
>> priority than the handler threads, and saw the number of unhandled
>> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
>> in practice.
>>
>> Austin Schuh
>>
>> Tested-by: Austin Schuh <austin@peloton-tech.com>
>>

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 18:38   ` Austin Schuh
@ 2014-04-07 18:41     ` Thomas Gleixner
  2014-04-07 20:05       ` Austin Schuh
From: Thomas Gleixner @ 2014-04-07 18:41 UTC
  To: Austin Schuh
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

On Mon, 7 Apr 2014, Austin Schuh wrote:

> Hi Thomas,
>
> Did anything come of this patch? Both Oliver and I have found that it
> fixes real problems. I have multiple machines which have been running
> with the patch since December with no ill effects.

No, sorry. It fell through the cracks. Care to resend?

Thanks,

        tglx

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 18:41     ` Thomas Gleixner
@ 2014-04-07 20:05       ` Austin Schuh
  2014-04-07 20:07         ` Thomas Gleixner
From: Austin Schuh @ 2014-04-07 20:05 UTC
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>
>> Hi Thomas,
>>
>> Did anything come of this patch? Both Oliver and I have found that it
>> fixes real problems. I have multiple machines which have been running
>> with the patch since December with no ill effects.
>
> No, sorry. It fell through the cracks. Care to resend?
>
> Thanks,
>
>         tglx

You originally sent the patch out. I could send your patch out back
to you, but that feels a bit weird ;)

Austin

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 20:05       ` Austin Schuh
@ 2014-04-07 20:07         ` Thomas Gleixner
  2014-04-07 20:08           ` Austin Schuh
From: Thomas Gleixner @ 2014-04-07 20:07 UTC
  To: Austin Schuh
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

On Mon, 7 Apr 2014, Austin Schuh wrote:

> On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 7 Apr 2014, Austin Schuh wrote:
> >
> >> Hi Thomas,
> >>
> >> Did anything come of this patch? Both Oliver and I have found that it
> >> fixes real problems. I have multiple machines which have been running
> >> with the patch since December with no ill effects.
> >
> > No, sorry. It fell through the cracks. Care to resend?
> >
> > Thanks,
> >
> >         tglx
>
> You originally sent the patch out. I could send your patch out back
> to you, but that feels a bit weird ;)

Wheee. Let me dig in my archives ....

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 20:07         ` Thomas Gleixner
@ 2014-04-07 20:08           ` Austin Schuh
  2014-04-28 20:20             ` Austin Schuh
From: Austin Schuh @ 2014-04-07 20:08 UTC
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> You originally sent the patch out. I could send your patch out back
>> to you, but that feels a bit weird ;)
>
> Wheee. Let me dig in my archives ....

https://lkml.org/lkml/2013/3/7/222 in case that helps.

Austin

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-07 20:08           ` Austin Schuh
@ 2014-04-28 20:20             ` Austin Schuh
  2014-04-28 20:44               ` Thomas Gleixner
From: Austin Schuh @ 2014-04-28 20:20 UTC
  To: Thomas Gleixner
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Mon, 7 Apr 2014, Austin Schuh wrote:
>>> You originally sent the patch out. I could send your patch out back
>>> to you, but that feels a bit weird ;)
>>
>> Wheee. Let me dig in my archives ....
>
> https://lkml.org/lkml/2013/3/7/222 in case that helps.

Did you find the patch? I didn't see anything go by (but I'm not on
the main mailing list and didn't find anything with a quick Google
search.) It would be nice to not need to run a custom kernel to keep
my machine running. I have what is probably a year split between 2
machines of runtime with the patch applied, and I haven't seen any
problems with it.

Austin

* Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
  2014-04-28 20:20             ` Austin Schuh
@ 2014-04-28 20:44               ` Thomas Gleixner
From: Thomas Gleixner @ 2014-04-28 20:44 UTC
  To: Austin Schuh
  Cc: Oliver Hartkopp, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can, linux-kernel

On Mon, 28 Apr 2014, Austin Schuh wrote:

> On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh <austin@peloton-tech.com> wrote:
> > On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Mon, 7 Apr 2014, Austin Schuh wrote:
> >>> You originally sent the patch out. I could send your patch out back
> >>> to you, but that feels a bit weird ;)
> >>
> >> Wheee. Let me dig in my archives ....
> >
> > https://lkml.org/lkml/2013/3/7/222 in case that helps.
>
> Did you find the patch? I didn't see anything go by (but I'm not on
> the main mailing list and didn't find anything with a quick Google
> search.) It would be nice to not need to run a custom kernel to keep
> my machine running. I have what is probably a year split between 2
> machines of runtime with the patch applied, and I haven't seen any
> problems with it.

It's on my list, but Easter and traveling did not really help :)

There are a few issues with the patch I need to think through, but
I'll get to it in the next few days.