From: Oliver Hartkopp
Subject: Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
Date: Mon, 06 Jan 2014 14:32:15 +0100
Message-ID: <52CAB05F.4010303@hartkopp.net>
To: Thomas Gleixner
Cc: Austin Schuh, Wolfgang Grandegger, Pavel Pisa, Marc Kleine-Budde, linux-can@vger.kernel.org

Hi Thomas,

I just wanted to add my

Tested-by: Oliver Hartkopp

In my setup (Core i7, 20 SJA1000 CAN buses on PCIe), the problem disappeared with the discussed patch on the -rt kernel. The system ran at full CAN bus load over the weekend, more than 72 hours of operation, without problems:

           CPU0       CPU1       CPU2       CPU3
  0:         40          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          0   IO-APIC-edge      i8042
  8:          0          0          1          0   IO-APIC-edge      rtc0
  9:         42         45         45         42   IO-APIC-fasteoi   acpi
 16:          9          8          8          8   IO-APIC-fasteoi   ahci, ehci_hcd:usb1, can4, can5, can6, can7
 17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, can10, can11, can9
 18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, can13, can14, can15
 19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19
(..)

Before applying the patch, it took between 1 minute and 1.5 hours (usually ~3 minutes) until the IRQ was disabled by the spurious interrupt detection, using Linux 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).

I have also been testing the patch on various recent 3.13-rc5+ (non-rt) kernels for two weeks now without problems.

If you want me to test an improved version (as Austin suggested below), please send a patch.
Best regards,
Oliver

On 23.12.2013 20:25, Austin Schuh wrote:
> Hi Thomas,
>
> Did anything happen with your patch to note_interrupt, originally
> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>
> I am seeing an issue on a machine right now running a
> config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
> message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
> was able to reproduce the issue only on a realtime kernel. A function
> trace ending when the IRQ was disabled shows that note_interrupt is
> being called regularly from the IRQ handler threads, and one of the
> threads is doing work (and therefore calling note_interrupt with
> IRQ_HANDLED).
>
> Oliver Hartkopp and I ran tests over the weekend on numerous machines
> and verified that the patch that you proposed fixes the problem. We
> think that the race condition that Till reported is causing the
> problem here.
>
> In reply to the comment about using the upper bit of
> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
> that may still be an over-optimization, the code should still work.
> All comparisons are done with the bit set, which just makes it a 31
> bit counter. It will take 8 more days for the counter to overflow on
> my machine, so I won't know for certain until then.
>
> My only concern is that there may still be a small race condition with
> this new code. If the interrupt handler thread is running at a
> realtime priority, but lower than another task, it may not get run
> until a large number of IRQs get triggered, and then process them
> quickly. With your new handler code, this would be counted as one
> single handled interrupt. With the current constants, this is only a
> problem if more than 1000 calls to the handler happen between IRQs. I
> starved my card's irq threads by running 4 tasks at a higher realtime
> priority than the handler threads, and saw the number of unhandled
> IRQs jump from 1/100000 to 3/100000, so that problem may not show up
> in practice.
>
> Austin Schuh
>
> Tested-by: Austin Schuh
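The deferred accounting discussed in the quoted message can be illustrated with a small user-space C model. This is a sketch under stated assumptions, not the patch itself: SPURIOUS_DEFERRED and threads_handled_last are the names used in the thread, while threads_handled, struct irq_model and the model_* helpers are hypothetical. The detector limits (disable the line when more than 99,900 of 100,000 interrupts were unhandled) are assumed to match kernel/irq/spurious.c. The model shows why keeping the flag set on both sides of the comparison behaves like a 31-bit counter, and why a thread starved long enough to absorb a large batch of wakeups in a single run can still trip the detector.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of deferred spurious-interrupt accounting.
 * SPURIOUS_DEFERRED and threads_handled_last are named in the thread;
 * everything else here is an illustrative assumption, not kernel API.
 */
#define SPURIOUS_DEFERRED     0x80000000u
#define MODEL_IRQ_WINDOW      100000u /* assumed detector window */
#define MODEL_UNHANDLED_LIMIT 99900u  /* assumed "disable" threshold */

enum verdict { VERDICT_DEFERRED, VERDICT_HANDLED, VERDICT_UNHANDLED };

struct irq_model {
	uint32_t threads_handled;      /* bumped by the irq thread per handled run */
	uint32_t threads_handled_last; /* last snapshot; upper bit = deferred flag */
};

/* Irq thread side: one run of the thread that returned "handled". */
static void model_thread_handled(struct irq_model *m)
{
	m->threads_handled++; /* unsigned, so wraparound is well defined */
}

/* Hard-irq side: called each time the primary handler wakes the thread. */
static enum verdict model_note_wakeup(struct irq_model *m)
{
	uint32_t handled;

	/* First wakeup: nothing to compare against yet, only arm the check. */
	if (!(m->threads_handled_last & SPURIOUS_DEFERRED)) {
		m->threads_handled_last |= SPURIOUS_DEFERRED;
		return VERDICT_DEFERRED;
	}

	/* Both sides carry the flag, so only the low 31 bits decide. */
	handled = m->threads_handled | SPURIOUS_DEFERRED;
	if (handled != m->threads_handled_last) {
		m->threads_handled_last = handled;
		return VERDICT_HANDLED; /* thread made progress since last wakeup */
	}
	return VERDICT_UNHANDLED; /* no thread progress: counted as spurious */
}

/*
 * Starvation scenario: `batch` wakeups arrive before the thread gets to run,
 * then one thread run drains them all. Returns 1 if, within one window of
 * MODEL_IRQ_WINDOW accounted wakeups, the unhandled count exceeds the limit.
 */
static int model_detector_fires(unsigned batch)
{
	struct irq_model m = { 0, 0 };
	unsigned count = 0, unhandled = 0;

	assert(batch > 0); /* batch == 0 would never make progress */
	while (count < MODEL_IRQ_WINDOW) {
		for (unsigned i = 0; i < batch && count < MODEL_IRQ_WINDOW; i++) {
			enum verdict v = model_note_wakeup(&m);

			if (v == VERDICT_DEFERRED)
				continue; /* in this model the arming wakeup is not counted */
			count++;
			if (v == VERDICT_UNHANDLED)
				unhandled++;
		}
		model_thread_handled(&m); /* one run covers the whole batch */
	}
	return unhandled > MODEL_UNHANDLED_LIMIT;
}
```

In this model, batches of 10 wakeups per thread run stay far below the limit, while batches of 5000 exceed 99,900 unhandled per 100,000, consistent with the roughly 1000-call break-even Austin describes. The model ignores timing entirely, so it says nothing about how often real hardware hits this case.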