From mboxrd@z Thu Jan 1 00:00:00 1970
From: Oliver Hartkopp
Subject: Re: sja1000 interrupt problem
Date: Wed, 13 Nov 2013 07:58:59 +0100
Message-ID: <52832333.9080908@hartkopp.net>
References: <3a4a0c6ac898fbe27a8fe95cb147634c@grandegger.com> <99984642-b542-4078-a5ba-3dfb66188ce5@email.android.com> <5254608B.4080208@grandegger.com> <84ba410d04a85a783d1c1994f98d1f31@grandegger.com> <527E44EB.9020605@hartkopp.net> <52829CF1.4010902@hartkopp.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mo-p00-ob.rzone.de ([81.169.146.161]:60496 "EHLO mo-p00-ob.rzone.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751655Ab3KMG7B (ORCPT ); Wed, 13 Nov 2013 01:59:01 -0500
In-Reply-To:
Sender: linux-can-owner@vger.kernel.org
List-ID:
To: Austin Schuh
Cc: Wolfgang Grandegger, linux-can@vger.kernel.org

Hi Austin,

sorry for checking my mails in sequential order :-)
I would have been able to shorten the last mail.

Thanks for your interesting investigation.

I wonder why this problem did not show up before then.
Having shared interrupts should be a usual thing.
This kind of race condition should not be there at all.

Do you have a second peak_pci hardware?

It could be an idea to try to split the IRQs in a way that you have two IRQs
for two cards - and then connect can0 to can2.

You would have a pretty fast following RX/TX interrupt but without interrupt
sharing ...

Best regards,
Oliver

On 13.11.2013 04:41, Austin Schuh wrote:
> On Tue, Nov 12, 2013 at 3:22 PM, Austin Schuh wrote:
>> On Tue, Nov 12, 2013 at 1:26 PM, Oliver Hartkopp wrote:
>>> On 12.11.2013 03:59, Austin Schuh wrote:
>>>
>>>>> From the trace it is pretty hard to know which CAN interface is in charge.
>>>>> (2) Can you please add the output of dev->ifindex in the pr_info() calls?
>>>>
>>>> Gladly. See the updated logs.
>>>>
>>>> [ 556.019246] peak_pci 0000:05:00.0 can1: Got an sja1000 interrupt.
>>>> [ 556.019268] Unhandled IRQ 18...
stop tracing...
>>>> [ 556.019280] peak_pci 0000:05:00.0 can0: Got an sja1000 interrupt.
>>>> [ 556.019289] peak_pci 0000:05:00.0 can1: Received packet.
>>>> [ 556.019299] peak_pci 0000:05:00.0 can1: sja1000_rx
>>>> [ 556.019307] peak_pci 0000:05:00.0 can0: TX complete.
>>>> [ 556.019318] peak_pci 0000:05:00.0 can0: Returning IRQ_HANDLED
>>>> [ 556.019362] peak_pci 0000:05:00.0 can1: Returning IRQ_HANDLED
>>>>
>>>
>>> This looks pretty broken regarding the IRQ handling.
>>> Maybe the IRQ thread handling has a real problem in the -rt kernel ?!?
>>
>> Sounds pretty plausible right now.
>
> Ok, I spent a good chunk of today reading the IRQ handling code in the
> kernel, and I think I get what is happening and have a plausible
> explanation for why the interrupt is getting disabled. Not sure how
> to test it.
>
> Here is what it looks like is happening. The hardware triggers an
> interrupt. The handler is called, and then the registered action for
> each of the devices is to notify their threads that an IRQ occurred,
> and to have them handle it. Each of the handling threads then calls
> the sja1000_interrupt function, or the equivalent ata_generic
> interrupt function. 2 of the 3 interrupt functions then return
> IRQ_NONE, and one of them returns IRQ_HANDLED. note_interrupt is then
> called in each of the threads (instead of being called once in the
> non-rt case), resulting in 2 unhandled calls, and 1 handled call. So
> far, so good. The kernel operates as expected, since less than 99.9 %
> of the interrupts are unhandled. (There is a note_interrupt call in the
> handler, but since the threaded handlers are notified, this doesn't
> get counted.)
>
> Since the IRQ handlers are now all in threads, if the thread that
> actually receives data doesn't process the interrupts either because
> something goes wrong, or because it doesn't get scheduled, there will
> be a bunch of unhandled interrupts noted, and no handled interrupts
> noted. This will cause the IRQ to be disabled.
>
> I guess the next interesting thing to do is to trigger when it
> disables the IRQ and take a look at what is happening. I have a test
> running on one machine with tracing enabled which will disable tracing
> when the IRQ is disabled. That should provide some interesting
> results. I think I also know how to bypass it for now by setting
> "noirqdebug", but I'd like to fix it for real as well.
>
> Austin
>
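
The accounting Austin describes can be illustrated with a small user-space
model. This is only a sketch, not kernel code: the IrqLine class and the
simulate() helper are hypothetical names, and the window of 100000 notes with
a cutoff of 99900 unhandled is an assumption chosen to match the 99.9 %
figure mentioned in the quoted text. Any extra logic in the real
note_interrupt() is left out.

```python
from dataclasses import dataclass

WINDOW = 100_000        # assumed number of notes per accounting window
MAX_UNHANDLED = 99_900  # assumed cutoff, i.e. the 99.9 % from the mail


@dataclass
class IrqLine:
    """Toy model of the per-IRQ spurious-interrupt counters."""
    irq_count: int = 0
    irqs_unhandled: int = 0
    disabled: bool = False

    def note(self, handled: bool) -> None:
        """Stand-in for note_interrupt(): record one handler result."""
        if self.disabled:
            return
        if not handled:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < WINDOW:
            return
        # End of window: disable the line if almost nothing was handled.
        if self.irqs_unhandled > MAX_UNHANDLED:
            self.disabled = True
        self.irq_count = 0
        self.irqs_unhandled = 0


def simulate(hw_interrupts: int, handler_thread_runs: bool) -> bool:
    """Return True if the shared line ends up disabled."""
    line = IrqLine()
    for _ in range(hw_interrupts):
        # Two of the three sharers find nothing to do (IRQ_NONE).
        line.note(handled=False)
        line.note(handled=False)
        # The thread owning the event only reports if it gets to run.
        if handler_thread_runs:
            line.note(handled=True)
    return line.disabled
```

With all three threads running, about two thirds of the notes are unhandled
and the line stays enabled. If the thread that would return IRQ_HANDLED never
gets scheduled, every note in the window is unhandled and the model disables
the line, which is the failure mode suspected above.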