From: Wolfgang Grandegger <wg@grandegger.com>
To: Austin Schuh <austin@peloton-tech.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>, linux-can@vger.kernel.org
Subject: Re: sja1000 interrupt problem
Date: Tue, 12 Nov 2013 23:15:16 +0100 [thread overview]
Message-ID: <5282A874.5050809@grandegger.com> (raw)
In-Reply-To: <CANGgnMbb+VResUC6h+cK6Hfe5PLJx9R9ao6bMdJM2e5BPaDamw@mail.gmail.com>
Hi Austin,
On 11/12/2013 03:59 AM, Austin Schuh wrote:
> Hi Wolfgang,
>
> On Sat, Nov 9, 2013 at 11:42 AM, Wolfgang Grandegger <wg@grandegger.com> wrote:
>> Hi Austin,
>>
>> On 11/08/2013 12:43 AM, Austin Schuh wrote:
>>>>>
>>>>> The dump is attached.
>>>>
>>>> I do not see any sja1000_rx() calls. Either they never happen or more
>>>> likely the trace is not long enough. Could you try with a larger buffer
>>>> using "echo 20000 > buffer_size_kb"? I also do not see some pr_info()
>>>> related functions at the end of the trace. Are you sure is has stopped
>>>> (cat tracing_on or message in dmesg)?
>>>>
>>>> Also please do an "echo 0 > trace" to clear the trace content.
>>>>
>>>> Wolfgang.
>>>
>>> Hi Wolfgang,
>>>
>>> I'm pretty certain that the trace is long enough. I tried again with
>>> echo -e -n 100000 > buffer_size_kb and I still don't see any calls to
>>> sja1000_rx.
>>>
>>> I added some pr_info prints at the front of sja1000_rx and
>>> sja1000_interrupt. For each packet sent and then received, I see the
>>> following. The following lines are from me sending 4 packets.
>>
>> sja1000_interrupt() is normally called for each SJA1000 device (shared
>> interrupt).
>>
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136107] Got an sja1000 interrupt.
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136123] Unhandled IRQ 18... stop tracing...
>>
>> I'm confused. Why is the IRQ "unhandled" without calling
>> sja1000_interrupt() twice. Ah, this is due to threaded interrupt
>> handling, where the spurious interrupt check is called for each
>> handler/device. Therefore the trigger is simply bad.
>>
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136130] Got an sja1000 interrupt.
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136139] Received packet.
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136146] sja1000_rx
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136155] TX complete.
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136174] Returning IRQ_HANDLED
>>> Nov 7 15:35:52 vpc5 kernel: [ 75.136207] Returning IRQ_HANDLED
>> ...
>>
>> This is the working case. How does it look like when the device stops
>> receiving messages. You should also label the device.
>
> I haven't been able to run the code long enough with all the debug
> turned on because my syslog grows way too big. This only happens on
> the PC connected to the highly loaded CAN bus. I have another PC
> connected to a lightly loaded bus, and I don't see nearly as many
> problems with that PC. I just figured out how to not fill my syslog
> up, so I'll get back to you when I have something interesting to show.
>
>>> I didn't rerun any traces, since my analysis of the syslog is that it
>>> won't give you the information you are looking for without moving the
>>> tracing_off call somewhere else.
>>
>> Well, we know that messages are received properly for some time. We need
>> to trigger the malfunctioning. I understood that it happens just once a
>> day that the interrupts get stuck, right?
>>
>> Wolfgang.
>
> I now have a number of datapoints. These are the timestamps from when
> it disabled the IRQ. After each of these, I rebooted the machine to
> be safe, though it works to instead take the interface down and bring
> it back up again. I'm seeing anywhere from 13 hours - 2 hours.
>
> [47430.043798] Disabling IRQ #18
> [21439.136781] Disabling IRQ #18
> [5641.564900] Disabling IRQ #18
> [45129.357272] Disabling IRQ #18
Be aware that it prints these messages not before having detected more
than 99000 unhandled IRQs in sequence. See:
http://lxr.linux.no/#linux+v3.12/kernel/irq/spurious.c#L308
Either the SJA1000 is stuck or the interrupt does not go away. Maybe we
have a race with interrupt acknowledgement.
> I moved the trace disable to the end of sja1000_interrupt when I get
> an interrupt from can1, since that is the last event in the trace.
> Attached is the new trace. It should have more interesting
> information in it.
This does disable the trace after the first interrupt served for CAN1.
Or do I miss something? What do you want to trigger?
>>From the syslog:
> Nov 11 17:56:14 vpc5 kernel: [ 79.069060] Unhandled IRQ 18...
> irqs_unhandled: 1, irq_count: 27
> Nov 11 17:56:14 vpc5 kernel: [ 79.069097] peak_pci 0000:05:00.0
> can0: Got an sja1000 interrupt.
> Nov 11 17:56:14 vpc5 kernel: [ 79.069113] peak_pci 0000:05:00.0
> can1: Got an sja1000 interrupt.
> Nov 11 17:56:14 vpc5 kernel: [ 79.069132] peak_pci 0000:05:00.0
> can1: Received packet.
> Nov 11 17:56:14 vpc5 kernel: [ 79.069157] peak_pci 0000:05:00.0
> can0: TX complete.
> Nov 11 17:56:14 vpc5 kernel: [ 79.069161] peak_pci 0000:05:00.0
> can1: sja1000_rx
> Nov 11 17:56:14 vpc5 kernel: [ 79.069199] peak_pci 0000:05:00.0
> can0: Returning IRQ_HANDLED
> Nov 11 17:56:14 vpc5 kernel: [ 79.069242] peak_pci 0000:05:00.0
> can1: Returning IRQ_HANDLED
> Nov 11 17:56:14 vpc5 kernel: [ 79.069269] peak_pci 0000:05:00.0
> can1: Found can1, disabling tracing.
If you use FTRACE, please remove these prinkt's.
> Instead of sending packets between the CAN interfaces, I grabbed
> another CAN device and am using it to send packets on the bus. This
> is the configuration that I have been seeing the real failures in. I
> now have 2 machines with the CAN device attached to see if I can catch
> the kprintfs when it dies. Hopefully one of them will have died by
> tomorrow morning and I'll be able to find something interesting.
It's good to make the test case as simple as possible. If I understand
correctly, just CAN1 does then receive messages.
Concerning the trigger. Maybe it's already sufficient to stop the trace
after 10 (or even less) unhandled IRQ 18 by adding:
if (unlikely(desc->irqs_unhandled == 10)) {
tracing_off();
printk("%s: tracing stopped\n", __func__);
}
in note_interrupt here:
http://lxr.linux.no/#linux+v3.12/kernel/irq/spurious.c#L306.
Other things to check:
- do the following checks in sja1000_interrupt ever return IRQ_NONE?
/* check for absent controller due to hw unplug */
if (status == 0xFF && sja1000_is_absent(priv))
return IRQ_NONE;
- What is the value of the PITA_ICR register when the device is stuck
(IRQ 18 disabled)? See:
http://lxr.linux.no/#linux/drivers/net/can/sja1000/peak_pci.c#L539
Hope it helps to find the problem.
Wolfgang.
prev parent reply other threads:[~2013-11-12 22:15 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-08 0:47 sja1000 interrupt problem Austin Schuh
2013-10-08 6:32 ` Wolfgang Grandegger
2013-10-08 6:58 ` Oliver Hartkopp
2013-10-08 18:48 ` Austin Schuh
2013-10-08 19:44 ` Wolfgang Grandegger
2013-10-08 20:47 ` Austin Schuh
2013-10-09 6:21 ` Wolfgang Grandegger
2013-10-09 6:31 ` Wolfgang Grandegger
2013-10-09 6:47 ` Wolfgang Grandegger
[not found] ` <CANGgnMZpPGctUWGcg7Lp-QFPc7d6A5GeL9KQYnpeYMR8WukgdA@mail.gmail.com>
2013-11-07 8:15 ` Wolfgang Grandegger
2013-11-07 23:43 ` Austin Schuh
2013-11-09 14:21 ` Oliver Hartkopp
2013-11-12 2:59 ` Austin Schuh
2013-11-12 21:26 ` Oliver Hartkopp
2013-11-12 23:22 ` Austin Schuh
2013-11-13 3:41 ` Austin Schuh
2013-11-13 6:58 ` Oliver Hartkopp
2013-11-13 9:48 ` Kurt Van Dijck
2013-11-13 6:44 ` Oliver Hartkopp
2013-11-13 8:11 ` Wolfgang Grandegger
2013-11-13 9:08 ` Pavel Pisa
2013-11-13 9:52 ` Wolfgang Grandegger
2013-11-13 18:41 ` Oliver Hartkopp
2013-11-13 19:29 ` Wolfgang Grandegger
2013-11-13 22:00 ` Oliver Hartkopp
2013-11-13 11:02 ` Kurt Van Dijck
2013-11-16 21:42 ` Oliver Hartkopp
2013-11-17 8:18 ` Wolfgang Grandegger
2013-11-17 14:27 ` Oliver Hartkopp
2013-11-17 17:23 ` Wolfgang Grandegger
2013-11-17 20:46 ` Wolfgang Grandegger
2013-11-18 17:08 ` Austin Schuh
2013-12-09 21:54 ` Austin Schuh
2013-12-09 21:54 ` Austin Schuh
2013-12-10 7:49 ` Wolfgang Grandegger
2013-12-10 8:05 ` Austin Schuh
2013-12-10 9:32 ` Wolfgang Grandegger
2013-12-10 13:47 ` Oliver Hartkopp
2013-12-10 14:23 ` Oliver Hartkopp
2013-12-10 14:41 ` Wolfgang Grandegger
2013-12-10 16:05 ` Oliver Hartkopp
2013-12-10 21:12 ` Wolfgang Grandegger
2013-12-11 16:59 ` Oliver Hartkopp
2013-12-11 19:27 ` Wolfgang Grandegger
2013-12-12 6:13 ` Oliver Hartkopp
2013-12-12 17:38 ` Oliver Hartkopp
2013-12-12 22:56 ` Wolfgang Grandegger
2013-12-13 0:07 ` Austin Schuh
2013-12-13 16:16 ` Oliver Hartkopp
2013-12-13 9:38 ` Oliver Hartkopp
2013-12-13 10:04 ` Wolfgang Grandegger
2013-12-13 10:09 ` Wolfgang Grandegger
2013-12-13 16:25 ` Oliver Hartkopp
2013-12-13 17:33 ` Wolfgang Grandegger
2013-12-13 10:07 ` Marc Kleine-Budde
2013-12-13 16:22 ` Oliver Hartkopp
2013-12-13 17:14 ` Oliver Hartkopp
2013-12-13 21:14 ` Oliver Hartkopp
2013-12-14 9:51 ` Oliver Hartkopp
2013-12-20 23:13 ` Austin Schuh
2013-12-21 8:29 ` Wolfgang Grandegger
2013-12-21 13:12 ` Oliver Hartkopp
2013-12-21 12:55 ` Oliver Hartkopp
2013-12-23 15:58 ` Oliver Hartkopp
2013-11-09 19:42 ` Wolfgang Grandegger
[not found] ` <CANGgnMbb+VResUC6h+cK6Hfe5PLJx9R9ao6bMdJM2e5BPaDamw@mail.gmail.com>
2013-11-12 22:15 ` Wolfgang Grandegger [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5282A874.5050809@grandegger.com \
--to=wg@grandegger.com \
--cc=austin@peloton-tech.com \
--cc=linux-can@vger.kernel.org \
--cc=socketcan@hartkopp.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).