From: Tom Evans <tom_usenet@optusnet.com.au>
To: Holger Schurig <holgerschurig@gmail.com>,
	Marc Kleine-Budde <mkl@pengutronix.de>
Cc: linux-can@vger.kernel.org
Subject: Re: CAN question: how to trace frame errors?
Date: Fri, 12 Jun 2015 22:33:17 +1000
Message-ID: <557AD18D.8010807@optusnet.com.au>
In-Reply-To: <CAOpc7mH-xD+eesJNjhpFwUSeMwGCdsRK8Q9AZSjmE2uw=cBgsw@mail.gmail.com>

On 12/06/2015 8:01 PM, Holger Schurig wrote:
>> These are probably RX-FIFO overrun errors:
>
> You're right.
>
> 13:03:16 kernel: ##HS flexcan_irq FIFO overrun
>
> That really puzzles me. The errors occurred originally when someone
> used a Kvaser to generate 80% load at 500 kbit/s. In my tests, I used 1
> Mbit/s and cangen (no Kvaser here). The i.MX6 was otherwise idle:

Idle has nothing to do with it. The problem is the latency.

Let's do the maths.

The worst case CAN message arrival rate is with standard ID CAN messages
with zero bytes of data. These are about 50 bits long. So at 500 kbit/s
they arrive at 10kHz, or one every 100us. FlexCAN has a six message FIFO.
So it can receive six without overflowing, but the seventh will overflow
it. So it has to (start to) be unloaded within 700us.

Since the CPU is running at 800MHz, that's only 560,000 instructions to 
respond to an interrupt. And it isn't managing it. That's what "Not Real 
Time" means - half a million instructions isn't enough time.
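
As a rough check of that arithmetic, here is a minimal Python sketch. The
function name rx_budget and its parameters are made up for illustration.
The 50-bit frame length, six message FIFO and 800MHz clock come from the
text above, and like the text it counts one instruction per clock cycle,
which is only an approximation.

```python
def rx_budget(bitrate, frame_bits=50, fifo_depth=6, cpu_hz=800_000_000):
    """Return (frame time in us, overflow deadline in us, CPU cycles available).

    Integer arithmetic is used throughout so the results are exact.
    """
    # Time on the wire for one minimum-length frame.
    frame_time_us = frame_bits * 1_000_000 // bitrate
    # The FIFO holds fifo_depth frames; the next one overflows it.
    deadline_us = (fifo_depth + 1) * frame_time_us
    # Clock cycles that elapse before that deadline.
    cycles = deadline_us * cpu_hz // 1_000_000
    return frame_time_us, deadline_us, cycles


for bitrate in (500_000, 1_000_000):
    print(bitrate, rx_budget(bitrate))
# 500000 (100, 700, 560000)
# 1000000 (50, 350, 280000)
```

At 1 Mbit/s the same reasoning halves the budget to 350us.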

Except it isn't "respond to an interrupt", as the FlexCAN driver doesn't
receive the 8-byte messages during the interrupt. It schedules a NAPI
service routine to read the data, and that routine can easily be delayed
that long waiting for a slot.

Do you have a frame buffer? Is it write-through or cache-flushed? I've
read that flushing a frame-buffer-sized chunk of memory can take the
L2 cache a very long time to complete. Think MILLIseconds. That locks up
ALL CORES unless the other ones are lucky enough to stay inside both of
their L1 caches.

We couldn't tolerate CAN losing data (and we run at 1 Mbit/s), so I
rewrote the 3.4 FlexCAN driver to unload the FIFO into a ring buffer
during interrupts and to have the NAPI routine unload that. No problems
since then.
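
To illustrate why the extra ring buffer helps, here is a toy Python model.
It is not the rewritten driver, which isn't shown in this thread. The
function name lost_frames, the 64-entry ring size and the 1000us NAPI
delay are assumptions chosen for the example, and the interrupt handler is
assumed to run with no delay at all.

```python
from collections import deque


def lost_frames(n_frames, frame_us, napi_delay_us, use_ring,
                fifo_depth=6, ring_size=64):
    """Count frames dropped when the deferred poll runs napi_delay_us late."""
    fifo = deque()   # hardware RX FIFO
    ring = deque()   # software ring buffer filled by the interrupt handler
    lost = 0
    poll_at = None   # time at which the pending deferred poll will run

    for k in range(1, n_frames + 1):
        now = k * frame_us
        # The deferred poll, once due, empties the queue it services.
        if poll_at is not None and now >= poll_at:
            (ring if use_ring else fifo).clear()
            poll_at = None
        # A new frame arrives; a full hardware FIFO means an overrun.
        if len(fifo) >= fifo_depth:
            lost += 1
            continue
        fifo.append(k)
        if use_ring:
            # The interrupt handler moves frames to the ring straight away.
            while fifo:
                frame = fifo.popleft()
                if len(ring) < ring_size:
                    ring.append(frame)
                else:
                    lost += 1
        # Schedule the deferred poll if none is pending.
        if poll_at is None:
            poll_at = now + napi_delay_us
    return lost


print(lost_frames(100, 100, 1000, use_ring=False))  # 40
print(lost_frames(100, 100, 1000, use_ring=True))   # 0
```

In this model a 1000us delay costs 4 frames out of every 10 with the FIFO
alone, while the ring never holds more than 10 frames and loses none.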

 > kernel is 3.18.14

Then you SHOULD be better off than we were, running 3.4. In that version
the FlexCAN driver used NAPI (and always has) while the Ethernet driver
didn't. Instead it would happily try to unload 100 Ethernet packets all
the way to the network layer in the interrupt routine, blocking the
FlexCAN interrupts and NAPI run.

So check your Ethernet driver and see if it uses NAPI and if there's any 
work-limiting in the interrupt routine.

Flood ping (ping -f -l 20) one of them and see if that makes the 
overruns worse during your CAN test.

Run cangen with an 8-byte CAN packet and see if the lower arrival rate 
fixes it. Change the message length and see if you can use that to 
"measure the latency".

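One way to turn the message length into a latency estimate is sketched
below in Python. It assumes a standard-ID data frame is at least 47 + 8n
bits on the wire for n data bytes (44 frame bits plus 3 bits of interframe
space, no stuff bits), which matches the "about 50 bits" figure for n = 0.
The function names are made up for illustration.

```python
def min_frame_bits(data_bytes):
    """Shortest standard-ID data frame, including interframe space."""
    return 47 + 8 * data_bytes


def overflow_deadline_us(data_bytes, bitrate, fifo_depth=6):
    """Time until a back-to-back stream overflows an untouched FIFO."""
    return (fifo_depth + 1) * min_frame_bits(data_bytes) * 1_000_000 // bitrate


for n in range(9):
    print(n, overflow_deadline_us(n, 1_000_000))
# first line: 0 329
# last line:  8 777
```

If the overruns stop at n data bytes but still happen at n - 1, then under
this model the unload latency lies somewhere between the two deadlines.
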
Then see what other interrupts you're getting. Try:

     while true; do cat /proc/interrupts; sleep 1; done

See which interrupt counts (apart from FlexCAN) are increasing quickly.
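
Comparing successive dumps by eye gets tedious, so here is a small Python
sketch that prints only the counters that moved. It assumes the usual
/proc/interrupts layout (a header row of CPU names, then one row per
interrupt with a count per CPU). The function names are made up for
illustration.

```python
import time


def parse_interrupts(text):
    """Map each interrupt name to (count summed over CPUs, description)."""
    lines = text.splitlines()
    n_cpus = len(lines[0].split())       # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:n_cpus]:    # per-CPU counts come first
            if not field.isdigit():
                break
            counts.append(int(field))
        label = " ".join(fields[len(counts):])
        totals[name.strip()] = (sum(counts), label)
    return totals


def fastest_growing(before, after, top=5):
    """Return (delta, name, description) tuples, largest delta first."""
    deltas = []
    for name, (count, label) in after.items():
        old = before.get(name, (0, ""))[0]
        if count > old:
            deltas.append((count - old, name, label))
    deltas.sort(reverse=True)
    return deltas[:top]


def watch(interval_s=1.0, path="/proc/interrupts"):
    """Print the fastest-growing interrupt counters once per interval."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    while True:
        time.sleep(interval_s)
        with open(path) as f:
            after = parse_interrupts(f.read())
        for delta, name, label in fastest_growing(before, after):
            print(f"{delta:8d}  {name:>4}  {label}")
        print()
        before = after
```

Call watch() on the target while the CAN test is running and stop it with
Ctrl-C.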

Then you'll want to trace the kernel with ftrace (CONFIG_PERF_EVENTS).
The kernel tracing is very good and very powerful. Learning how to run
it is a skill well worth having. You want to find where the kernel is
spending its time between the interrupt and the NAPI run.

In our case we found our kernel was spending the majority of its time in 
mutex, spinlock and slub debugging. It got six times faster when I got 
rid of those.

Tom

