linux-can.vger.kernel.org archive mirror
From: Vlastimil Setka <setka@vstk.cz>
To: Tom Evans <tom_usenet@optusnet.com.au>,
	Marc Kleine-Budde <mkl@pengutronix.de>,
	Robert Schwebel <r.schwebel@pengutronix.de>
Cc: rfi@lists.rocketboards.org, linux-can <linux-can@vger.kernel.org>
Subject: Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
Date: Sun, 7 Feb 2016 00:56:15 +0100
Message-ID: <56B6881F.3010606@vstk.cz>
In-Reply-To: <56B6750D.4040602@optusnet.com.au>

6.2.2016 23:34 Tom Evans:
> On 7/02/2016 4:59 AM, Vlastimil Setka wrote:
> >>> We have a linux application which sends data
> >>> periodically (1 to 20 ms period) out over the
> >>> can0 socketcan interface. Sometimes the first
> >>> data byte in the CAN frame is zero on the wire,
> >>> but non-zero in the data sent!
> >> The TX function is usually pretty straightforward. Copy all data
> >> bytes into the hardware, write ID and DLC, then hit the send bit
> >> (or whatever triggers the hardware to send the frame). Maybe
> >> there's some barrier missing in this sequence?
> I'd suggest you "objdump -S" the CAN driver object file and check to see the optimizer hasn't re-ordered the above sequence too much.

I'm not very familiar with reading assembly, and the driver is somewhat complicated because this sequence is split across many functions.

The relevant source code probably starts in the c_can_start_xmit function: https://github.com/altera-opensource/linux-socfpga/blob/rel_socfpga-4.3_16.02.01_pr/drivers/net/can/c_can/c_can.c#L434

I uploaded objdump -S of my c_can.o here: https://gist.github.com/vstk/9c4307bb9ae0a6ae0208
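
For readers following along, the transmit sequence quoted above (copy the data bytes, write ID and DLC, then trigger the send) can be sketched in plain user-space C. This is only an illustration under stated assumptions: struct fake_msg_obj and fake_tx() are hypothetical names, not the c_can driver's code or register layout, and a C11 release store stands in for whatever ordering a real driver would need (in the kernel, something like wmb() before the trigger write).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one hardware message object.
 * This is NOT the c_can register layout. */
struct fake_msg_obj {
    uint8_t    data[8];
    uint32_t   id;
    uint8_t    dlc;
    atomic_int tx_request;   /* stands in for the "send" bit */
};

/* Fill the message object first, raise the send request last. */
static void fake_tx(struct fake_msg_obj *obj, uint32_t id,
                    const uint8_t *payload, uint8_t dlc)
{
    assert(dlc <= 8);

    memset(obj->data, 0, sizeof(obj->data));
    memcpy(obj->data, payload, dlc);
    obj->id = id;
    obj->dlc = dlc;

    /* Release ordering: the data, ID and DLC stores above cannot be
     * reordered after this store by the compiler or the CPU. */
    atomic_store_explicit(&obj->tx_request, 1, memory_order_release);
}
```

If the trigger store were an ordinary write with no ordering, the frame could in principle be requested before its first data bytes are in place, which is the kind of effect the barrier question is about.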

> > It can be reproducibly triggered by a high network load on
> > ethernet generated by iperf for example.
>
> Which generates a lot of interrupts. Which are probably interrupting the above transmit sequence and delaying its completion. During which time something else can get in. The most likely disturbing interrupt would be a CAN Receive or Transmit interrupt. Is the transmitter "one message at a time" in that hardware, is there a FIFO or are there multiple transmit message buffers?

As far as I know, the CAN controller has a pool of buffers, called message objects in CAN terminology, and these are somehow allocated to RX/TX messages.

> Do you have any other CAN traffic on the network that might be generating CAN Receive interrupts?

No, the only traffic on the network is generated by the test program.

> I'd suggest you add a "reentry counter" to the driver and test it on entry to various routines (transmit, receive, interrupt). Increment it on entry to the transmit routine and decrement on exit. "printk" a warning when you see "reentry" and correlate with the data corruption. Reduce where you increment and decrement to just around the transmit code that loads the hardware and see if you can zero in on the part of the code that can't handle the reentry.
>
> It is also possible your "periodic transmit task" is being delayed sufficiently that it sends two or more messages back-to-back. Transmit flow control might not be working properly. I'd suggest putting a sequence counter in the CAN message to see if any are getting dropped or duplicated. You could also try a partial microsecond transmit timestamp in there to detect two messages being sent close together or back-to-back.

I also suspect an interrupt-related problem. That would explain why the problem does not occur on an RT-patched kernel when the test program runs at high priority, higher than the kernel interrupt threads. Interrupts are handled using NAPI, so there should be no re-entry. NAPI may be an important part of the problem, though, because the CAN NAPI handler could be significantly delayed by ethernet traffic, whose driver also uses the NAPI infrastructure.

The problem was initially seen in an application where CAN message timing was monitored by the receiving system, and the jitter was under 150 us on a 1000 us period, so it does not seem to be triggered by a delayed TX task or back-to-back transmission. There were no missing or duplicated messages, only malformed data in a single message among many correct surrounding messages.
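
Even if NAPI should rule out re-entry, the reentry counter suggested above is cheap to try. A minimal user-space sketch of the idea, assuming C11 atomics; guard_enter(), guard_exit() and both counters are hypothetical names, and in the driver the counter would more likely be an atomic_t with a printk at the point marked in the comment:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int tx_depth;       /* callers currently inside the guarded region */
static atomic_int reentry_count;  /* times a second caller was seen inside */

/* Call on entry to the guarded code.
 * Returns 1 if another caller was already inside, 0 otherwise. */
static int guard_enter(void)
{
    int prev = atomic_fetch_add(&tx_depth, 1);

    if (prev > 0) {
        /* A driver would log a warning here and note the time,
         * so it can be correlated with corrupted frames. */
        atomic_fetch_add(&reentry_count, 1);
        return 1;
    }
    return 0;
}

/* Call on every exit path of the guarded code. */
static void guard_exit(void)
{
    int prev = atomic_fetch_sub(&tx_depth, 1);

    assert(prev > 0);   /* exit without a matching enter */
    (void)prev;
}
```

Narrowing the enter/exit pair to just the code that loads the message object would show whether that part is ever entered twice at once.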

> > As a next step, I plan to check the data inside the driver just before it is written to the hardware, to verify that the error is not in the network stack above the driver. Any other ideas?
> Can you read the data back from the hardware and verify it got written properly? Do this before initiating the transmit and after as well.
>
> Since you seem to be always sending the same data (so writing the same data into the registers) I'd suggest sending different data in alternate messages to see if there's any "stale data" being sent as well.

My initial suspicion was that stale data from a previous transmission was being sent. That is why my test program always sends the same data, so the error can be easily detected with candump+grep.
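
The read-back check suggested above could be sketched as follows. This is an illustration only: it assumes the data bytes are packed two per 16-bit register, low byte first, which is an assumption made for the example and not taken from the c_can code; pack_words() and first_mismatch() are hypothetical helpers, and a plain array stands in for the hardware registers.

```c
#include <assert.h>
#include <stdint.h>

/* Pack up to 8 data bytes into four 16-bit words, low byte first.
 * The layout is assumed for illustration. */
static void pack_words(uint16_t regs[4], const uint8_t *data, uint8_t dlc)
{
    assert(dlc <= 8);

    for (int i = 0; i < 4; i++)
        regs[i] = 0;
    for (int i = 0; i < dlc; i++)
        regs[i / 2] |= (uint16_t)(data[i] << (8 * (i % 2)));
}

/* Compare what was read back with what was meant to be written.
 * Returns the index of the first differing byte, or -1 if all match. */
static int first_mismatch(const uint16_t regs[4], const uint8_t *data,
                          uint8_t dlc)
{
    assert(dlc <= 8);

    for (int i = 0; i < dlc; i++) {
        uint8_t got = (uint8_t)(regs[i / 2] >> (8 * (i % 2)));

        if (got != data[i])
            return i;
    }
    return -1;
}
```

Running the comparison once before and once after the transmit request would separate "never written correctly" from "overwritten later".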

Thanks for the idea. I will run further tests with variable data containing counters, and I will monitor the data on the wire with a logic analyser to also see the precise timing.
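
A possible payload layout for those tests, as a rough sketch: a fixed non-zero marker in byte 0 (so a zeroed first byte is still detectable), a 16-bit sequence counter, a 32-bit microsecond timestamp, and a second marker in byte 7. The layout, the marker values and all function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MARK_FIRST 0xA5u   /* never zero, so a zeroed byte 0 stands out */
#define MARK_LAST  0x5Au

/* Byte 0: marker, bytes 1-2: sequence (LE), bytes 3-6: time in us (LE),
 * byte 7: marker. */
static void pack_payload(uint8_t *out, uint16_t seq, uint32_t t_us)
{
    assert(out != 0);

    out[0] = MARK_FIRST;
    out[1] = (uint8_t)(seq & 0xff);
    out[2] = (uint8_t)(seq >> 8);
    out[3] = (uint8_t)(t_us & 0xff);
    out[4] = (uint8_t)((t_us >> 8) & 0xff);
    out[5] = (uint8_t)((t_us >> 16) & 0xff);
    out[6] = (uint8_t)(t_us >> 24);
    out[7] = MARK_LAST;
}

static int markers_ok(const uint8_t *in)
{
    return in[0] == MARK_FIRST && in[7] == MARK_LAST;
}

static uint16_t unpack_seq(const uint8_t *in)
{
    return (uint16_t)(in[1] | (in[2] << 8));
}

static uint32_t unpack_time_us(const uint8_t *in)
{
    return (uint32_t)in[3] | ((uint32_t)in[4] << 8) |
           ((uint32_t)in[5] << 16) | ((uint32_t)in[6] << 24);
}

enum seq_status { SEQ_OK, SEQ_DUPLICATE, SEQ_GAP };

/* Classify two consecutive received counters; wraps at 16 bits. */
static enum seq_status check_seq(uint16_t prev, uint16_t cur)
{
    uint16_t diff = (uint16_t)(cur - prev);

    if (diff == 1)
        return SEQ_OK;
    if (diff == 0)
        return SEQ_DUPLICATE;
    return SEQ_GAP;
}
```

On the receiving side, a failed markers_ok() flags a corrupted frame, check_seq() flags drops or duplicates, and the difference between consecutive timestamps shows frames sent back-to-back.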

Vlastimil Setka


Thread overview: 7+ messages
     [not found] <562155B7.7020504@vsis.cz>
2015-10-20  7:18 ` [Rfi] Cyclone V CAN errors when application pinned to CPU1 Robert Schwebel
2015-10-20  7:37   ` Marc Kleine-Budde
2016-02-06 17:59     ` Vlastimil Setka
2016-02-06 22:34       ` Tom Evans
2016-02-06 23:56         ` Vlastimil Setka [this message]
2016-02-07  0:54           ` Tom Evans
2016-02-07 22:19           ` Tom Evans
