From mboxrd@z Thu Jan  1 00:00:00 1970
From: Tom Evans <tom_usenet@optusnet.com.au>
Subject: Re: [Rfi] Cyclone V CAN errors when application pinned to CPU1
Date: Sun, 7 Feb 2016 09:34:53 +1100
Message-ID: <56B6750D.4040602@optusnet.com.au>
References: <562155B7.7020504@vsis.cz> <20151020071807.GH20879@pengutronix.de>
 <5625EF45.2000807@pengutronix.de> <56B63491.9020500@vstk.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <linux-can-owner@vger.kernel.org>
Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:57180 "EHLO
	mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751986AbcBFWfF (ORCPT
	<rfc822;linux-can@vger.kernel.org>); Sat, 6 Feb 2016 17:35:05 -0500
In-Reply-To: <56B63491.9020500@vstk.cz>
Sender: linux-can-owner@vger.kernel.org
List-ID: <linux-can.vger.kernel.org>
To: Vlastimil Setka <setka@vstk.cz>, Marc Kleine-Budde <mkl@pengutronix.de>, Robert Schwebel <r.schwebel@pengutronix.de>
Cc: rfi@lists.rocketboards.org, linux-can <linux-can@vger.kernel.org>

On 7/02/2016 4:59 AM, Vlastimil Setka wrote:
 >>> We have a linux application which sends data
 >>> periodically (1 to 20 ms period) out over the
 >>> can0 socketcan interface. Sometimes the first
 >>> data byte in the CAN frame is zero on the wire,
 >>> but non-zero in the data sent!

 >> The TX functions is usually pretty straight forward.
 >> Copy all data bytes into the hardware, write ID and DLC,
 >> then hit the send bit (or whatever triggers the hardware
 >> to send the frame). Maybe there's some barrier
 >> missing in this sequence?

I'd suggest you run "objdump -S" on the CAN driver object file and check 
that the optimizer hasn't re-ordered the above sequence too much.

 > It can be reproducibly triggered by a high network load on
 > ethernet generated by iperf for example.

That generates a lot of interrupts, which are probably interrupting the 
above transmit sequence and delaying its completion. That leaves a window 
in which something else can get in. The most likely disturbing interrupt 
would be a CAN Receive or Transmit interrupt. Is the transmitter "one 
message at a time" in that hardware, is there a FIFO, or are there 
multiple transmit message buffers?

Do you have any other CAN traffic on the network that might be 
generating CAN Receive interrupts?

I'd suggest you add a "reentry counter" to the driver and test it on 
entry to various routines (transmit, receive, interrupt). Increment it 
on entry to the transmit routine and decrement it on exit. "printk" a 
warning when you see reentry and correlate that with the data corruption. 
Then narrow the increment and decrement to just around the transmit 
code that loads the hardware and see if you can zero in on the part of 
the code that can't handle the reentry.
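
A rough sketch of the counter I mean, written as userspace C11 so it can 
be tried standalone. All the names (tx_depth, tx_guard_enter and so on) 
are made up for illustration and are not from the real driver. In the 
driver itself you would use atomic_t with atomic_inc_return() and 
atomic_dec(), and printk instead of fprintf.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical names; nothing here is taken from the actual CAN driver. */
static atomic_int tx_depth;        /* contexts currently inside the guarded TX region */
static atomic_int reentry_events;  /* how many overlaps have been seen so far */

/* Call at the top of the guarded transmit region.
 * Returns 1 if somebody else was already inside, 0 otherwise. */
static int tx_guard_enter(const char *who)
{
    int prev = atomic_fetch_add(&tx_depth, 1);
    if (prev != 0) {
        atomic_fetch_add(&reentry_events, 1);
        /* printk() in the driver */
        fprintf(stderr, "reentry: %s entered at depth %d\n", who, prev);
        return 1;
    }
    return 0;
}

/* Call at the bottom of the guarded transmit region. */
static void tx_guard_exit(void)
{
    atomic_fetch_sub(&tx_depth, 1);
}

/* Call on entry to routines that only test the counter (receive,
 * interrupt). Returns 1 if the transmit region is currently active. */
static int tx_guard_check(const char *who)
{
    int depth = atomic_load(&tx_depth);
    if (depth != 0) {
        atomic_fetch_add(&reentry_events, 1);
        fprintf(stderr, "reentry: %s ran while TX depth was %d\n", who, depth);
        return 1;
    }
    return 0;
}
```

Start with tx_guard_enter()/tx_guard_exit() around the whole transmit 
routine, then move them inwards until the warnings line up with the 
corrupted frames.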

It is also possible your "periodic transmit task" is being delayed 
sufficiently that it sends two or more messages back-to-back. Transmit 
flow control might not be working properly. I'd suggest putting a 
sequence counter in the CAN message to see if any are getting dropped or 
duplicated. You could also put a truncated microsecond transmit timestamp 
(the low bits are enough) in there to detect two messages being sent 
close together or back-to-back.
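
Something like the following would do for the test payload. The layout 
is my assumption: an 8-byte frame, byte 0 left alone (that is the byte 
you see going to zero), bytes 4-5 a sequence counter and bytes 6-7 the 
low 16 bits of a microsecond clock. The function names are invented for 
this sketch.

```c
#include <stdint.h>

/* Hypothetical layout: data[0..3] stay with the application,
 * data[4..5] = sequence (little endian),
 * data[6..7] = low 16 bits of a microsecond timestamp (little endian). */
static void pack_debug_fields(uint8_t data[8], uint16_t seq, uint64_t now_us)
{
    uint16_t ts = (uint16_t)(now_us & 0xFFFFu);

    data[4] = (uint8_t)(seq & 0xFFu);
    data[5] = (uint8_t)(seq >> 8);
    data[6] = (uint8_t)(ts & 0xFFu);
    data[7] = (uint8_t)(ts >> 8);
}

static uint16_t unpack_seq(const uint8_t data[8])
{
    return (uint16_t)(data[4] | (data[5] << 8));
}

static uint16_t unpack_ts(const uint8_t data[8])
{
    return (uint16_t)(data[6] | (data[7] << 8));
}

enum seq_result { SEQ_OK, SEQ_DUPLICATE, SEQ_GAP };

/* Compare consecutive received sequence numbers; wrap-around is fine
 * because the subtraction is done modulo 65536. */
static enum seq_result check_seq(uint16_t prev, uint16_t cur)
{
    uint16_t delta = (uint16_t)(cur - prev);

    if (delta == 1)
        return SEQ_OK;
    if (delta == 0)
        return SEQ_DUPLICATE;
    return SEQ_GAP;
}

/* Microseconds between two truncated timestamps. Only meaningful while
 * the real gap is below 65.536 ms, which covers a 1 to 20 ms period. */
static uint16_t ts_gap_us(uint16_t prev, uint16_t cur)
{
    return (uint16_t)(cur - prev);
}
```

On the receiving side, a ts_gap_us() result far below the nominal period 
means two frames were queued back-to-back.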

 > As a next step, I plan to check data inside the driver
 > just before it writes into the hardware to verify if
 > the error is not in network stack above the driver.
 > Any other idea?

Can you read the data back from the hardware and verify it got written 
properly? Do this before initiating the transmit and after as well.
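
The comparison itself is trivial. A sketch, assuming the data registers 
can be read back as plain bytes (I have not checked that this is true 
for your controller), with a made-up helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Compare what the driver meant to write against what is read back.
 * Returns the index of the first differing byte, or -1 if all match.
 * "readback" is volatile because in a driver it would be device memory;
 * here it is just an array standing in for the data registers. */
static int first_mismatch(const uint8_t *expected,
                          const volatile uint8_t *readback,
                          size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (expected[i] != readback[i])
            return (int)i;
    }
    return -1;
}
```

Call it once right after loading the data and once after setting the 
transmit request, and log the index whenever it is not -1. A result of 0 
would match the "first byte is zero" symptom.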

Since you seem to be always sending the same data (so writing the same 
data into the registers) I'd suggest sending different data in alternate 
messages to see if there's any "stale data" being sent as well.
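
For example, alternate two non-zero patterns so that a stale byte and a 
zeroed byte can be told apart on the receiver. The pattern values and 
names below are arbitrary choices of mine:

```c
#include <stdint.h>

/* Two arbitrary, non-zero, mutually different test patterns. */
#define PATTERN_EVEN 0xA5u
#define PATTERN_ODD  0x5Au

/* Pattern the sender puts in the byte under test for message n. */
static uint8_t pattern_for(uint32_t msg_index)
{
    return (msg_index & 1u) ? PATTERN_ODD : PATTERN_EVEN;
}

enum byte_verdict { BYTE_OK, BYTE_STALE, BYTE_CORRUPT };

/* Classify the byte seen on the wire for message n:
 * OK      - the pattern for this message,
 * STALE   - the pattern from the previous message,
 * CORRUPT - anything else, including the zero byte being chased. */
static enum byte_verdict classify_byte(uint32_t msg_index, uint8_t seen)
{
    if (seen == pattern_for(msg_index))
        return BYTE_OK;
    if (seen == pattern_for(msg_index - 1u))
        return BYTE_STALE;
    return BYTE_CORRUPT;
}
```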

Tom