Bad reading from mcp2515 with j1939

* Bad reading from mcp2515 with j1939
@ 2016-04-15  9:54 Julien Pilet
  2016-04-15 10:19 ` Kurt Van Dijck
  2016-04-15 13:07 ` Ramesh Shanmugasundaram
  0 siblings, 2 replies; 9+ messages in thread
From: Julien Pilet @ 2016-04-15  9:54 UTC (permalink / raw)
  To: linux-can@vger.kernel.org

Hi,

I am using an MCP2515 for a nautical application. In my lab, everything works fine. On two boats, though, I get strange readings from can0.

I connected the MCP2515 to a sensor reporting a boat speed of 0 m/s.
  can0  09F50323   [8]  79 00 00 FF FF 00 FF FF

The first data byte is the sequence ID; it is incremented in each packet.
The two following bytes encode the boat speed.

candump gives me the following:

  can0  09F50323   [8]  7A 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  7B 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  7C 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  7D 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  7E 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  7F 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  80 80 00 FF FF 00 FF FF
  can0  09F50323   [8]  81 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  82 80 00 FF FF 00 FF FF
  can0  09F50323   [8]  83 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  84 80 00 FF FF 00 FF FF
  can0  09F50323   [8]  85 00 00 FF FF 00 FF FF
  can0  09F50323   [8]  86 80 00 FF FF 00 FF FF

Suddenly, when the sequence ID reaches 0x80, the speed is no longer 0: it is 0x0080 (encoded little endian as 80 00). This is clearly wrong.
I observed similar problems with sensors from different manufacturers, so I do not think the sensor is faulty.
The pattern repeats: for sequence IDs between 0x00 and 0x7F, everything works fine. For sequence IDs 0x80 to 0xFA (the last sequence ID sent), every packet with an even sequence ID has a spurious bit set in the 2nd byte.
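
To make the byte layout described above concrete, here is a minimal Python sketch that parses a candump line and extracts the sequence ID and the raw speed field. The function names are hypothetical, and the layout (byte 0 = sequence ID, bytes 1-2 = speed, little endian) is only what is stated in this message, not a full decoder for the protocol.

```python
def parse_candump_line(line):
    """Split a candump line such as
    '  can0  09F50323   [8]  80 80 00 FF FF 00 FF FF'
    into (interface, can_id, payload bytes)."""
    fields = line.split()
    iface = fields[0]
    can_id = int(fields[1], 16)
    length = int(fields[2].strip("[]"))
    data = bytes(int(b, 16) for b in fields[3:3 + length])
    return iface, can_id, data


def decode_speed_frame(data):
    """Return (sequence_id, raw_speed) for the layout described in the post:
    byte 0 is the sequence ID, bytes 1-2 are the speed, little endian."""
    seq = data[0]
    raw_speed = int.from_bytes(data[1:3], "little")
    return seq, raw_speed


good = parse_candump_line("  can0  09F50323   [8]  7F 00 00 FF FF 00 FF FF")[2]
bad = parse_candump_line("  can0  09F50323   [8]  80 80 00 FF FF 00 FF FF")[2]
print(decode_speed_frame(good))  # (127, 0)
print(decode_speed_frame(bad))   # (128, 128), i.e. raw speed 0x0080
```

The second frame decodes to a raw speed of 0x0080 even though the sensor reports 0, which is the spurious bit described above.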

I failed to reproduce the error in the lab, so it is hard for me to connect an oscilloscope and observe what is going on. I observe no error when transmitting the same data on my test network (which is much shorter and simpler than a real one).

Did I miss something in the configuration? How is it possible that bad packets pass through?

Here is what my configuration looks like:

# ifconfig can0
can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          UP RUNNING NOARP  MTU:16  Metric:1
          RX packets:22823 errors:3 dropped:0 overruns:0 frame:3
          TX packets:7 errors:1 dropped:1 overruns:0 carrier:1
          collisions:0 txqueuelen:10 
          RX bytes:182584 (178.3 KiB)  TX bytes:31 (31.0 B)

# ./bin/ip -details link show can0
5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT qlen 10
    link/can 
    can <TRIPLE-SAMPLING> state ERROR-PASSIVE restart-ms 0 
    bitrate 250000 sample-point 0.875 
    tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
    mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
    clock 8000000
    j1939 on
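
As a sanity check on the bit-timing values in the ip output above, the following Python sketch recomputes the bitrate, sample point and prescaler from the reported tq and segment lengths. The function name is hypothetical, and the relation tq = brp / clock is assumed to apply to the "clock" value that ip reports.

```python
def bit_timing(clock_hz, tq_ns, prop_seg, phase_seg1, phase_seg2):
    """Recompute bitrate, sample point and prescaler from CAN bit-timing
    segments. One bit is sync-seg (always 1 tq) + prop-seg + phase-seg1
    + phase-seg2 time quanta."""
    sync_seg = 1
    tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
    bit_time_ns = tq_per_bit * tq_ns
    bitrate = 1_000_000_000 // bit_time_ns
    # The sample point sits at the end of phase-seg1.
    sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit
    # Assumption: tq = brp / clock, so brp = clock * tq.
    brp = clock_hz * tq_ns // 1_000_000_000
    return bitrate, sample_point, brp


# Values taken from the "ip -details link show can0" output above.
print(bit_timing(8_000_000, 250, 6, 7, 2))  # (250000, 0.875, 2)
```

The numbers are self-consistent: 16 time quanta of 250 ns give a 4 µs bit, i.e. 250 kbit/s, sampled at 14/16 = 0.875.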

I’m using a 3.10.17 kernel modified by Intel for their Edison module. I patched it with j1939-v3.10.

Any idea or suggestion would be appreciated!

Thanks,
Julien.

* Re: Bad reading from mcp2515 with j1939
  2016-04-15  9:54 Bad reading from mcp2515 with j1939 Julien Pilet
@ 2016-04-15 10:19 ` Kurt Van Dijck
  2016-04-15 13:02   ` Julien Pilet
  2016-04-15 13:07 ` Ramesh Shanmugasundaram
  1 sibling, 1 reply; 9+ messages in thread
From: Kurt Van Dijck @ 2016-04-15 10:19 UTC (permalink / raw)
  To: Julien Pilet; +Cc: linux-can@vger.kernel.org

> Hi,
> 
> I am using a MCP2515 for a nautical application. It turns out that in my lab, everything goes fine. On two boats, though, I get strange readings from can0.
> 
> I connected the MCP2515 to a sensor reporting a boat speed of 0 m/s.
>   can0  09F50323   [8]  79 00 00 FF FF 00 FF FF
> 
> The first data byte is the sequence ID, it is incremented in each packet.
> The two following bytes encode the boat speed.

If you say so :-)
This kind of application logic is kept outside the kernel.

> 
> candump gives me the following:
> 
>   can0  09F50323   [8]  7A 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7B 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7C 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7D 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7E 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7F 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  80 80 00 FF FF 00 FF FF
>   can0  09F50323   [8]  81 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  82 80 00 FF FF 00 FF FF
>   can0  09F50323   [8]  83 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  84 80 00 FF FF 00 FF FF
>   can0  09F50323   [8]  85 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  86 80 00 FF FF 00 FF FF
> 
> Suddenly, when SID reaches 0x80, the speed is not 0 anymore: it is 0x0080 (encoded little endian 80 00). This is clearly wrong.
> I observed similar problems with sensors of different manufacturers, so I do not think the sensor is faulty. 
> The pattern repeats: for seq ID between 00 and 0x79, everything works fine. For seq ID 0x80 - 0xFA (the last seq ID sent) every packet with an even sequence ID has a spurious bit set in the 2nd byte.

I'm sure that your host does not introduce a problem.
The 80 00 appears on the wire. Your linux kernel can't change that :-)

Can you validate your claim that this is erroneous?
Can you see exactly the same frames on another host?

> I failed to reproduce the error in the lab, so it is hard for me to plug a oscilloscope and observe what is going on. I observe no error when transmitting the same data on my test network (which is much shorter and simpler than a real one).

It is often like that. Problems like to hide themselves.

Kurt

* Re: Bad reading from mcp2515 with j1939
  2016-04-15 10:19 ` Kurt Van Dijck
@ 2016-04-15 13:02   ` Julien Pilet
  2016-04-15 14:52     ` Wolfgang Grandegger
  0 siblings, 1 reply; 9+ messages in thread
From: Julien Pilet @ 2016-04-15 13:02 UTC (permalink / raw)
  To: Kurt Van Dijck; +Cc: linux-can@vger.kernel.org

Hi,

Thanks for your response.

I managed to reproduce the error and validate that the reading is erroneous.

I have one device sending CAN messages and two devices reading: one is my MCP2515, the other is a USB NMEA 2000 adapter from Actisense.

Besides the changing sequence ID (first byte), I expect to always read the same data. So I ran:

./candump can0  | grep 09FD0200 | grep -v "19 00 60 7A FA FF FF"

It selects only the messages I’m interested in (09FD0200) and removes the correct frames. Most frames are correct and are not displayed. A few frames show wrong values, though:
  can0  09FD0200   [8]  AA 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  C2 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  AA 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  AE 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  A6 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  86 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  E6 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  8E 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  8E 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  A2 99 00 60 7A FA FF FF
  can0  09FD0200   [8]  CA 99 00 60 7A FA FF FF

Again, it is the MSB of the second byte that is wrong. It should show 19 00, not 99 00. On the other reading device, I correctly see 19 00. So the problem is in the reading.
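
The comparison done with grep above can also be expressed as a small Python sketch that reports which bits differ from the expected payload. The names EXPECTED_TAIL and corrupted_bits are hypothetical; the expected bytes are the ones from the grep pattern.

```python
# Expected payload after the sequence ID, taken from the grep pattern above.
EXPECTED_TAIL = bytes.fromhex("19 00 60 7A FA FF FF")


def corrupted_bits(data, expected_tail=EXPECTED_TAIL):
    """Return (byte_index, xor_mask) for every payload byte after the
    sequence ID (byte 0) that differs from the expected value."""
    diffs = []
    for i, (got, want) in enumerate(zip(data[1:], expected_tail), start=1):
        if got != want:
            diffs.append((i, got ^ want))
    return diffs


frame = bytes.fromhex("AA 99 00 60 7A FA FF FF")
print(corrupted_bits(frame))  # [(1, 128)]
```

The mask 128 (0x80) at index 1 says that only the most significant bit of the second byte is flipped (0x99 instead of 0x19).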

I tried switching triple-sampling on and off, but it does not change anything.

I observed the same issue on different physical devices, so it is not an isolated issue.

Julien

> On 15 Apr. 2016, at 12:19, Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be> wrote:
> 
> I'm sure that your host does not introduce a problem.
> The 80 00 appears on the wire. Your linux kernel can't change that :-)
> 
> Can you validate your claim that this is erroneous?
> Can you see these frames exactly the same on an other host?


* Re: Bad reading from mcp2515 with j1939
  2016-04-15 13:02   ` Julien Pilet
@ 2016-04-15 14:52     ` Wolfgang Grandegger
  2016-04-15 16:25       ` Julien Pilet
  0 siblings, 1 reply; 9+ messages in thread
From: Wolfgang Grandegger @ 2016-04-15 14:52 UTC (permalink / raw)
  To: Julien Pilet, Kurt Van Dijck; +Cc: linux-can@vger.kernel.org

Hello,

On 15.04.2016 at 15:02, Julien Pilet wrote:
> Hi,
>
> Thanks for your response.
>
> I managed to reproduce the error and validate that the reading is erroneous.
>
> I have one device sending can messages. I have two device reading, one being my mcp2515, the other is a usb nmea2000 adapter from actisense.
>
> Beside the changing sequence id (first byte) I expect to read always the same data. So I ran:
>
> ./candump can0  | grep 09FD0200 | grep -v "19 00 60 7A FA FF FF"
>
> It selects only the messages I’m interested in (09FD0200) and remove correct frames. Most frames are correct and are not displayed. A few frames show wrong values, though:
>    can0  09FD0200   [8]  AA 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  C2 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  AA 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  AE 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  A6 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  86 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  E6 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  8E 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  8E 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  A2 99 00 60 7A FA FF FF
>    can0  09FD0200   [8]  CA 99 00 60 7A FA FF FF
>
> It is again the MSB bit of byte number 2 which is wrong. It should show 19 00, not 99 00. On the other reading device, I see correctly 19 00. So the problem is in the reading.
>
> I tried switching triple-sampling on and off, but it does not change anything.
>
> I observed the same issue on different physical devices, so it is not an isolated issue.

Maybe the hardware does not work properly. Try lowering the SPI bus 
frequency or other hw parameters.

Wolfgang.

>
> Julien
>
>> On 15 Apr. 2016, at 12:19, Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be> wrote:
>>
>> I'm sure that your host does not introduce a problem.
>> The 80 00 appears on the wire. Your linux kernel can't change that :-)
>>
>> Can you validate your claim that this is erroneous?
>> Can you see these frames exactly the same on an other host?
>

* Re: Bad reading from mcp2515 with j1939
  2016-04-15 14:52     ` Wolfgang Grandegger
@ 2016-04-15 16:25       ` Julien Pilet
  2016-04-15 16:34         ` Wolfgang Grandegger
  2016-04-18  4:29         ` Tom Evans
  0 siblings, 2 replies; 9+ messages in thread
From: Julien Pilet @ 2016-04-15 16:25 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Kurt Van Dijck, linux-can@vger.kernel.org

Hi,

Thanks a lot for your help, I found the error. There was a resistor that was not supposed to be there (120 ohms between CANH and CANL). Removing it solved the problem.

Thanks again. Have a great weekend!

Julien.

> On 15 Apr. 2016, at 16:52, Wolfgang Grandegger <wg@grandegger.com> wrote:
> 
> Maybe the hardware does not work properly. Try lowering the SPI bus frequency or other hw parameters.

* Re: Bad reading from mcp2515 with j1939
  2016-04-15 16:25       ` Julien Pilet
@ 2016-04-15 16:34         ` Wolfgang Grandegger
  2016-04-18  5:48           ` Wolfgang Grandegger
  2016-04-18  4:29         ` Tom Evans
  1 sibling, 1 reply; 9+ messages in thread
From: Wolfgang Grandegger @ 2016-04-15 16:34 UTC (permalink / raw)
  To: Julien Pilet; +Cc: Kurt Van Dijck, linux-can@vger.kernel.org

Hello,

On 15.04.2016 at 18:25, Julien Pilet wrote:
> Hi,
>
> Thanks a lot for your help, I found the error. There was a resistor that was not supposed to be there (120ohms between CANH and CANL). Removing it solved the problem.

Well, the mcp2515 should read wrong data even with improper bus termination.

Anyway, have a nice weekend as well.

Wolfgang.

* Re: Bad reading from mcp2515 with j1939
  2016-04-15 16:34         ` Wolfgang Grandegger
@ 2016-04-18  5:48           ` Wolfgang Grandegger
  0 siblings, 0 replies; 9+ messages in thread
From: Wolfgang Grandegger @ 2016-04-18  5:48 UTC (permalink / raw)
  To: Julien Pilet; +Cc: Kurt Van Dijck, linux-can@vger.kernel.org

On 15.04.2016 at 18:34, Wolfgang Grandegger wrote:
> Hello,
>
> On 15.04.2016 at 18:25, Julien Pilet wrote:
>> Hi,
>>
>> Thanks a lot for your help, I found the error. There was a resistor
>> that was not supposed to be there (120ohms between CANH and CANL).
>> Removing it solved the problem.
>
> Well, the mcp2515 should read wrong data even with improper bus
> termination.

s/should/should *not*/. of course!

> Anyway, have a nice weekend as well.
>
> Wolfgang.

* Re: Bad reading from mcp2515 with j1939
  2016-04-15 16:25       ` Julien Pilet
  2016-04-15 16:34         ` Wolfgang Grandegger
@ 2016-04-18  4:29         ` Tom Evans
  1 sibling, 0 replies; 9+ messages in thread
From: Tom Evans @ 2016-04-18  4:29 UTC (permalink / raw)
  To: Julien Pilet, Wolfgang Grandegger
  Cc: Kurt Van Dijck, linux-can@vger.kernel.org

On 16/04/16 02:25, Julien Pilet wrote:
> Hi,
>
> Thanks a lot for your help, I found the error.
> There was a resistor that was not supposed to be
> there (120ohms between CANH and CANL).
> Removing it solved the problem.

Do you mean there was an EXTRA terminating resistor that you removed? The CAN 
bus is specified to have two 120 ohm resistors, normally situated as close to 
the ends of the cable as reasonably possible. Check the diagrams and read the 
"Physical Layer" part of the following for details:

https://en.wikipedia.org/wiki/CAN_bus#Layers

A CAN bus should be able to tolerate three 120 ohm terminators. The On Semi 
AMIS-42770 is specified to work with a minimum termination resistance of 42.5 
ohms. If the bus wiring is high resistance, there are bad connections, or if 
someone has added series resistors or inductors between the transceivers and 
the bus then it might not even work with two terminators. You should check for 
this.
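
For reference, the combined termination seen by the transceivers is just the parallel resistance of the terminators. A minimal sketch (the function name is hypothetical; cable and connector resistance are ignored):

```python
from fractions import Fraction


def parallel_ohms(resistors):
    """Equivalent resistance of resistors connected in parallel,
    computed exactly with fractions."""
    return 1 / sum(Fraction(1, r) for r in resistors)


print(float(parallel_ohms([120, 120])))       # 60.0, the nominal two-terminator bus
print(float(parallel_ohms([120, 120, 120])))  # 40.0, with one extra terminator
```

Note that three ideal 120 ohm terminators come out at 40 ohms, slightly below the 42.5 ohm figure quoted above, so a third terminator leaves very little margin.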

I was very surprised that it was able to receive bad data. There's a 15-bit 
checksum on the data. The only way to have bad data is to have a two-bit or a 
three-bit error that generates the same checksum, or a data bit error and a 
checksum bit error that cancel each other out.
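
To illustrate the checksum argument, here is a sketch of the CAN CRC-15 (generator polynomial 0x4599) applied to the unstuffed bits of a 29-bit data frame. The helper names and the frame-building function are my own illustration, not driver code; bit stuffing and the rest of the frame (CRC delimiter, ACK, EOF) are left out.

```python
CAN_CRC15_POLY = 0x4599  # x^15 + x^14 + x^10 + x^8 + x^7 + x^4 + x^3 + 1


def can_crc15(bits):
    """Bitwise CAN CRC-15 over a sequence of 0/1 values (unstuffed bits)."""
    crc = 0
    for bit in bits:
        feedback = bit ^ ((crc >> 14) & 1)
        crc = (crc << 1) & 0x7FFF
        if feedback:
            crc ^= CAN_CRC15_POLY
    return crc


def int_to_bits(value, width):
    """Most-significant-bit-first list of bits."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def extended_frame_bits(can_id, data):
    """Bits covered by the CRC for an extended data frame:
    SOF, base ID, SRR, IDE, extended ID, RTR, r1, r0, DLC, data."""
    bits = [0]                                       # SOF (dominant)
    bits += int_to_bits((can_id >> 18) & 0x7FF, 11)  # 11-bit base ID
    bits += [1, 1]                                   # SRR, IDE (recessive)
    bits += int_to_bits(can_id & 0x3FFFF, 18)        # 18-bit ID extension
    bits += [0, 0, 0]                                # RTR, r1, r0
    bits += int_to_bits(len(data), 4)                # DLC
    for byte in data:
        bits += int_to_bits(byte, 8)
    return bits


good = bytes.fromhex("80 00 00 FF FF 00 FF FF")
bad = bytes.fromhex("80 80 00 FF FF 00 FF FF")
crc_good = can_crc15(extended_frame_bits(0x09F50323, good))
crc_bad = can_crc15(extended_frame_bits(0x09F50323, bad))
print(crc_good != crc_bad)  # True
```

A single flipped data bit always changes the CRC, so a frame with only that one bit wrong on the wire would normally be rejected by the controller rather than delivered to candump.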

That is extremely unlikely as normally a CAN bus has multiple devices, and 
they're all checking all packets on the wire, all the time. If ANY of them 
detect an error, then they report it back to the wire and that invalidates the 
message, forcing a retransmit.

Do you have only the one transmitter and the one receiver on that bus? If 
there's only the one receiver then it is more likely that a corrupted or noisy 
signal could sneak through bad data. The sensitivity to "every even byte over 
0x80" implies a specific data pattern, stuffing bit and checksum sensitivity.

Do the baud rates match exactly or are they around 1% out? That can cause 
problems related to runs of zeros and stuff bits.
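
For readers unfamiliar with stuff bits: after five consecutive bits of the same level, the transmitter inserts one bit of the opposite level, and that inserted bit counts towards the next run. A minimal sketch of the rule (the function name is hypothetical):

```python
def stuff(bits):
    """Apply CAN bit stuffing to a sequence of 0/1 values: after five
    identical consecutive bits, insert one complementary bit. The stuff
    bit itself starts a new run."""
    out = []
    run_bit = None
    run_len = 0
    for b in bits:
        out.append(b)
        if b == run_bit:
            run_len += 1
        else:
            run_bit, run_len = b, 1
        if run_len == 5:
            stuff_bit = 1 - b
            out.append(stuff_bit)
            run_bit, run_len = stuff_bit, 1
    return out


print(stuff([0] * 10))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```

Long runs of zeros such as "00 00" in the payload therefore rely on these inserted edges for resynchronisation, which is why a clock mismatch tends to show up on such patterns.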

You should monitor TEC and REC if they're available through the interface and 
drivers. It is a boat. What is the common grounding like? You might be getting 
ground shifts between the devices that could be making things worse.

Tom

* RE: Bad reading from mcp2515 with j1939
  2016-04-15  9:54 Bad reading from mcp2515 with j1939 Julien Pilet
  2016-04-15 10:19 ` Kurt Van Dijck
@ 2016-04-15 13:07 ` Ramesh Shanmugasundaram
  1 sibling, 0 replies; 9+ messages in thread
From: Ramesh Shanmugasundaram @ 2016-04-15 13:07 UTC (permalink / raw)
  To: Julien Pilet, linux-can@vger.kernel.org

> Subject: Bad reading from mcp2515 with j1939
> 
> Hi,
> 
> I am using a MCP2515 for a nautical application. It turns out that in my
> lab, everything goes fine. On two boats, though, I get strange readings
> from can0.
> 
> I connected the MCP2515 to a sensor reporting a boat speed of 0 m/s.
>   can0  09F50323   [8]  79 00 00 FF FF 00 FF FF
> 
> The first data byte is the sequence ID, it is incremented in each packet.
> The two following bytes encode the boat speed.
> 
> candump gives me the following:
> 
>   can0  09F50323   [8]  7A 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7B 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7C 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7D 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7E 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  7F 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  80 80 00 FF FF 00 FF FF
>   can0  09F50323   [8]  81 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  82 80 00 FF FF 00 FF FF
>   can0  09F50323   [8]  83 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  84 80 00 FF FF 00 FF FF
>   can0  09F50323   [8]  85 00 00 FF FF 00 FF FF
>   can0  09F50323   [8]  86 80 00 FF FF 00 FF FF
> 
> Suddenly, when SID reaches 0x80, the speed is not 0 anymore: it is 0x0080
> (encoded little endian 80 00). This is clearly wrong.
> I observed similar problems with sensors of different manufacturers, so I
> do not think the sensor is faulty.

On the transmitting node, which application code implements this logic (forming the CAN frame)? Are you sure this code is bug free? "80 00" is what is seen on the wire. If the driver were buggy, you would have seen the same in the test network for the same input data.

> The pattern repeats: for seq ID between 00 and 0x79, everything works
> fine. For seq ID 0x80 - 0xFA (the last seq ID sent) every packet with an
> even sequence ID has a spurious bit set in the 2nd byte.
> 
> I failed to reproduce the error in the lab, so it is hard for me to plug a
> oscilloscope and observe what is going on. I observe no error when
> transmitting the same data on my test network (which is much shorter and
> simpler than a real one).
> 
> Did I miss something in the configuration ? How is it possible that bad
> packets pass through?
> 
> Here’s how my configuration looks like:
> 
> # ifconfig can0
> can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-
> 00-00-00-00
>           UP RUNNING NOARP  MTU:16  Metric:1
>           RX packets:22823 errors:3 dropped:0 overruns:0 frame:3
>           TX packets:7 errors:1 dropped:1 overruns:0 carrier:1

Are these errors/drops seen in your test network too?

>           collisions:0 txqueuelen:10
>           RX bytes:182584 (178.3 KiB)  TX bytes:31 (31.0 B)
> 
> # ./bin/ip -details link show can0
> 5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN
> mode DEFAULT qlen 10
>     link/can
>     can <TRIPLE-SAMPLING> state ERROR-PASSIVE restart-ms 0

ERROR-PASSIVE with only 7 transmitted packets? This means there are a lot of receive errors (you received 22823 frames). The physical link is not good enough (that's why you cannot reproduce it in the lab). Did you check the transmitting node's state and stats?

Coming back to the "80 00" case, did you check the transmitting application's error handling logic? The fact that it happens after 128 frames suggests it may be related to the Tx bus error counter crossing the 127 limit and changing state.

In your lab network, in the middle of a transmission, try pulling the cable between the nodes for a second or two and plugging it back in. Check the transmitting node application's behaviour when it enters the ERROR-PASSIVE state.
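
As a reminder of how the error counters map to the states mentioned above, here is a small sketch. The function name is hypothetical; the 127 and 255 limits are the CAN fault-confinement thresholds, and the warning level at 96 is assumed to follow the usual controller convention.

```python
def can_error_state(tec, rec):
    """Map the transmit (TEC) and receive (REC) error counters to a
    CAN controller state name."""
    if tec > 255:
        return "BUS-OFF"
    if tec > 127 or rec > 127:
        return "ERROR-PASSIVE"
    if tec >= 96 or rec >= 96:  # assumed warning threshold
        return "ERROR-WARNING"
    return "ERROR-ACTIVE"


print(can_error_state(0, 0))    # ERROR-ACTIVE
print(can_error_state(128, 0))  # ERROR-PASSIVE
print(can_error_state(0, 130))  # ERROR-PASSIVE
print(can_error_state(256, 0))  # BUS-OFF
```

A node can therefore become ERROR-PASSIVE from receive errors alone, without having transmitted much.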

Happy debugging,
-Ramesh
