RFC: improve and consolidate state change and bus-off handling

linux-can.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* RFC: improve and consolidate state change and bus-off handling
@ 2011-12-02  9:04 Wolfgang Grandegger
  2011-12-02  9:41 ` Marc Kleine-Budde
  0 siblings, 1 reply; 7+ messages in thread
From: Wolfgang Grandegger @ 2011-12-02  9:04 UTC (permalink / raw)
  To: Linux-can Mailing List, SocketCAN Core Mailing List

Hello,

as you might know, our handling of CAN state changes and bus-off is
not consistent, weak or even incorrect. Therefore I'm making an effort
to improve, consolidate and *unify* it. Most things are straight-
forward, but others need more attention and especially for the bus-off
recovery I would appreciate some CAN expert advice (more below). I have
already some patches implementing:

- Add missing do_get_berr_counter() callbacks (for ti_hecc, etc.).

- Add error counters to the data fields 6..7 of *any* CAN error message
  automatically in alloc_can_err_skb():

       if (priv->do_get_berr_counter) {
               struct can_berr_counter bec;

               priv->do_get_berr_counter(dev, &bec);
               (*cf)->data[6] = bec.txerr;
               (*cf)->data[7] = bec.rxerr;
       }

- Allow state changes going down including "back to error active":

  Therefore I added:

    $ cat include/linux/can/error.h
    ...
    #define CAN_ERR_STATE_CHANGE 0x00000200U /* CAN error state change / data[1] */
    ...
    #define CAN_ERR_CRTL_ACTIVE      0x40 /* recovered to error active state */

  For any state change the CAN_ERR_STATE_CHANGE will be set in the
  can_id. If the state gets worse, CAN_ERR_CRTL is set as usual
  also for backward compatibility. The state change management will
  be done by a common "can_change_state()" function doing all the bit
  settings and counter increments. For the SJA1000 "candump -e" will
  then report for recovery from the error passive state (no cable):

    can0  20000204  [8] 00 08 00 00 00 00 60 00   ERRORFRAME
	controller-problem{tx-error-warning}
	state-change{tx-error-warning}
	error-counter-tx-rx{{96}{0}}
    can0  20000204  [8] 00 30 00 00 00 00 80 00   ERRORFRAME
	controller-problem{tx-error-passive}
	state-change{tx-error-passive}
	error-counter-tx-rx{{128}{0}}
    can0  124  [3] 12 34 56
    ...
    can0  124  [3] 12 34 56
    can0  20000200  [8] 00 08 00 00 00 00 7F 00   ERRORFRAME
	state-change{tx-error-warning}
	error-counter-tx-rx{{127}{0}}
    can0  124  [3] 12 34 56
    ...
    can0  124  [3] 12 34 56
    can0  20000200  [8] 00 40 00 00 00 00 5F 00   ERRORFRAME
	state-change{back-to-error-active}
	error-counter-tx-rx{{95}{0}}

   Updating all drivers correctly is a challenge, especially because I
   do not have all hardware. Help and comments are appreciated.

- Bus-off recovery:

  Currently, I think, we do not handle bus-off recovery correctly for
  most controllers. We brute-force stop and restart the controller.
  The controller will do the recovery cycle anyway and we may send
  messages to early. Instead the software should handle the bus-off
  recovery cycle as shown below:

  * bus-off happens
    - call netif_stop_queue() and maybe disable interrupts

  * automatic or manual restart is done
    - trigger bus-off recovery sequence by resetting the init bit
      (on SJA1000) and maybe re-enable the interrupts
    - await the controller going back to error-active state
      (signaled via interrupt).
    - call netif_wake_queue()

  Here is a "candump -e" output for the SJA1000 (with delta times)

    (009.832477)  can0  20000204  [8] 00 30 00 00 00 00 88 00   ERRORFRAME
	controller-problem{tx-error-passive}
	state-change{tx-error-passive}
	error-counter-tx-rx{{136}{0}}
    (000.000804)  can0  20000240  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
	bus-off
	state-change{}
	error-counter-tx-rx{{127}{0}}
    (000.099795)  can0  20000100  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
	restarted-after-bus-off
	error-counter-tx-rx{{127}{0}}
    (000.003061)  can0  20000200  [8] 00 40 00 00 00 00 00 00   ERRORFRAME
	state-change{back-to-error-active}

   Before doing all the necessary code changes, which are not always
   trivial I ask: Would that be the correct bus-off handling???

Thanks for feedback.

Wolfgang.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: improve and consolidate state change and bus-off handling
  2011-12-02  9:04 RFC: improve and consolidate state change and bus-off handling Wolfgang Grandegger
@ 2011-12-02  9:41 ` Marc Kleine-Budde
  2011-12-02 10:04   ` Wolfgang Grandegger
  2011-12-03  9:28   ` Sebastian Haas
  0 siblings, 2 replies; 7+ messages in thread
From: Marc Kleine-Budde @ 2011-12-02  9:41 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Linux-can Mailing List, SocketCAN Core Mailing List

[-- Attachment #1: Type: text/plain, Size: 5161 bytes --]

On 12/02/2011 10:04 AM, Wolfgang Grandegger wrote:
> Hello,
> 
> as you might know, our handling of CAN state changes and bus-off is
> not consistent, weak or even incorrect. Therefore I'm making an effort
> to improve, consolidate and *unify* it. Most things are straight-
> forward, but others need more attention and especially for the bus-off
> recovery I would appreciate some CAN expert advice (more below). I have
> already some patches implementing:
> 
> - Add missing do_get_berr_counter() callbacks (for ti_hecc, etc.).

+1

> - Add error counters to the data fields 6..7 of *any* CAN error message
>   automatically in alloc_can_err_skb():
> 
>        if (priv->do_get_berr_counter) {
>                struct can_berr_counter bec;
> 
>                priv->do_get_berr_counter(dev, &bec);
>                (*cf)->data[6] = bec.txerr;
>                (*cf)->data[7] = bec.rxerr;
>        }

What about some not directly connected devices like the mcp251x. At
least the mcp2515 driver (which is not mainline, though) needs a spi
transfer for that.

Do we need a flag in the driver to indicate not to read the berr_counter?

> - Allow state changes going down including "back to error active":
> 
>   Therefore I added:
> 
>     $ cat include/linux/can/error.h
>     ...
>     #define CAN_ERR_STATE_CHANGE 0x00000200U /* CAN error state change / data[1] */
>     ...
>     #define CAN_ERR_CRTL_ACTIVE      0x40 /* recovered to error active state */
>  
>   For any state change the CAN_ERR_STATE_CHANGE will be set in the
>   can_id. If the state gets worse, CAN_ERR_CRTL is set as usual
>   also for backward compatibility. The state change management will
>   be done by a common "can_change_state()" function doing all the bit

yeah! common function! +1

>   settings and counter increments. For the SJA1000 "candump -e" will
>   then report for recovery from the error passive state (no cable):
> 
>     can0  20000204  [8] 00 08 00 00 00 00 60 00   ERRORFRAME
> 	controller-problem{tx-error-warning}
> 	state-change{tx-error-warning}
> 	error-counter-tx-rx{{96}{0}}
>     can0  20000204  [8] 00 30 00 00 00 00 80 00   ERRORFRAME
> 	controller-problem{tx-error-passive}
> 	state-change{tx-error-passive}
> 	error-counter-tx-rx{{128}{0}}
>     can0  124  [3] 12 34 56
>     ...
>     can0  124  [3] 12 34 56
>     can0  20000200  [8] 00 08 00 00 00 00 7F 00   ERRORFRAME
> 	state-change{tx-error-warning}
> 	error-counter-tx-rx{{127}{0}}
>     can0  124  [3] 12 34 56
>     ...
>     can0  124  [3] 12 34 56
>     can0  20000200  [8] 00 40 00 00 00 00 5F 00   ERRORFRAME
> 	state-change{back-to-error-active}
> 	error-counter-tx-rx{{95}{0}}
>  
>    Updating all drivers correctly is a challenge, especially because I
>    do not have all hardware. Help and comments are appreciated.

I can test the at91_can, flexcan and if we're lucky we've a mcp251x in
the office.

> - Bus-off recovery:
> 
>   Currently, I think, we do not handle bus-off recovery correctly for
>   most controllers. We brute-force stop and restart the controller.
>   The controller will do the recovery cycle anyway and we may send
>   messages to early. Instead the software should handle the bus-off
>   recovery cycle as shown below:
> 
>   * bus-off happens
>     - call netif_stop_queue() and maybe disable interrupts
> 
>   * automatic or manual restart is done
>     - trigger bus-off recovery sequence by resetting the init bit
>       (on SJA1000) and maybe re-enable the interrupts
>     - await the controller going back to error-active state
>       (signaled via interrupt).

I'm not sure if all controllers signal correctly that they are back in
error active. My observation is that bus off handling is a bit like
climbing the mount Everest, the air is quite thin and things can lock up
quite fast.

>     - call netif_wake_queue()
> 
>   Here is a "candump -e" output for the SJA1000 (with delta times)
> 
>     (009.832477)  can0  20000204  [8] 00 30 00 00 00 00 88 00   ERRORFRAME
> 	controller-problem{tx-error-passive}
> 	state-change{tx-error-passive}
> 	error-counter-tx-rx{{136}{0}}
>     (000.000804)  can0  20000240  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
> 	bus-off
> 	state-change{}
> 	error-counter-tx-rx{{127}{0}}
>     (000.099795)  can0  20000100  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
> 	restarted-after-bus-off
> 	error-counter-tx-rx{{127}{0}}
>     (000.003061)  can0  20000200  [8] 00 40 00 00 00 00 00 00   ERRORFRAME
> 	state-change{back-to-error-active}
> 
>    Before doing all the necessary code changes, which are not always
>    trivial I ask: Would that be the correct bus-off handling???

However if hardware permits the described steps sound reasonable (from
my non CAN expert point of view).

> Thanks for feedback.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: improve and consolidate state change and bus-off handling
  2011-12-02  9:41 ` Marc Kleine-Budde
@ 2011-12-02 10:04   ` Wolfgang Grandegger
  2011-12-03  9:28   ` Sebastian Haas
  1 sibling, 0 replies; 7+ messages in thread
From: Wolfgang Grandegger @ 2011-12-02 10:04 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: SocketCAN Core Mailing List, Linux-can Mailing List

On 12/02/2011 10:41 AM, Marc Kleine-Budde wrote:
> On 12/02/2011 10:04 AM, Wolfgang Grandegger wrote:
>> Hello,
>>
>> as you might know, our handling of CAN state changes and bus-off is
>> not consistent, weak or even incorrect. Therefore I'm making an effort
>> to improve, consolidate and *unify* it. Most things are straight-
>> forward, but others need more attention and especially for the bus-off
>> recovery I would appreciate some CAN expert advice (more below). I have
>> already some patches implementing:
>>
>> - Add missing do_get_berr_counter() callbacks (for ti_hecc, etc.).
> 
> +1
> 
>> - Add error counters to the data fields 6..7 of *any* CAN error message
>>   automatically in alloc_can_err_skb():
>>
>>        if (priv->do_get_berr_counter) {
>>                struct can_berr_counter bec;
>>
>>                priv->do_get_berr_counter(dev, &bec);
>>                (*cf)->data[6] = bec.txerr;
>>                (*cf)->data[7] = bec.rxerr;
>>        }
> 
> What about some not directly connected devices like the mcp251x. At
> least the mcp2515 driver (which is not mainline, though) needs a spi
> transfer for that.
> 
> Do we need a flag in the driver to indicate not to read the berr_counter?

We could introduce a return code for the do_get_berr_counter callback
and handle it.

>> - Allow state changes going down including "back to error active":
>>
>>   Therefore I added:
>>
>>     $ cat include/linux/can/error.h
>>     ...
>>     #define CAN_ERR_STATE_CHANGE 0x00000200U /* CAN error state change / data[1] */
>>     ...
>>     #define CAN_ERR_CRTL_ACTIVE      0x40 /* recovered to error active state */
>>  
>>   For any state change the CAN_ERR_STATE_CHANGE will be set in the
>>   can_id. If the state gets worse, CAN_ERR_CRTL is set as usual
>>   also for backward compatibility. The state change management will
>>   be done by a common "can_change_state()" function doing all the bit
> 
> yeah! common function! +1
> 
>>   settings and counter increments. For the SJA1000 "candump -e" will
>>   then report for recovery from the error passive state (no cable):
>>
>>     can0  20000204  [8] 00 08 00 00 00 00 60 00   ERRORFRAME
>> 	controller-problem{tx-error-warning}
>> 	state-change{tx-error-warning}
>> 	error-counter-tx-rx{{96}{0}}
>>     can0  20000204  [8] 00 30 00 00 00 00 80 00   ERRORFRAME
>> 	controller-problem{tx-error-passive}
>> 	state-change{tx-error-passive}
>> 	error-counter-tx-rx{{128}{0}}
>>     can0  124  [3] 12 34 56
>>     ...
>>     can0  124  [3] 12 34 56
>>     can0  20000200  [8] 00 08 00 00 00 00 7F 00   ERRORFRAME
>> 	state-change{tx-error-warning}
>> 	error-counter-tx-rx{{127}{0}}
>>     can0  124  [3] 12 34 56
>>     ...
>>     can0  124  [3] 12 34 56
>>     can0  20000200  [8] 00 40 00 00 00 00 5F 00   ERRORFRAME
>> 	state-change{back-to-error-active}
>> 	error-counter-tx-rx{{95}{0}}
>>  
>>    Updating all drivers correctly is a challenge, especially because I
>>    do not have all hardware. Help and comments are appreciated.
> 
> I can test the at91_can, flexcan and if we're lucky we've a mcp251x in
> the office.

Good, I can test here MSCAN/MPC5200 and maybe MPC51xx , CC770/I82527 on
my old MPC8xx board and SJA1000. And maybe also Flexcan on a i.MX28.

>> - Bus-off recovery:
>>
>>   Currently, I think, we do not handle bus-off recovery correctly for
>>   most controllers. We brute-force stop and restart the controller.
>>   The controller will do the recovery cycle anyway and we may send
>>   messages to early. Instead the software should handle the bus-off
>>   recovery cycle as shown below:
>>
>>   * bus-off happens
>>     - call netif_stop_queue() and maybe disable interrupts
>>
>>   * automatic or manual restart is done
>>     - trigger bus-off recovery sequence by resetting the init bit
>>       (on SJA1000) and maybe re-enable the interrupts
>>     - await the controller going back to error-active state
>>       (signaled via interrupt).
> 
> I'm not sure if all controllers signal correctly that they are back in
> error active. My observation is that bus off handling is a bit like
> climbing the mount Everest, the air is quite thin and things can lock up
> quite fast.

I know, unfortunately. But SJA1000, CC770/i82527 and MSCAN behave like
expected and for the controller doing the recovery automatically in
hardware, like the at91?, we need to await back to error active anyway.
Anyway, we simple do not touch controllers with magic bus-off handling.

> 
>>     - call netif_wake_queue()
>>
>>   Here is a "candump -e" output for the SJA1000 (with delta times)
>>
>>     (009.832477)  can0  20000204  [8] 00 30 00 00 00 00 88 00   ERRORFRAME
>> 	controller-problem{tx-error-passive}
>> 	state-change{tx-error-passive}
>> 	error-counter-tx-rx{{136}{0}}
>>     (000.000804)  can0  20000240  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>> 	bus-off
>> 	state-change{}
>> 	error-counter-tx-rx{{127}{0}}
>>     (000.099795)  can0  20000100  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>> 	restarted-after-bus-off
>> 	error-counter-tx-rx{{127}{0}}
>>     (000.003061)  can0  20000200  [8] 00 40 00 00 00 00 00 00   ERRORFRAME
>> 	state-change{back-to-error-active}
>>
>>    Before doing all the necessary code changes, which are not always
>>    trivial I ask: Would that be the correct bus-off handling???
> 
> However if hardware permits the described steps sound reasonable (from
> my non CAN expert point of view).

On the SJA1000 and i82527 I realized that the hardware does the bus-off
recovery cycle anyway, even if you stop and restart the controller.

Patches coming soon...

Wolfgang.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: improve and consolidate state change and bus-off handling
  2011-12-02  9:41 ` Marc Kleine-Budde
  2011-12-02 10:04   ` Wolfgang Grandegger
@ 2011-12-03  9:28   ` Sebastian Haas
       [not found]     ` <4ED9EBAB.4060701-xpvPi5bcW5W9w4jpWW8B1qHonnlKzd3f@public.gmane.org>
  1 sibling, 1 reply; 7+ messages in thread
From: Sebastian Haas @ 2011-12-03  9:28 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Wolfgang Grandegger, Linux-can Mailing List,
	SocketCAN Core Mailing List

Hello,

Am 02.12.2011 10:41, schrieb Marc Kleine-Budde:
> On 12/02/2011 10:04 AM, Wolfgang Grandegger wrote:
>> - Bus-off recovery:
>>
>>    Currently, I think, we do not handle bus-off recovery correctly for
>>    most controllers. We brute-force stop and restart the controller.
>>    The controller will do the recovery cycle anyway and we may send
>>    messages to early. Instead the software should handle the bus-off
>>    recovery cycle as shown below:
>>
>>    * bus-off happens
>>      - call netif_stop_queue() and maybe disable interrupts
>>
>>    * automatic or manual restart is done
Short note from me about that. We should be carefully with automatic 
recovery at driver-level, since this is basically not covered by the 
standard (only the application is allowed to decide whether or not the 
controller shall go back to error active). A bus off is an unusual state 
in a CAN network, because if this happens the network is physically in a 
bad shape (cable length, stub lines, electrical defects) or has a 
logical error (identifier conflicts, different bit-rates). So in my 
opinion the driver shall not allow anyway of automatic recovery and 
shall be up to the application to decide what to do.

>>      - trigger bus-off recovery sequence by resetting the init bit
>>        (on SJA1000) and maybe re-enable the interrupts
>>      - await the controller going back to error-active state
>>        (signaled via interrupt).
>
> I'm not sure if all controllers signal correctly that they are back in
> error active. My observation is that bus off handling is a bit like
> climbing the mount Everest, the air is quite thin and things can lock up
> quite fast.
In a worst case you need to poll for the error state if the controller 
doesn't support correct signalling. But I expect most controllers to 
implement such a feature since this is quite essential for most 
application to know the controller got bus-on/error active.

>>      - call netif_wake_queue()
>>
>>    Here is a "candump -e" output for the SJA1000 (with delta times)
>>
>>      (009.832477)  can0  20000204  [8] 00 30 00 00 00 00 88 00   ERRORFRAME
>> 	controller-problem{tx-error-passive}
>> 	state-change{tx-error-passive}
>> 	error-counter-tx-rx{{136}{0}}
>>      (000.000804)  can0  20000240  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>> 	bus-off
>> 	state-change{}
>> 	error-counter-tx-rx{{127}{0}}
>>      (000.099795)  can0  20000100  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>> 	restarted-after-bus-off
>> 	error-counter-tx-rx{{127}{0}}
>>      (000.003061)  can0  20000200  [8] 00 40 00 00 00 00 00 00   ERRORFRA
>> 	state-change{back-to-error-active}
>>
>>     Before doing all the necessary code changes, which are not always
>>     trivial I ask: Would that be the correct bus-off handling???
>
> However if hardware permits the described steps sound reasonable (from
> my non CAN expert point of view).
Despite the fact that you misuse a CAN message structure to carry CAN 
error frames, the overall procedure looks quite good.


Cheers,
  Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: improve and consolidate state change and bus-off handling
       [not found]     ` <4ED9EBAB.4060701-xpvPi5bcW5W9w4jpWW8B1qHonnlKzd3f@public.gmane.org>
@ 2011-12-03 10:18       ` Oliver Hartkopp
  2011-12-03 11:53         ` Sebastian Haas
  0 siblings, 1 reply; 7+ messages in thread
From: Oliver Hartkopp @ 2011-12-03 10:18 UTC (permalink / raw)
  To: Sebastian Haas
  Cc: SocketCAN Core Mailing List, Marc Kleine-Budde,
	Wolfgang Grandegger, Linux-can Mailing List

Hello Sebastian,

nice to read that you're back on air ;-)

On 03.12.2011 10:28, Sebastian Haas wrote:

> Hello,
> 
> Am 02.12.2011 10:41, schrieb Marc Kleine-Budde:
>> On 12/02/2011 10:04 AM, Wolfgang Grandegger wrote:
>>> - Bus-off recovery:
>>>
>>>    Currently, I think, we do not handle bus-off recovery correctly for
>>>    most controllers. We brute-force stop and restart the controller.
>>>    The controller will do the recovery cycle anyway and we may send
>>>    messages to early. Instead the software should handle the bus-off
>>>    recovery cycle as shown below:
>>>
>>>    * bus-off happens
>>>      - call netif_stop_queue() and maybe disable interrupts
>>>
>>>    * automatic or manual restart is done
> Short note from me about that. We should be carefully with automatic recovery
> at driver-level, since this is basically not covered by the standard (only the
> application is allowed to decide whether or not the controller shall go back
> to error active). A bus off is an unusual state in a CAN network, because if
> this happens the network is physically in a bad shape (cable length, stub
> lines, electrical defects) or has a logical error (identifier conflicts,
> different bit-rates). So in my opinion the driver shall not allow anyway of
> automatic recovery and shall be up to the application to decide what to do.

You are right. For that reason the automatic restart is an option. You're
still able to define one handler on the multi-user system to cope with
bus-off. OTOH you can enable the automatic recovery when you know that you are
working in a proved CAN setup (regarding cable & bittiming issues) to handle
other temporary issues (EMC/hotplug). E.g. in some vehicle setups an automatic
recovery from bus-off after 200ms is specified. If this is done by a 'special'
handler or just by the driver can not be seen from the network view.

> 
>>>      - trigger bus-off recovery sequence by resetting the init bit
>>>        (on SJA1000) and maybe re-enable the interrupts
>>>      - await the controller going back to error-active state
>>>        (signaled via interrupt).
>>
>> I'm not sure if all controllers signal correctly that they are back in
>> error active. My observation is that bus off handling is a bit like
>> climbing the mount Everest, the air is quite thin and things can lock up
>> quite fast.
> In a worst case you need to poll for the error state if the controller doesn't
> support correct signalling. But I expect most controllers to implement such a
> feature since this is quite essential for most application to know the
> controller got bus-on/error active.

Yes. Indeed some CAN controllers recover themselves (IIRC mscan) and others
don't (SJA1000).  If a controller recovers silently (and no CAN frames come
from the outside to detect it) the state would need to be polled to be able to
send frames from the local host ...

> 
>>>      - call netif_wake_queue()
>>>
>>>    Here is a "candump -e" output for the SJA1000 (with delta times)
>>>
>>>      (009.832477)  can0  20000204  [8] 00 30 00 00 00 00 88 00   ERRORFRAME
>>>     controller-problem{tx-error-passive}
>>>     state-change{tx-error-passive}
>>>     error-counter-tx-rx{{136}{0}}
>>>      (000.000804)  can0  20000240  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>>>     bus-off
>>>     state-change{}
>>>     error-counter-tx-rx{{127}{0}}
>>>      (000.099795)  can0  20000100  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>>>     restarted-after-bus-off
>>>     error-counter-tx-rx{{127}{0}}
>>>      (000.003061)  can0  20000200  [8] 00 40 00 00 00 00 00 00   ERRORFRA
>>>     state-change{back-to-error-active}
>>>
>>>     Before doing all the necessary code changes, which are not always
>>>     trivial I ask: Would that be the correct bus-off handling???
>>
>> However if hardware permits the described steps sound reasonable (from
>> my non CAN expert point of view).

To me too. Nice rework idea.

> Despite the fact that you misuse a CAN message structure to carry CAN error
> frames, the overall procedure looks quite good.

From a todays view the CAN controller states could also be transferred into
the userspace by using recvmsg() that currently supports to push the timestamp
and the dropcounter within one syscall to the user.

The idea to get the error message through the data path (when it's enabled
with a special filter) is, that the CAN controller states fit into the logfile
and have their own timestamps - even when there's no valid CAN frame
receiption in that time. E.g. if you look into some Vector ASC logfiles the
problems detected by the CAN controller are also put inside the log at the
appropriate place. The reception of CAN data and CAN controller states in-line
with timestamps has real debugging advantages. Although it's easy to implement
in the interrupt handler whether you have a correct RX-interrupt or an error
interrupt -> you create a skb and push it into the system :-)

Regards,
Oliver

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: improve and consolidate state change and bus-off handling
  2011-12-03 10:18       ` Oliver Hartkopp
@ 2011-12-03 11:53         ` Sebastian Haas
  2011-12-03 16:53           ` Oliver Hartkopp
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Haas @ 2011-12-03 11:53 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: Marc Kleine-Budde, Wolfgang Grandegger, Linux-can Mailing List,
	SocketCAN Core Mailing List

Hello,

Am 03.12.2011 11:18, schrieb Oliver Hartkopp:
> Hello Sebastian,
>
> nice to read that you're back on air ;-)
I've started to implement a kind of a restbus simulation for Linux and I 
stumbled again over SocketCAN ;-) So I'm back on air.

> On 03.12.2011 10:28, Sebastian Haas wrote:
>> Am 02.12.2011 10:41, schrieb Marc Kleine-Budde:
>>> On 12/02/2011 10:04 AM, Wolfgang Grandegger wrote:
>>>> - Bus-off recovery:
>>>>
>>>>     Currently, I think, we do not handle bus-off recovery correctly for
>>>>     most controllers. We brute-force stop and restart the controller.
>>>>     The controller will do the recovery cycle anyway and we may send
>>>>     messages to early. Instead the software should handle the bus-off
>>>>     recovery cycle as shown below:
>>>>
>>>>     * bus-off happens
>>>>       - call netif_stop_queue() and maybe disable interrupts
>>>>
>>>>     * automatic or manual restart is done
>> Short note from me about that. We should be carefully with automatic recovery
>> at driver-level, since this is basically not covered by the standard (only the
>> application is allowed to decide whether or not the controller shall go back
>> to error active). A bus off is an unusual state in a CAN network, because if
>> this happens the network is physically in a bad shape (cable length, stub
>> lines, electrical defects) or has a logical error (identifier conflicts,
>> different bit-rates). So in my opinion the driver shall not allow anyway of
>> automatic recovery and shall be up to the application to decide what to do.
>
>
> You are right. For that reason the automatic restart is an option. You're
> still able to define one handler on the multi-user system to cope with
> bus-off. OTOH you can enable the automatic recovery when you know that you are
> working in a proved CAN setup (regarding cable&  bittiming issues) to handle
> other temporary issues (EMC/hotplug). E.g. in some vehicle setups an automatic
> recovery from bus-off after 200ms is specified. If this is done by a 'special'
> handler or just by the driver can not be seen from the network view.
I'm not very familiar with the current API, but how do I as an 
application restart recovery manually. I suppose there is ioctl().

>>
>>>>       - trigger bus-off recovery sequence by resetting the init bit
>>>>         (on SJA1000) and maybe re-enable the interrupts
>>>>       - await the controller going back to error-active state
>>>>         (signaled via interrupt).
>>>
>>> I'm not sure if all controllers signal correctly that they are back in
>>> error active. My observation is that bus off handling is a bit like
>>> climbing the mount Everest, the air is quite thin and things can lock up
>>> quite fast.
>> In a worst case you need to poll for the error state if the controller doesn't
>> support correct signalling. But I expect most controllers to implement such a
>> feature since this is quite essential for most application to know the
>> controller got bus-on/error active.
>
>
> Yes. Indeed some CAN controllers recover themselves (IIRC mscan) and others
> don't (SJA1000).  If a controller recovers silently (and no CAN frames come
> from the outside to detect it) the state would need to be polled to be able to
> send frames from the local host ...
I guess it is possible to disable a automatic recovery for e.g. mscan? 
But I got your point, there are for sure a lot of controllers in the 
field which may behave differently to the old ones like SJA1000 and 
CC770. IMHO we do not need to support non-standard conform controllers 
(auto recovery without user interaction violates the standard).

>>>>       - call netif_wake_queue()
>>>>
>>>>     Here is a "candump -e" output for the SJA1000 (with delta times)
>>>>
>>>>       (009.832477)  can0  20000204  [8] 00 30 00 00 00 00 88 00   ERRORFRAME
>>>>      controller-problem{tx-error-passive}
>>>>      state-change{tx-error-passive}
>>>>      error-counter-tx-rx{{136}{0}}
>>>>       (000.000804)  can0  20000240  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>>>>      bus-off
>>>>      state-change{}
>>>>      error-counter-tx-rx{{127}{0}}
>>>>       (000.099795)  can0  20000100  [8] 00 00 00 00 00 00 7F 00   ERRORFRAME
>>>>      restarted-after-bus-off
>>>>      error-counter-tx-rx{{127}{0}}
>>>>       (000.003061)  can0  20000200  [8] 00 40 00 00 00 00 00 00   ERRORFRA
>>>>      state-change{back-to-error-active}
>>>>
>>>>      Before doing all the necessary code changes, which are not always
>>>>      trivial I ask: Would that be the correct bus-off handling???
>>>
>>> However if hardware permits the described steps sound reasonable (from
>>> my non CAN expert point of view).
>
>
> To me too. Nice rework idea.
Yepp, I like it too. Unfortunately I don't have access the CAN hardware 
yet, so I can't support the testing.

>> Despite the fact that you misuse a CAN message structure to carry CAN error
>> frames, the overall procedure looks quite good.
>
>
>  From a todays view the CAN controller states could also be transferred into
> the userspace by using recvmsg() that currently supports to push the timestamp
> and the dropcounter within one syscall to the user.
>
> The idea to get the error message through the data path (when it's enabled
> with a special filter) is, that the CAN controller states fit into the logfile
> and have their own timestamps - even when there's no valid CAN frame
> receiption in that time. E.g. if you look into some Vector ASC logfiles the
> problems detected by the CAN controller are also put inside the log at the
> appropriate place. The reception of CAN data and CAN controller states in-line
> with timestamps has real debugging advantages. Although it's easy to implement
> in the interrupt handler whether you have a correct RX-interrupt or an error
> interrupt ->  you create a skb and push it into the system :-)
I don't argue against delivering error frames via the data path. Indeed 
my former company did that as well. But it was easier to work with error 
frames since they could be accessed by a specific structure. So the 
programmer didn't need to know in which field the rx or tx error counter 
is, this is not very intuitve and flexible to me.

But anyway, this was just something I stumbled about and if a 
programmers know that, it might be ok.

Cheers,
  Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: improve and consolidate state change and bus-off handling
  2011-12-03 11:53         ` Sebastian Haas
@ 2011-12-03 16:53           ` Oliver Hartkopp
  0 siblings, 0 replies; 7+ messages in thread
From: Oliver Hartkopp @ 2011-12-03 16:53 UTC (permalink / raw)
  To: Sebastian Haas; +Cc: Linux-can Mailing List

On 03.12.2011 12:53, Sebastian Haas wrote:

> Hello,
> 
> Am 03.12.2011 11:18, schrieb Oliver Hartkopp:
>> Hello Sebastian,
>>
>> nice to read that you're back on air ;-)
> I've started to implement a kind of a restbus simulation for Linux and I
> stumbled again over SocketCAN ;-) So I'm back on air.


Fine. When you set up a rest bus simulation probably the can-gw available in
Linux 3.2 and up is valuable for your use case:

http://patchwork.ozlabs.org/patch/112916/


>> E.g. in some vehicle setups an automatic
>> recovery from bus-off after 200ms is specified. If this is done by a 'special'
>> handler or just by the driver can not be seen from the network view.
> I'm not very familiar with the current API, but how do I as an application
> restart recovery manually. I suppose there is ioctl().


No this if done by 'ip' from the iproute2 package:

See http://lxr.linux.no/#linux+v3.1.4/Documentation/networking/can.txt#L646

Regards,
Oliver

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-12-03 16:53 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-02  9:04 RFC: improve and consolidate state change and bus-off handling Wolfgang Grandegger
2011-12-02  9:41 ` Marc Kleine-Budde
2011-12-02 10:04   ` Wolfgang Grandegger
2011-12-03  9:28   ` Sebastian Haas
     [not found]     ` <4ED9EBAB.4060701-xpvPi5bcW5W9w4jpWW8B1qHonnlKzd3f@public.gmane.org>
2011-12-03 10:18       ` Oliver Hartkopp
2011-12-03 11:53         ` Sebastian Haas
2011-12-03 16:53           ` Oliver Hartkopp

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).