public inbox for linux-can@vger.kernel.org
* MSG_CONFIRM RX messages with SocketCAN known as unreliable under heavy load?
@ 2021-06-17 12:22 Harald Mommer
  2021-06-18  9:16 ` Marc Kleine-Budde
From: Harald Mommer @ 2021-06-17 12:22 UTC (permalink / raw)
  To: linux-can

Hello,

we are currently in the process of developing a draft specification for 
Virtio CAN. In the scope of this work I am developing a Virtio CAN Linux 
driver and a Virtio CAN Linux device running on top of our hypervisor 
solution.

The Virtio CAN Linux device forwards an existing SocketCAN CAN device 
(currently vcan) via Virtio to the Virtio driver guest so that the 
virtual driver guest can send and receive CAN frames via SocketCAN.

What was originally planned (probably with too much AUTOSAR CAN driver 
semantics in my head and too little SocketCAN knowledge) was to mark a 
transmission request as used (done) only when it has finally been sent on 
the CAN bus (as opposed to when it is handed to SocketCAN, where it is not 
really done but still pending somewhere in the protocol stack).

I thought this was doable with some implementation effort using

setsockopt(..., SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, ...) and evaluating 
the MSG_CONFIRM bit on received messages.
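
A minimal sketch of that approach, written in Python because its socket module exposes the same SocketCAN options. The helper names (open_raw_can, parse_frame, receive_one) are made up for illustration, and the fallback value for MSG_CONFIRM is the Linux one, included only so that the helpers can be imported on other platforms.

```python
import socket
import struct

# Linux value as fallback so the pure helpers also import elsewhere.
MSG_CONFIRM = getattr(socket, "MSG_CONFIRM", 0x800)

# struct can_frame: can_id (u32), length (u8), 3 padding bytes, data[8]
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)


def open_raw_can(ifname):
    """Open a CAN_RAW socket that also receives the frames it sent itself."""
    sock = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    sock.setsockopt(socket.SOL_CAN_RAW, socket.CAN_RAW_RECV_OWN_MSGS, 1)
    sock.bind((ifname,))
    return sock


def parse_frame(data):
    """Split a raw struct can_frame into (can_id, payload)."""
    can_id, length, payload = struct.unpack(CAN_FRAME_FMT, data)
    return can_id, payload[:length]


def is_tx_confirmation(msg_flags):
    """True if recvmsg() flagged the frame as an echo of our own transmission."""
    return bool(msg_flags & MSG_CONFIRM)


def receive_one(sock):
    """Receive one frame and report whether it confirms one of our own TX frames."""
    data, _ancdata, msg_flags, _addr = sock.recvmsg(CAN_FRAME_SIZE)
    can_id, payload = parse_frame(data)
    return can_id, payload, is_tx_confirmation(msg_flags)
```

With a vcan0 interface present, open_raw_can("vcan0") followed by a send() and receive_one() should return the sent frame with the confirmation flag set, assuming the echo is not dropped on the way.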

This works fine with

cangen -g 0 -i can0

on the driver side sending CAN messages to the device guest. No 
confirmation is lost testing for several minutes.

Adding now on the device side a

cangen -g 0 -i vcan0

sending messages like crazy from the device side guest to the driver 
side guest in parallel, I'm losing TX confirmations in the Linux CAN 
stack. There also seems to be no other error indication (CAN_ERR_FLAG) 
that something like this happened. The virtio CAN device runs out of 
resources and TX becomes stuck, which is not really acceptable even 
for such a heavy load situation (-g 0 on both sides).
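
For reference, a sketch of how error message frames are requested and recognised on a CAN_RAW socket. The function names are hypothetical; the constants are the values from linux/can.h and linux/can/error.h. Whether a given interface ever generates such frames is driver-dependent, and nothing here implies that a lost echo would be reported this way.

```python
import socket

# Values from linux/can.h, used as fallbacks where the socket module lacks them.
CAN_ERR_FLAG = getattr(socket, "CAN_ERR_FLAG", 0x20000000)
CAN_ERR_MASK = getattr(socket, "CAN_ERR_MASK", 0x1FFFFFFF)
CAN_ERR_CRTL = 0x00000004  # error class "controller problems" (linux/can/error.h)


def enable_error_frames(sock):
    """Ask a CAN_RAW socket to deliver every class of error message frame."""
    sock.setsockopt(socket.SOL_CAN_RAW, socket.CAN_RAW_ERR_FILTER, CAN_ERR_MASK)


def is_error_frame(can_id):
    """True if the received can_id marks an error message frame."""
    return bool(can_id & CAN_ERR_FLAG)


def error_classes(can_id):
    """Return the error class bits of an error frame, or 0 for a data frame."""
    return can_id & CAN_ERR_MASK if is_error_frame(can_id) else 0
```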

Is CAN_RAW_RECV_OWN_MSGS / MSG_CONFIRM known to be unreliable (meaning 
that MSG_CONFIRM messages are dropped) under extreme load? If so, is 
there a way to reliably detect that this happened, so that some recovery 
mechanism for the pending TX acknowledgements could be implemented?
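
One candidate for such detection is the generic SO_RXQ_OVFL socket option, which candump uses to report dropped frames. The sketch below reads that per-socket drop counter from the recvmsg() ancillary data. It is an assumption, not something confirmed in this thread, that dropped MSG_CONFIRM echoes show up in this counter. The helper names are hypothetical, and 40 is the option's value on common Linux architectures such as x86 and ARM.

```python
import socket
import struct

# Not exposed by every Python build; 40 is the asm-generic Linux value.
SO_RXQ_OVFL = getattr(socket, "SO_RXQ_OVFL", 40)


def enable_drop_counter(sock):
    """Ask the kernel to attach the socket's drop counter to received messages."""
    sock.setsockopt(socket.SOL_SOCKET, SO_RXQ_OVFL, 1)


def drop_count(ancdata):
    """Return the cumulative drop counter found in ancillary data, or None."""
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == SO_RXQ_OVFL and len(cdata) >= 4:
            return struct.unpack("=I", cdata[:4])[0]
    return None


def new_drops(previous, current):
    """Frames dropped between two counter readings (32-bit wraparound safe)."""
    return (current - previous) & 0xFFFFFFFF


def receive_with_drops(sock, frame_size=16):
    """Receive one frame together with the drop counter, if the kernel sent one."""
    data, ancdata, msg_flags, _addr = sock.recvmsg(frame_size, socket.CMSG_SPACE(4))
    return data, msg_flags, drop_count(ancdata)
```

A non-zero new_drops() result would only say that some frames were lost on this socket, not which ones, so any recovery would still have to treat all pending TX requests as suspect.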

I'm aware that "normal" RX messages from other nodes may be dropped due 
to overload. No problem with this.

The timing requirement originally set (done when sent on the CAN bus) has 
to be weakened or put under a feature flag if it cannot be implemented 
reliably in all environments. But before declaring it "not reliably 
implementable with Linux SocketCAN" I would like to be sure that this 
is really the case and that absolutely nothing can be done about it. It 
could even be that I missed an additional setting I'm not aware of. But 
the observed behavior may just as well be something which is known to 
everyone except me.

Of course there may still be a bug in my software, but I have checked 
this carefully and I'm now convinced that under heavy load 
MSG_CONFIRM messages are lost somewhere in the Linux SocketCAN protocol 
stack. If there's no way to recover from this situation, I have to weaken 
the next Virtio CAN draft specification regarding the TX ACK 
timing. As this has some additional impact on the specification, before 
doing so I would like to be really sure that the TX ACK timing cannot be 
done reliably the way it was originally planned.

Regards
Harald
-- 
Dipl.-Ing. Harald Mommer
Senior Software Engineer

OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin

Phone:  +49 (30) 60 98 540-0 <== switchboard
Fax:    +49 (30) 60 98 540-99
E-Mail: harald.mommer@opensynergy.com

www.opensynergy.com

Handelsregister: Amtsgericht Charlottenburg, HRB 108616B
Geschäftsführer/Managing Director: Regis Adjamah

* Re: MSG_CONFIRM RX messages with SocketCAN known as unreliable under heavy load?
@ 2021-06-29 19:39 Harald Mommer
  2021-06-30  7:27 ` Oliver Hartkopp
From: Harald Mommer @ 2021-06-29 19:39 UTC (permalink / raw)
  To: Marc Kleine-Budde, Oliver Hartkopp; +Cc: linux-can

[Re-sent because some mechanism on the mailing list thought this was 
spam and rejected it.
It looks like the list does not like it when Thunderbird composes an HTML 
e-mail. Setting changed, retrying.]

Hello,

On 25.06.21 at 11:39, Marc Kleine-Budde wrote:
> It makes sense to have a TX done notification. You probably need this
> for proper queue handling and throttling.
Yes. But these acknowledgements must be 100% reliable under all possible 
load conditions, otherwise testers will prove that the solution only 
works when the sun is shining but not during bad weather.
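
To illustrate why reliability matters for throttling, here is a purely hypothetical sketch of TX bookkeeping with a fixed number of in-flight slots. TxWindow and its methods are invented for this illustration and do not reflect the actual virtio-can device or driver.

```python
class TxWindow:
    """Hypothetical bookkeeping for a fixed number of in-flight TX requests."""

    def __init__(self, slots):
        self.slots = slots
        self.pending = set()

    def try_send(self, tag):
        """Reserve a slot for request `tag`; False means the sender must wait."""
        if len(self.pending) >= self.slots:
            return False
        self.pending.add(tag)
        return True

    def confirm(self, tag):
        """Release the slot once the TX-done notification for `tag` arrives."""
        self.pending.discard(tag)


window = TxWindow(slots=2)
assert window.try_send("a") and window.try_send("b")
assert not window.try_send("c")  # throttled: both slots are in flight
window.confirm("a")
assert window.try_send("c")      # a confirmation frees a slot
```

In this model every lost confirmation permanently occupies a slot, so after as many losses as there are slots try_send() never succeeds again. That is the "TX becomes stuck" symptom.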
>
>>> Can you sketch a quick block diagram showing guest, host, Virtio device,
>>> Virtio driver, etc...
>> I hope this arrives on the list as is been sent and not garbled:
>>
>>       Guest 2                    | Guest3
>> ----------------                | ----------------
>> ! cangen,      !                | ! cangen,      !
>> ! candump,     !                | ! candump,     !
>> ! cansend      !                | ! cansend      !
>> ! using vcan0  !                | ! using can0   !
>> ----------------                | ----------------
>>   ^                              |             ^
>>   !  ---------------------       |             !
>>   !  ! Service process   !       |             !
>>   !  ! in user space     !       |             !
> Oliver has already commented on this :) Getting feedback from the
> community early could have saved you some work :)

I still don't get it. This service process is the virtio device itself. 
All our virtio devices are user land processes. There is no problem, 
this works that way.

The problem may be that the virtio device should not have used vcan0 
to get CAN access and should have used something different 
instead. CAN GW? Is that what you have been trying to tell me all along? 
"Do not use vcan0 to exchange CAN messages but use CAN GW"? In that case 
the box "Device Linux / VCAN / vcan0" in the picture changes, but not the 
userland virtio CAN device service process box.

If that is it, I'll look into CAN GW to understand what all this means 
and how to use it.

But anyway, if so, this should not have any impact on the driver or the 
spec. It would be an issue of the device implementation itself, which 
is closed source and should therefore not be that interesting here.

>>   !  ! virtio-can device !       |             !
>>   !  ! forwarding vcan0  !       |             !
>>   !  ---------------------       |             !
>>   !    ^               ^         |             !
>>   !    !               !         |             !
>> --------------------------------------------------
>>   !    !   Device side ! kernel  | Driver side ! kernel
>>   v    v               v         |             v
>> ---------------- -------------- | ----------------
>> ! Device Linux ! ! HV support ! | ! Driver Linux !
>> !    VCan      ! !   module   ! | !  Virtio CAN  !
>> !    vcan0     ! ! on device  ! | !     can0     !
>> !              ! !   side     ! | !              !
>> ---------------- -------------- | ----------------
>>         ^               ^        |        ^
>>         !               !        |        !
>> --------------------------------------------------
>>         !               !                 ! Hypervisor
>>         v               v                 v
>> --------------------------------------------------
>> !                     COQOS-HV                   !
>> --------------------------------------------------
>>
>>
> IC - as I'm not interested in closed source solution I'd focus on the
> qemu use case. Good thing is, the virtio-can must handle both use cases
> anyways.
For me qemu is at the moment an unknown environment to develop for. 
There are already some challenges in this project and at some point 
there are too many challenges. We have to discuss if/how qemu is to be 
addressed.
> Your user space bridge is the wrong solution here.....See Oliver's mail.
The virtio devices are always user land processes in our architecture. 
The only question is what exactly is to be bridged.
>> Nothing which should be done now, getting far too complicated for a 1st shot
>> to implement a Virtio CAN device.
>>
>>> We don't have a feature flag to query if the Linux driver support proper
>>> CAN echo on TX complete notification.
>> Not so nice. But the device integrator should know which backend is used and
>> having a command line option for the device application the issue can be
>> handled. Need the command line switch anyway now to do experiments.
> If needed we can add flags to the CAN drivers so that they are
> introspectable, maybe via the ethtool interface.
I take from this that nothing is set in stone for all time. I did not 
expect that something like this could be possible.
> Marc

Harald


