From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <45FEFE64.5010206@domain.hid>
Date: Mon, 19 Mar 2007 22:19:32 +0100
From: Wolfgang Grandegger
MIME-Version: 1.0
Subject: Re: [Xenomai-help] RT-Socket-CAN bus error handling (was CAN errors and real-time behaviour (IRQ raise forever and may lock system))
References: <45FDA81F.2080004@domain.hid> <45FE7578.4000306@domain.hid> <45FE8AA2.1030507@domain.hid> <45FEF649.9060205@domain.hid>
In-Reply-To: <45FEF649.9060205@domain.hid>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: Help regarding installation and common use of Xenomai
To: Wolfgang Grandegger
Cc: xenomai@xenomai.org, Jan Kiszka

Wolfgang Grandegger wrote:
> Jan Kiszka wrote:
>> Wolfgang Grandegger wrote:
>>> Sebastian Smolorz wrote:
>>>> Sebastian Smolorz wrote:
>>>>> Hi Jan,
>>>>>
>>>>> Jan Kiszka wrote:
>>>>>> Wolfgang Grandegger wrote:
>>>>>>> you know, on the SJA1000 the bus error interrupt can result in high
>>>>>>> error interrupt rates and even hang the system on slow processors.
>>>>>>> Just unplugging the CAN cable can cause such interrupt flooding.
>>>>>>> This problem popped up again recently and Sebastian proposed:
>>>>>>>> Last summer we had a discussion about the BEI issue on the
>>>>>>>> socketcan-ML. Two additional handling policies popped up:
>>>>>>>> 1. The interface could restart itself after an amount of BEIs, thus
>>>>>>>> taking responsibility from the user application.
>>>>>>>> 2. The BEI could be completely disabled if no one is interested in
>>>>>>>> this type of error frame.
>>>>>>> As 2. is also my preferred solution, I have implemented it. The only
>>>>>>> downside is that you do not see the error counter increasing when
>>>>>>> /proc/rtcan/devices is inspected.
>>>>>>> We also discussed 1., but RT-Socket-CAN does not restart the CAN
>>>>>>> controller on purpose and just stopping it requires user
>>>>>>> intervention.
>>>>>> And if there is someone listening, how is the flooding issue on cable
>>>>>> unplug etc. solved by option 2?
>>>>> Hm, maybe we could implement 1 additionally (but without automatic
>>>>> restart)?
>>>> A more precise suggestion: What about letting BEIs appear until
>>>> passive mode is reached and if the TX error counter doesn't count up
>>>> any more (indication of start-up situation discovered by the SJA1000)
>>>> the driver ceases to read out ECC any further (thanks Stephane for the
>>>> hint). The controller would be still operating but not reporting BEIs
>>>> any more. There has to be some mechanism to let BEIs through after the
>>>> situation has normalized. Maybe the driver could check inside the
>>>> interrupt handler if active mode was reached again after the above
>>>> situation occurred.
>>> Well, this is rather sophisticated and needs some more careful
>>> evaluation. We might also reach the passive level slowly without
>>> flooding. Furthermore, the method should also be applicable for other
>>> controllers.
>>
>> What is the current behaviour of other controllers?
>
> Most do not have such detailed error reporting via bus error interrupts.
> I know just the i82527 reporting bus errors as well.
>
>>> Let's implement 1. and downscaled printk and wait for the users'
>>> reaction, see also my other mail. Then we should bring up this
>>> discussion again on the Socket-CAN-ML to negotiate a common solution.
>>
>> Instead of waiting on some user triggering a (potential) latency mine, I
>> would prefer that we experimentally evaluate the effect. E.g. via an
>> I-pipe tracer dump on a faster and a slower box. I would offer to run
>> some demo code here on our PC104 Phytec boards as well.
>
> I think we should first run the latency test concurrently and if we
> discover high latencies an IPIPE trace helps locating the latency peaks.
>
>> The problem is to define what degree of error-related IRQ load is
>> generally acceptable. We surely can't do this, so we have to document
>> the effect /at least/ and help the users to check it on their own - or
>> we have to avoid it / make it insignificant compared to normal CAN
>> operation (I'm still in favour of this path).
>
> We speak about a pathological situation and therefore I do not share
> your concerns. When there are electrical problems or even the cable is
> not connected, we do have an abnormal mode of operation and CAN related
> real-time is broken anyhow. The bus error messages are then useful for
> analyzing the problem. The effect of the bus error interrupts on non-CAN
> related latencies is another issue but I think it's not that critical
> either (handling a bus error just requires the reading of 2 SJA1000
> registers). But I agree, a more detailed analysis of "bus error
> flooding" would help to understand the impact on the real-time behavior.

And also be aware that heavy CAN traffic can cause similar latencies as
well, and when there is more than one CAN controller, they can accumulate
(as I have observed with my PCAN dongle tests). Here an IRQ service task
or threaded IRQs would help. Maybe this is the right way to go.

Wolfgang.
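
To make policy 1 and the "downscaled printk" idea more concrete, here is a
minimal sketch of how the decision logic could look. It is not taken from
the RT-Socket-CAN driver: the struct, the enum and all function names are
hypothetical, it does not touch any SJA1000 register, and the threshold
and reporting interval are assumed parameters. The interrupt handler would
call bei_limiter_on_bus_error() once per bus error interrupt and act on
the returned value. There is no automatic restart; the counters are only
cleared when the user restarts the controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device state; these names are NOT from RT-Socket-CAN. */
struct bei_limiter {
    uint32_t max_bus_errors; /* assumed threshold before the controller is stopped */
    uint32_t log_every;      /* assumed downscaling: report every Nth bus error */
    uint32_t bus_errors;     /* bus errors counted since the last (re)start */
    bool stopped;            /* true once the threshold has been reached */
};

enum bei_action {
    BEI_IGNORE, /* count silently, no message */
    BEI_REPORT, /* count and emit one downscaled message */
    BEI_STOP    /* threshold reached: stop controller, wait for the user */
};

static void bei_limiter_init(struct bei_limiter *l, uint32_t max_bus_errors,
                             uint32_t log_every)
{
    assert(l != NULL && log_every > 0);
    l->max_bus_errors = max_bus_errors;
    l->log_every = log_every;
    l->bus_errors = 0;
    l->stopped = false;
}

/* Called once per bus error interrupt; returns what the handler should do. */
static enum bei_action bei_limiter_on_bus_error(struct bei_limiter *l)
{
    assert(l != NULL);

    if (l->stopped)
        return BEI_IGNORE; /* controller already stopped, nothing to do */

    l->bus_errors++;

    if (l->bus_errors >= l->max_bus_errors) {
        l->stopped = true;
        return BEI_STOP;
    }

    /* Report the first bus error, then every log_every-th one after it. */
    if ((l->bus_errors - 1) % l->log_every == 0)
        return BEI_REPORT;

    return BEI_IGNORE;
}

/* Called when the user explicitly restarts the controller. */
static void bei_limiter_restart(struct bei_limiter *l)
{
    assert(l != NULL);
    l->bus_errors = 0;
    l->stopped = false;
}
```

With a threshold of 5 and a reporting interval of 2, the first and third
bus errors would be reported, the second and fourth counted silently, and
the fifth would stop the controller until bei_limiter_restart() is called.
Which threshold is reasonable on a slow box is exactly what the latency
tests would have to show.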