From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wolfgang Grandegger
Subject: Re: [RFC v2 0/7] pch_can/c_can: fix races and add PCH support to c_can
Date: Wed, 05 Dec 2012 18:35:25 +0100
Message-ID: <50BF85DD.6090809@grandegger.com>
References: <1354199987-10350-1-git-send-email-wg@grandegger.com> <2955657.EIGT0HjrVV@ws-stein> <50BF4326.4040507@grandegger.com> <4250988.UdN8LQq6de@ws-stein>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Return-path:
Received: from ngcobalt02.manitu.net ([217.11.48.102]:51481 "EHLO ngcobalt02.manitu.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751996Ab2LERf2 (ORCPT ); Wed, 5 Dec 2012 12:35:28 -0500
In-Reply-To: <4250988.UdN8LQq6de@ws-stein>
Sender: linux-can-owner@vger.kernel.org
List-ID:
To: Alexander Stein
Cc: linux-can@vger.kernel.org, bhupesh.sharma@st.com, tomoya.rohm@gmail.com

On 12/05/2012 03:46 PM, Alexander Stein wrote:
> Hello Wolfgang,
>
> On Wednesday 05 December 2012 13:50:46, Wolfgang Grandegger wrote:
>> Hi Alexander,
>>
>> thanks for testing! Maybe we are dealing with more than one problem.
>> ...
>> A few general questions to understand your hardware and setup:
>>
>> - Is this a multi-processor system (SMP)? If not, you may not run into
>>   the tx-not-working-any-more problem. Have you ever seen it?
>
> This is an Intel E660 single core CPU with HT, so it is an SMP system. I'm
> currently not aware that tx is not working anymore.

OK, your send rate is very low and therefore it's unlikely that you hit
that problem.

>> - Did you see the problems below with the old PCH_CAN driver as well?
>>
>> - Do the problems show up with the still existing PCH_CAN driver
>>   (including the "pch_can: add spinlocks to protect tx objects" patch)?
>
> With the current version of pch_can from Linus' tree and the named patch I
> get at least some messages twice.

OK, sounds better but also not good.

>>> but if I run my heavy CAN load testcase I get errors sometimes.
>>> This test works as follows: I send a CAN message to 2 other CAN nodes
>>> configuring some timings (like burst length or time between each CAN frame)
>>> and they send 250000 messages each containing a counter. This way I can detect
>>> any missing or switched message with a high bus load.
>>> If I use the described software state alone it works, but if I run 'watch
>>> sensors' in a different ssh session, CAN starts to misbehave like missing CAN
>>> frames or switched order. It seems that I2C usage on the PCH influences the
>>> CAN part also:
>>
>> - When your app sends/writes messages, does it check for errno==ENOBUFS?
>
> My test application sends only 1 message each test run to start the other
> nodes. It checks ENOBUFS and returns an error in that case. Though I've never
> seen that.

OK, your TX rate is low.

>
>> - The messages look still ok (not corrupted, I mean)?
>
> The received frames all look good (despite wrong counter sometimes due to
> wrong order or lost frames).
>
>>> Even worse, if I use the following patch to check if PCI writes were
>>> successful, I noticed that some writes (or the consecutive read) don't
>>> succeed. And I also get lots of I2C timeouts waiting for a xfer complete.
>>
>> Be careful, there might be some registers changing their values after
>> writing. Can you show the value read after writing and the register
>> offset? The influence on the I2C bus looks more like an overload or
>> hardware problem. What is your CAN interrupt rate?
>
> I get about 33 interrupts per second on i2c. On a successful run I get 366886
> interrupts for 500000 messages with the c_can driver.

In what time? Is the CAN bus highly loaded?

> Here are some failed writes to the CAN controller.
> [ 50.445695] c_can_pci 0000:02:0c.3: can0: write 0x0 to offset 0x4 failed. got: 0x10
> [ 51.043027] c_can_pci 0000:02:0c.3: can0: write 0xe to offset 0x0 failed. got: 0x0
> [... repeats several times]
> [ 64.046031] c_can_pci 0000:02:0c.3: can0: write 0xe to offset 0x0 failed. got: 0x0
> [ 64.458286] c_can_pci 0000:02:0c.3: can0: write 0x73 to offset 0x24 failed. got: 0xb8
> [ 64.811025] c_can_pci 0000:02:0c.3: can0: write 0xe to offset 0x0 failed. got: 0x0
> and the last one is repeated all the time.

That's weird! Writing 0xe to offset 0x0 re-enables the interrupts at
the end of poll-rx. Disabling the interrupts in the ISR does not show
those symptoms. Strange.

> Sometimes I also get 16 of the following message:
> c_can_pci 0000:02:0c.3: can0: write 0x0 to offset 0x2c failed. got: 0x2000

>> Do you see this problem with the old PCH_CAN driver as well?
>
> With pch_can I get about 254000 interrupts for about 283000 frames.
>
> [ 2422.198378] regs base: f8f5a000
> [ 2449.197911] pch_can_bit_clear: clear bit failed: addr: f8f5a038, reg1: 0x1404, reg2: 0x8, mask 0xa000
> [ 2458.302028] pch_can_bit_set: set bit failed: addr: f8f5a000, reg1: 0x80, reg2: 0x80, mask 0xe

That's the same thing as with the c_can driver when the interrupts get
re-enabled.

> On the second line you can see that the register isn't written at all (or the
> read failed for some reason).

I assume the latter. Could you please retry reading the register until
the correct value shows up? With some timeout, of course.

I checked the PCH ethernet driver and did not find anything special
about accessing registers. Maybe Tomoya has an idea or could point us
to somebody at OKI/LAPIS Semiconductor who could help. Tomoya?

Wolfgang.
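
For reference, a minimal sketch of the retry-with-timeout read suggested
above. This is a userspace simulation, not driver code: struct fake_reg,
fake_write(), fake_read() and write_and_verify() are hypothetical names
made up for illustration, and the "stale reads" behaviour only mimics the
symptom from the logs. In a real driver the register accesses would go
through the driver's own read/write helpers, probably with a short delay
such as udelay() between attempts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a device register that keeps returning an
 * old value for the first few reads after a write. */
struct fake_reg {
	uint32_t stale;       /* value returned while reads are still stale */
	uint32_t latched;     /* value most recently written */
	int stale_reads;      /* number of reads that still return 'stale' */
};

static void fake_write(struct fake_reg *r, uint32_t val)
{
	r->latched = val;
}

static uint32_t fake_read(struct fake_reg *r)
{
	if (r->stale_reads > 0) {
		r->stale_reads--;
		return r->stale;
	}
	return r->latched;
}

/* Write 'val', then re-read until the masked bits match or 'max_tries'
 * reads have been done. Returns the number of reads that were needed,
 * or -1 if the value never showed up (timeout). */
static int write_and_verify(struct fake_reg *r, uint32_t val,
			    uint32_t mask, int max_tries)
{
	assert(max_tries > 0);

	fake_write(r, val);
	for (int i = 1; i <= max_tries; i++) {
		if ((fake_read(r) & mask) == (val & mask))
			return i;
		/* a driver would insert a short delay here */
	}
	return -1;
}
```

Logging the returned read count for each failing offset (for example
0x0 when writing 0xe) would show whether the write is really lost or
whether only the first read-back is wrong.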