pch_can probable bug

linux-can.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* pch_can probable bug
@ 2012-12-11 15:43 Christian Bendele
  2012-12-11 16:08 ` Christian Bendele
  2012-12-11 22:05 ` Wolfgang Grandegger
  0 siblings, 2 replies; 6+ messages in thread
From: Christian Bendele @ 2012-12-11 15:43 UTC (permalink / raw)
  To: linux-can

Hi all,

I believe I have found a bug in the pch_can driver. It _might_ also be 
related to what was discussed recently in the thread at
http://thread.gmane.org/gmane.linux.can/2483/focus=2573
however, a problem with transmission was reported there and this bug is not
related to transmission but to receiving.

This is what I observed:

If very high load on the can bus coincides with very high load on the pc,
so that the 26 message fifo buffer configured in the can controller 
actually runs full, then receiving stops completely and the pc gets
very sluggish.

This is what I think is going on:

According to Sections 13.5.9 and 13.5.9.1 of the Intel PCH EG20T
Datasheet setting up a FIFO buffer on the pch can works like this:

"...The EOB bit of all message objects of a FIFO buffer except the last have
to be programmed to 0. The EOB bits of the last message object of a FIFO
buffer is set to 1, configuring it as the End of the Block...."
"When a message is stored into a message object [...] the NEWDAT bit of
that object is set. [...]the message object is locked for further write
accesses [...] until the CPU has written the NEWDAT bit back to 0."
"Messaged are stored into a FIFO buffer until the last message object [...]
is reached. If none of the preceding objects is released by writing NEWDAT
to 0, all further messages [...] are written into the last message object
[...] and therefore overwrite previous messages."

The pch_c_can defines a 26 messages deep FIFO buffer (from message object 1 
to message object 26) for receiving. 

However, In the pch_can_rx_normal() method of the pch_can.c driver, right at
the top of the main loop in which pending messages are read from the
controller, the loop is aborted if the EOB bit is send in the message object
currently considered:

		if (reg & PCH_IF_MCONT_EOB)
			break;

That means that the message(s) stored in this message object are never read
out, and the INTPND (interrupt pending, is set together with NEWDAT when a
message is stored into the object) bit in that message object's control
register is never cleared.
Thus a kind of "endless loop" happens where the same message object (26
= PCH_RX_OBJ_END) is considered again and again but never handled, until the
napi quota is consumed, and then immediately the device interrupt happens
again because the INTPND bit in the message object was never cleared.
The system is completely swamped.

Due to the way the FIFO buffer is handled by the driver (kind of like a 16
deep fifo with a 10 deep overflow, it doesn't matter, though), that
message object is only ever used if the FIFO actually runs completely
full. Thus the problem only manifests if very high loads on the canbus and
on the host happen at the same time. If that never happens the system will
work fine for a long time.

I'm still trying to fix this problem here, but this is the first time I
actually did anything using the napi and the whole driver looks slightly
confusing to me (half the time I don't understand why it does things the
way it does them). Just moving down the offending break towards the bottom
of the loop unfortunately doesn't work (The driver breaks completely). I
would appreciate it if somebody who knows more about this stuff could
try to fix this.

Just for reference, I am usually working with the latest 3.2.x kernel, but I
checked this today and it looks the same in the recent 3.7 kernel. 

Thanks a lot for your great work here,
Christian Bendele

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pch_can probable bug
  2012-12-11 15:43 pch_can probable bug Christian Bendele
@ 2012-12-11 16:08 ` Christian Bendele
  2012-12-11 16:22   ` Wolfgang Grandegger
  2012-12-11 22:05 ` Wolfgang Grandegger
  1 sibling, 1 reply; 6+ messages in thread
From: Christian Bendele @ 2012-12-11 16:08 UTC (permalink / raw)
  To: linux-can

Well, I just found the most recent patch to the generic c_can driver on this
list. I had only checked with Kernel 3.7rcSomething before and it didn't seem to
support the intel eg20t. I failed to check if there was a more recent patch
available and if that supported the intel chip. I'm really sorry for that.

I will try if that patch works for me tomorrow, since unfortunatelly my working
day is almost at its end and I don't have the hardware at home. However, cursery
skimming the code of c_can.c, c_can_do_rx_poll() it seems to me as if it does
the same thing here:

			if (msg_ctrl_save & IF_MCONT_EOB)
				return num_rx_pkts;

Christian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pch_can probable bug
  2012-12-11 16:08 ` Christian Bendele
@ 2012-12-11 16:22   ` Wolfgang Grandegger
  0 siblings, 0 replies; 6+ messages in thread
From: Wolfgang Grandegger @ 2012-12-11 16:22 UTC (permalink / raw)
  To: Christian Bendele; +Cc: linux-can

Hi Christian,

On 12/11/2012 05:08 PM, Christian Bendele wrote:
> Well, I just found the most recent patch to the generic c_can driver on this
> list. I had only checked with Kernel 3.7rcSomething before and it didn't seem to
> support the intel eg20t. I failed to check if there was a more recent patch
> available and if that supported the intel chip. I'm really sorry for that.
> 
> I will try if that patch works for me tomorrow, since unfortunatelly my working
> day is almost at its end and I don't have the hardware at home. However, cursery
> skimming the code of c_can.c, c_can_do_rx_poll() it seems to me as if it does
> the same thing here:
> 
> 
> 			if (msg_ctrl_save & IF_MCONT_EOB)
> 				return num_rx_pkts;

Just a quick answer. We know that the pch_can and also the c_can driver
does have issues with tx and rx. Races, out-of-order-receptions, etc. I
have sent patches for the C_CAN drivers to support the PCH as well. But
still the RX handling needs to be improved to avoid out-of-order
reception at high message input rate. I will tell more later today.

Wolfgang.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pch_can probable bug
  2012-12-11 15:43 pch_can probable bug Christian Bendele
  2012-12-11 16:08 ` Christian Bendele
@ 2012-12-11 22:05 ` Wolfgang Grandegger
  2012-12-17 13:21   ` Christian Bendele
  1 sibling, 1 reply; 6+ messages in thread
From: Wolfgang Grandegger @ 2012-12-11 22:05 UTC (permalink / raw)
  To: Christian Bendele; +Cc: linux-can

On 12/11/2012 04:43 PM, Christian Bendele wrote:
> Hi all,
> 
> I believe I have found a bug in the pch_can driver. It _might_ also be 
> related to what was discussed recently in the thread at
> http://thread.gmane.org/gmane.linux.can/2483/focus=2573
> however, a problem with transmission was reported there and this bug is not
> related to transmission but to receiving.
> 
> This is what I observed:
> 
> If very high load on the can bus coincides with very high load on the pc,
> so that the 26 message fifo buffer configured in the can controller 
> actually runs full, then receiving stops completely and the pc gets
> very sluggish.

OK. Two other people are reporting somehow different problems with tx
but also rx.

> This is what I think is going on:
> 
> According to Sections 13.5.9 and 13.5.9.1 of the Intel PCH EG20T
> Datasheet setting up a FIFO buffer on the pch can works like this:
> 
> "...The EOB bit of all message objects of a FIFO buffer except the last have
> to be programmed to 0. The EOB bits of the last message object of a FIFO
> buffer is set to 1, configuring it as the End of the Block...."
> "When a message is stored into a message object [...] the NEWDAT bit of
> that object is set. [...]the message object is locked for further write
> accesses [...] until the CPU has written the NEWDAT bit back to 0."
> "Messaged are stored into a FIFO buffer until the last message object [...]
> is reached. If none of the preceding objects is released by writing NEWDAT
> to 0, all further messages [...] are written into the last message object
> [...] and therefore overwrite previous messages."

Yes, note that the PCH does use a C_CAN core (even the manual is identical).

> 
> The pch_c_can defines a 26 messages deep FIFO buffer (from message object 1 
> to message object 26) for receiving. 

Please do not invest time fixing the pch_can driver. We want to get rid
of it asap.

> However, In the pch_can_rx_normal() method of the pch_can.c driver, right at
> the top of the main loop in which pending messages are read from the
> controller, the loop is aborted if the EOB bit is send in the message object
> currently considered:
> 
> 		if (reg & PCH_IF_MCONT_EOB)
> 			break;
> 
> That means that the message(s) stored in this message object are never read
> out, and the INTPND (interrupt pending, is set together with NEWDAT when a
> message is stored into the object) bit in that message object's control
> register is never cleared.

Looking to c_can.c, I see this problem as well. The above check should
be done after reading out the message.

> Thus a kind of "endless loop" happens where the same message object (26
> = PCH_RX_OBJ_END) is considered again and again but never handled, until the
> napi quota is consumed, and then immediately the device interrupt happens
> again because the INTPND bit in the message object was never cleared.
> The system is completely swamped.
> 
> Due to the way the FIFO buffer is handled by the driver (kind of like a 16
> deep fifo with a 10 deep overflow, it doesn't matter, though), that
> message object is only ever used if the FIFO actually runs completely
> full. Thus the problem only manifests if very high loads on the canbus and
> on the host happen at the same time. If that never happens the system will
> work fine for a long time.
> 
> I'm still trying to fix this problem here, but this is the first time I
> actually did anything using the napi and the whole driver looks slightly
> confusing to me (half the time I don't understand why it does things the
> way it does them). Just moving down the offending break towards the bottom
> of the loop unfortunately doesn't work (The driver breaks completely). I
> would appreciate it if somebody who knows more about this stuff could
> try to fix this.

Could you please use my "RFC V3" c_can_pci driver patches I sent out an
hour ago. Anyway, avoiding out-of-order reception with a message FIFO is
tricky and we need to fix that part as well. Unfortunately, I do not
have a system for development and testing.

Wolfgang.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pch_can probable bug
  2012-12-11 22:05 ` Wolfgang Grandegger
@ 2012-12-17 13:21   ` Christian Bendele
  2012-12-17 14:50     ` Wolfgang Grandegger
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Bendele @ 2012-12-17 13:21 UTC (permalink / raw)
  To: linux-can

Hi,

Wolfgang Grandegger <wg <at> grandegger.com> writes:

> Yes, note that the PCH does use a C_CAN core (even the manual is identical).

Indeed.

> Please do not invest time fixing the pch_can driver. We want to get rid
> of it asap.

Okay, I have turned my efforts towards the c-can driver now. I've had some
issues with some of the design decisions of the pch driver anyway. Some of those
issues seem to be absent from the c-can driver after a first glance.

> > That means that the message(s) stored in this message object are never read
> > out, and the INTPND (interrupt pending, is set together with NEWDAT when a
> > message is stored into the object) bit in that message object's control
> > register is never cleared.
> 
> Looking to c_can.c, I see this problem as well. The above check should
> be done after reading out the message.

That's what I thought as well, but after a closer look at c_can_do_rx_poll()
method of the c_can driver I disagree. I think the way this driver works the
check for the EOB bit should not be necessary at all. Neither before nor after
reading out the message. The loop will already be exited after that message
object is read out due to "msg_obj <= C_CAN_MSG_OBJ_RX_LAST" in the loop header.

> Could you please use my "RFC V3" c_can_pci driver patches I sent out an
> hour ago. Anyway, avoiding out-of-order reception with a message FIFO is
> tricky and we need to fix that part as well. Unfortunately, I do not
> have a system for development and testing.

I tried your RFC V3 drivers now. As you sent them they work about as well as the
pch driver. The generally seem to have similar bugs as the latter, too. 

I was quite surprised, however, that the c_can driver actually caused (very
slightly) higher CPU load for a given CANBUS load than the pch driver.
I had looked at the code before testing it, and some issues I had with design
decisions of the pch driver (among other things it failed to make proper use of
the message handling registers the c-can controller provides) seemed to be much
better with the c_can driver. I had therefore expected it to cause _lower_ cpu
load than the pch driver and was somewhat disappointed. I'll have to give this a
closer look after the holidays, maybe I can find the reasons why this happens.

I then removed the two lines that check for the EOB bit. That seems to fix all
hangs that I observed on the RX end. It seems to be quite stable now, even under
high load. I'm not really sure if I consider the out-of-order reception problems
important right now. 

The TX side still seems to have hickups (hangs, as observed by others as well).
I'll try to look into that before the holidays, but I make no promises that I'll
find the time.

Christian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pch_can probable bug
  2012-12-17 13:21   ` Christian Bendele
@ 2012-12-17 14:50     ` Wolfgang Grandegger
  0 siblings, 0 replies; 6+ messages in thread
From: Wolfgang Grandegger @ 2012-12-17 14:50 UTC (permalink / raw)
  To: Christian Bendele; +Cc: linux-can

Hi Christian,

On 12/17/2012 02:21 PM, Christian Bendele wrote:
> Hi,
> 
> Wolfgang Grandegger <wg <at> grandegger.com> writes:
> 
>> Yes, note that the PCH does use a C_CAN core (even the manual is identical).
> 
> Indeed.
> 
> 
>> Please do not invest time fixing the pch_can driver. We want to get rid
>> of it asap.
> 
> Okay, I have turned my efforts towards the c-can driver now. I've had some
> issues with some of the design decisions of the pch driver anyway. Some of those
> issues seem to be absent from the c-can driver after a first glance.
> 
> 
>>> That means that the message(s) stored in this message object are never read
>>> out, and the INTPND (interrupt pending, is set together with NEWDAT when a
>>> message is stored into the object) bit in that message object's control
>>> register is never cleared.
>>
>> Looking to c_can.c, I see this problem as well. The above check should
>> be done after reading out the message.
> 
> That's what I thought as well, but after a closer look at c_can_do_rx_poll()
> method of the c_can driver I disagree. I think the way this driver works the
> check for the EOB bit should not be necessary at all. Neither before nor after
> reading out the message. The loop will already be exited after that message
> object is read out due to "msg_obj <= C_CAN_MSG_OBJ_RX_LAST" in the loop header.

I also did not understand what this test is good for.

>> Could you please use my "RFC V3" c_can_pci driver patches I sent out an
>> hour ago. Anyway, avoiding out-of-order reception with a message FIFO is
>> tricky and we need to fix that part as well. Unfortunately, I do not
>> have a system for development and testing.
> 
> I tried your RFC V3 drivers now. As you sent them they work about as well as the
> pch driver. The generally seem to have similar bugs as the latter, too. 
> 
> I was quite surprised, however, that the c_can driver actually caused (very
> slightly) higher CPU load for a given CANBUS load than the pch driver.
> I had looked at the code before testing it, and some issues I had with design
> decisions of the pch driver (among other things it failed to make proper use of
> the message handling registers the c-can controller provides) seemed to be much
> better with the c_can driver. I had therefore expected it to cause _lower_ cpu
> load than the pch driver and was somewhat disappointed. I'll have to give this a
> closer look after the holidays, maybe I can find the reasons why this happens.
> 
> I then removed the two lines that check for the EOB bit. That seems to fix all
> hangs that I observed on the RX end. It seems to be quite stable now, even under
> high load. I'm not really sure if I consider the out-of-order reception problems
> important right now.

OK, we should avoid out-of-order reception, if possible. At least we can
do better.

> The TX side still seems to have hickups (hangs, as observed by others as well).

Hm, with these patches the others do not see TX problems any more. Could
you please be more precise.

> I'll try to look into that before the holidays, but I make no promises that I'll
> find the time.

Thanks,

Wolfgang.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-12-17 14:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-11 15:43 pch_can probable bug Christian Bendele
2012-12-11 16:08 ` Christian Bendele
2012-12-11 16:22   ` Wolfgang Grandegger
2012-12-11 22:05 ` Wolfgang Grandegger
2012-12-17 13:21   ` Christian Bendele
2012-12-17 14:50     ` Wolfgang Grandegger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).