* MPC5200B FEC TX packets getting stuck
From: Joey Nelson @ 2012-01-27 20:14 UTC (permalink / raw)
To: linuxppc-dev
In my application, I have a PC connected through TCP to an MPC5200B-based
system. The PC sends a short request, the MPC5200B receives the request
and sends the data back. It does this about 300 times per second.
Normally the exchange happens in just a handful of milliseconds. But randomly,
every 2 to 15 minutes, the MPC5200B sends all but the last packet of the
response, and about 200 ms later the PC sends a delayed ACK, and the
MPC5200B TCP stack figures the packet was lost. It then sends two nearly
identical packets (the IP header Identification and Checksum fields are
incremented). I can also see that RetransSegs in /proc/net/snmp increments
by one for each of these delays.
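For anyone who wants to watch the same counter from a test program, here is a minimal sketch in C of pulling RetransSegs out of /proc/net/snmp. The function names snmp_counter() and read_retrans_segs() are made up for this illustration. The only assumption about the file is its usual layout: one header line of field names followed by one line of values for each protocol.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in the text of /proc/net/snmp. The column of `field`
 * is found in the header line ("Tcp: RtoAlgorithm ... RetransSegs ...") and
 * the same column is read from the value line that follows it.
 * Returns 0 on success, -1 if the protocol or field is not present.
 */
int snmp_counter(const char *text, const char *proto, const char *field,
                 unsigned long long *out)
{
    char prefix[32];
    int plen = snprintf(prefix, sizeof prefix, "%s:", proto);
    if (plen < 0 || (size_t)plen >= sizeof prefix)
        return -1;

    /* Find the first line that starts with "<proto>:" (the header line). */
    const char *hdr = text;
    while (hdr && strncmp(hdr, prefix, (size_t)plen) != 0) {
        hdr = strchr(hdr, '\n');
        if (hdr)
            hdr++;
    }
    if (!hdr)
        return -1;

    /* The value line must follow directly and carry the same prefix. */
    const char *val = strchr(hdr, '\n');
    if (!val || strncmp(val + 1, prefix, (size_t)plen) != 0)
        return -1;

    const char *h = hdr + plen;
    const char *v = val + 1 + plen;
    size_t flen = strlen(field);
    for (;;) {
        h += strspn(h, " ");
        v += strspn(v, " ");
        size_t hl = strcspn(h, " \n");
        size_t vl = strcspn(v, " \n");
        if (hl == 0 || vl == 0)
            return -1;              /* ran off the end of either line */
        if (hl == flen && strncmp(h, field, hl) == 0) {
            *out = strtoull(v, NULL, 10);
            return 0;
        }
        h += hl;
        v += vl;
    }
}

/* Convenience wrapper: read the live file and return Tcp/RetransSegs. */
int read_retrans_segs(unsigned long long *out)
{
    char buf[8192];
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return snmp_counter(buf, "Tcp", "RetransSegs", out);
}
```

Sampling read_retrans_segs() before and after a test run and subtracting the two values gives the number of retransmitted segments in that window.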
My theory is that the packet is getting stuck somewhere in the network stack
(most likely toward the bottom). Then when another packet is sent, the
stuck one gets pushed out.
I've done a test where I have another task on the MPC5200B sending UDP
packets to a different PC every 10ms. This eliminated the long delays, and
seems to support my stuck packet theory.
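The UDP test described above can be approximated with a sketch like the following. The function name send_udp_ticks(), the one-byte payload, and the use of poll() as a millisecond sleep are choices made for this illustration, not details of the original test program.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send `count` one-byte UDP datagrams to ip:port, pausing interval_ms
 * after each one. Returns the number of datagrams handed to the stack,
 * or -1 if the address is invalid or the socket cannot be created.
 */
int send_udp_ticks(const char *ip, uint16_t port, int count, int interval_ms)
{
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    const char payload = 0;
    int sent = 0;
    for (int i = 0; i < count; i++) {
        if (sendto(fd, &payload, 1, 0,
                   (const struct sockaddr *)&dst, sizeof dst) == 1)
            sent++;
        poll(NULL, 0, interval_ms);     /* sleep for interval_ms */
    }
    close(fd);
    return sent;
}
```

Calling send_udp_ticks("192.0.2.10", 5000, 6000, 10) would emit roughly one datagram every 10 ms for about a minute; the address and port are placeholders.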
I'm seeing the same issue with 2.6.23 and 3.1.6.
I'm getting ready to dive into the hairy world of BestComm and FEC, but I
figured I'd see if anyone else has any suggestions before I make my descent.
Has anyone seen this behavior before? Any likely candidates for where the
packet is getting stuck? Any general advice on reference materials? (I've
started on Linux Device Drivers 3rd Ed, BestComm AN2604, and the datasheets.)
Thanks in advance.
Joey Nelson
joey@joescan.com
* Re: MPC5200B FEC TX packets getting stuck
From: Joey Nelson @ 2012-02-02 2:33 UTC (permalink / raw)
To: linuxppc-dev
First, I think the spin_lock() calls in the irq handlers should be
spin_lock_irqsave(), because the same lock is used in multiple irq
handlers. If we get an rx interrupt while the tx interrupt handler holds
the spin lock, this would seem to be a problem. In this case maybe not,
because it is a single-processor system, where spin_locks should compile
to nothing (I haven't verified this), and the rx and tx handlers don't
really touch any common data elements. I haven't tested changing
this, because I'm currently running a long test.
On another front, I put some timestamp tracing into
mpc52xx_fec_start_xmit, and verified that the delay happens after
the packet is added to the BestComm ring buffer. There will be 3
quick calls to the xmit function, but I'll only see 2 packets at the PC
until 200 - 400 ms later, when I'll get another xmit call (for the
retransmit), and then get two duplicate packets at the PC.
Attempting to add timestamping to the TX irq handler has revealed
this to be a Heisenbug of sorts. After the following changes, I
haven't seen any delays in two hours of running. Previously they
occurred every minute or so.
I'll let it run overnight and see if I see any additional delays.
Next I'll remove the timestamp code, and attempt to capture the state
of the ring buffer and BestComm at the point the retransmit packet is
handed off to the driver. The delayed packet has to be somewhere at
that point. It could be in the FEC queue, as I don't think I've seen a
delayed packet larger than 1k.
@@ -382,6 +414,8 @@
         dev_kfree_skb_irq(skb);
     }
     spin_unlock(&priv->lock);
+    js_irq_timestamps[js_irq_idx] = get_tbl();
+    js_irq_idx = (js_irq_idx+1 == TS_COUNT)? 0 : js_irq_idx+1;

     netif_wake_queue(dev);

@@ -409,6 +443,7 @@
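For reading the captured values afterwards, a user-space sketch of the same wrap-around buffer plus a tick-to-microsecond conversion might look like this. It is only an illustration: the struct and function names are invented, TS_COUNT is set to an arbitrary 64, and the 33 MHz timebase is an assumption (a quarter of a 132 MHz bus clock) that should be checked against the actual board.

```c
#include <stdint.h>

#define TS_COUNT 64u            /* assumption: size of the timestamp buffer */
#define TB_HZ    33000000u      /* assumption: timebase = 132 MHz bus / 4 */

struct ts_ring {
    uint32_t stamp[TS_COUNT];   /* lower 32 bits of the timebase */
    unsigned int idx;           /* next slot to write */
};

/* Store one timestamp and advance the index, wrapping like the patch does. */
void ts_ring_record(struct ts_ring *r, uint32_t tbl)
{
    r->stamp[r->idx] = tbl;
    r->idx = (r->idx + 1 == TS_COUNT) ? 0 : r->idx + 1;
}

/*
 * Elapsed ticks between two 32-bit timebase samples. Unsigned subtraction
 * gives the right answer across one wrap of the counter, which at 33 MHz
 * happens roughly every 130 seconds.
 */
uint32_t tb_delta(uint32_t earlier, uint32_t later)
{
    return later - earlier;
}

/* Convert timebase ticks to microseconds (truncating). */
uint32_t tb_ticks_to_us(uint32_t ticks)
{
    return (uint32_t)(((uint64_t)ticks * 1000000u) / TB_HZ);
}
```

With these assumptions a delta of 3795 ticks corresponds to 115 us, and a 400 ms stall would show up as about 13.2 million ticks.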
Joey Nelson
* Re: MPC5200B FEC TX packets getting stuck
From: Joey Nelson @ 2012-02-02 18:43 UTC (permalink / raw)
To: linuxppc-dev
I ran my test overnight with the TX irq time-stamping and finally got
one delayed packet. Based on the time-stamp data the packets are
getting stuck in the FEC TX FIFO.
The calls to the xmit function are spaced at about 150 us intervals
for a larger TCP socket write. The BestComm tx irq is handled within
about 115 us of the xmit call (so for TCP at least there is unlikely to be
more than 1 skb in the ring). The "stuck" packet generates a tx irq
just like normal, which tells me that BestComm has copied it to the
FEC TX FIFO, but for some reason the FEC has decided to just sit on
it. But when BestComm starts adding another packet, the FEC starts to
transmit the "stuck" packet.
This testing has been on kernel 3.1.6 (but I've seen the same problem
on a kernel based on the OSELAS pcm030 2.6.23 kernel). The hardware is
a custom board with 16-bit-wide DDR SDRAM.
CPU: MPC5200B v2.2, Core v1.4 at 396 MHz
Bus 132 MHz, IPB 132 MHz, PCI 33 MHz
Joey Nelson
* Re: MPC5200B FEC TX packets getting stuck
From: Joey Nelson @ 2012-02-08 2:47 UTC (permalink / raw)
To: linuxppc-dev
I've revisited my testing, and found I was wrong about the packets
getting stuck in the FEC queue. The packets are getting stuck in
BestComm.
I generate a timestamp for each skb's ip_hdr->id at the beginning of
mpc52xx_fec_start_xmit, and for each skb retrieved from BestComm.
Typically the packets make it through BestComm in under 200 us. But
sometimes a single packet will sit in BestComm until the next packet
is added. In the case of a delayed ACK this can be around 400 ms.
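The per-packet bookkeeping described above can be sketched in plain C as follows. This is a user-space stand-in, not the code that ran in the driver: the tracer structure, the function names, the 32-slot table, and the tick values in the usage note (which assume a 33 MHz timebase) are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 32u

enum trace_state { TRACE_EMPTY = 0, TRACE_PENDING, TRACE_DONE };

struct pkt_trace {
    enum trace_state state;
    uint16_t ip_id;     /* IP Identification field of the skb */
    uint32_t t_xmit;    /* timebase ticks when the skb entered xmit */
    uint32_t t_done;    /* timebase ticks when the skb came back */
};

struct tracer {
    struct pkt_trace slot[TRACE_SLOTS];
    unsigned int next;  /* slot that the next xmit overwrites */
};

/* Record a packet at the start of the xmit path. */
void trace_xmit(struct tracer *t, uint16_t ip_id, uint32_t now)
{
    struct pkt_trace *p = &t->slot[t->next];
    p->state = TRACE_PENDING;
    p->ip_id = ip_id;
    p->t_xmit = now;
    p->t_done = 0;
    t->next = (t->next + 1) % TRACE_SLOTS;
}

/*
 * Record the completion of a packet and report its latency in ticks.
 * Returns 0 on success, -1 if no pending packet has this IP id.
 */
int trace_done(struct tracer *t, uint16_t ip_id, uint32_t now,
               uint32_t *latency)
{
    for (size_t i = 0; i < TRACE_SLOTS; i++) {
        struct pkt_trace *p = &t->slot[i];
        if (p->state == TRACE_PENDING && p->ip_id == ip_id) {
            p->state = TRACE_DONE;
            p->t_done = now;
            *latency = now - p->t_xmit;
            return 0;
        }
    }
    return -1;
}

/* Count packets that have been pending for more than `limit` ticks. */
unsigned int trace_stuck(const struct tracer *t, uint32_t now, uint32_t limit)
{
    unsigned int n = 0;
    for (size_t i = 0; i < TRACE_SLOTS; i++) {
        const struct pkt_trace *p = &t->slot[i];
        if (p->state == TRACE_PENDING && now - p->t_xmit > limit)
            n++;
    }
    return n;
}
```

At 33 ticks per microsecond, a limit of 6600 ticks corresponds to the 200 us that packets normally need, so calling trace_stuck() with that limit just before queuing the next packet would flag the delayed one.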
When I look at the state of the registers when this condition is
detected (before adding the next bd):
tx_dmatsk->flags = 1
tcr[1] = 0x20c1
Both of these values seem to indicate that the TX task is disabled.
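A small helper for decoding values like these is sketched below. Both masks are assumptions stated from memory rather than quoted from the headers: 0x8000 as the task-enable bit of the 16-bit TCR, and bit 0 of the task flags as the bit that bcom_submit_next_buffer() tests. The macro and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCR_TASK_ENABLE   0x8000u   /* assumption: enable bit is the TCR MSB */
#define FLAGS_ENABLE_TASK 0x1u      /* assumption: LSB of the task flags */

/* True if the enable bit is set in a task control register value. */
bool tcr_task_enabled(uint16_t tcr)
{
    return (tcr & TCR_TASK_ENABLE) != 0;
}

/* True if the flags LSB (the bit tested on buffer submit) is set. */
bool flags_enable_on_submit(uint32_t flags)
{
    return (flags & FLAGS_ENABLE_TASK) != 0;
}
```

Under these assumptions tcr_task_enabled(0x20c1) is false, which matches the reading that the task was not running when the snapshot was taken.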
I'm pretty sure that nothing in the kernel is disabling the task, so
it must be an automatic behavior. I notice that
bcom_submit_next_buffer() checks task->flags and, if the LSB is set,
enables the task. So I'm guessing that is what gets things moving
again when the freeze happens.
My best guess is that this is a race condition between the driver and the
DMA controller. Is there any risk in just calling bcom_enable(tsk),
maybe after a short sleep?
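To make the suspected interleaving concrete, here is a purely hypothetical toy model in C. Nothing in it is taken from the BestComm microcode or the driver; it only assumes that the task disables itself when it believes the ring is empty, and that a submit marks a descriptor ready and enables the task. All names are invented.

```c
#include <stdbool.h>

/* Toy state: one enable bit and a count of ready, untransferred BDs. */
struct toy_dma {
    bool task_enabled;
    int ready_bds;
};

/* Driver side: mark a BD ready, then enable the task. */
void toy_submit(struct toy_dma *d)
{
    d->ready_bds++;
    d->task_enabled = true;
}

/* DMA side, step 1: the task looks at the ring. */
bool toy_task_sees_empty(const struct toy_dma *d)
{
    return d->ready_bds == 0;
}

/*
 * DMA side, step 2: the task acts on what it saw in step 1. If a submit
 * slipped in between the two steps, saw_empty is stale and the task
 * disables itself although a BD is waiting.
 */
void toy_task_act(struct toy_dma *d, bool saw_empty)
{
    if (saw_empty)
        d->task_enabled = false;    /* hypothetical automatic disable */
    else if (d->task_enabled)
        d->ready_bds--;             /* transfer one BD */
}

/* A BD is waiting but nothing will move it. */
bool toy_is_stuck(const struct toy_dma *d)
{
    return d->ready_bds > 0 && !d->task_enabled;
}

/* Stand-in for an explicit re-enable such as bcom_enable(tsk). */
void toy_kick(struct toy_dma *d)
{
    d->task_enabled = true;
}
```

In this model the sequence "task sees empty, driver submits, task acts" leaves one descriptor stuck until either the next submit or an explicit kick re-enables the task, which is the behavior the timestamps suggest. Whether the real hardware has such a window is exactly the open question.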
Joey Nelson