Network TX Stall on 440EP Processor

linuxppc-dev.lists.ozlabs.org archive mirror

* Network TX Stall on 440EP Processor
@ 2017-06-20 21:17 Thomas Besemer
  2017-06-21 10:19 ` Michael Ellerman
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Thomas Besemer @ 2017-06-20 21:17 UTC
  To: linuxppc-dev

I'm working on a project that is derived from the Yosemite
PPC 440EP board.  It's a legacy project that was running the
2.6.24 Kernel, and network traffic was stalling due to transmission
halting without an understandable error (in this error condition, the
various status registers of network interface showed no issues), other
than TX stalling due to Buffer Descriptor Ring becoming full.

In order to see if the problem has been resolved, the Kernel
has been updated to 4.9.13, compiled with gcc version 5.4.0
(Buildroot 2017.02.2).  Although the frequency of the
problem is decreased, it still does show up.

The test case is the Linux Target running idle, no application
code.  From a Linux host on a directly connected network, 30
flood pings are started.  After a period of several minutes to
perhaps hours, the transmit aspect of the network controller
ceases to transmit packets (Buffer Descriptor ring becomes full).
RX still works.  In the 2.6.24 Kernel, the problem happens
within seconds, so it has improved with the new Kernel.
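
For anyone wanting to reproduce the load, here is a minimal sketch of
the host side of the test. It assumes the "flood pings" are plain
`ping -f` processes (which normally need root) and that all 30 run
concurrently; the function names and the target address are
hypothetical and not taken from the original setup.

```python
import subprocess


def flood_ping_commands(target, count=30):
    """Build one `ping -f` argument list per flood ping (30 as in the test)."""
    return [["ping", "-f", target] for _ in range(count)]


def run_flood_pings(target, count=30):
    """Start all flood pings at once and wait until they exit or Ctrl-C."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
        for cmd in flood_ping_commands(target, count)
    ]
    try:
        for proc in procs:
            proc.wait()
    except KeyboardInterrupt:
        # Stop every ping when the operator interrupts the test.
        for proc in procs:
            proc.terminate()
```

Calling `run_flood_pings("192.0.2.1")` on the host (with the target's
real address substituted) would then keep the load running until it is
interrupted.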

Below is the output from the Kernel when this happens.

Has anybody seen this problem before?  I can't find any
errata on it, nor can I find any reports of it.

The original problem shows up while the Embedded Application is
running: after a period of heavy network traffic, the TX side of
the network stalls.  The flood ping test is used simply to force
the problem to happen.

[ 3127.143572] NETDEV WATCHDOG: eth0 (emac): transmit queue 0 timed out
[ 3127.150172] ------------[ cut here ]------------
[ 3127.154778] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x23c/0x244
[ 3127.162965] Modules linked in:
[ 3127.166013] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.13 #9
[ 3127.171707] task: c0e67300 task.stack: c0f00000
[ 3127.176192] NIP: c068e734 LR: c068e734 CTR: c04672f4
[ 3127.181107] REGS: c0f01c90 TRAP: 0700   Not tainted  (4.9.13)
[ 3127.186793] MSR: 00029000 <CE,EE,ME>[ 3127.190241]   CR: 28122222  XER: 00000000
[ 3127.194210]
GPR00: c068e734 c0f01d40 c0e67300 00000038 d1006301 000000df c04683e4 000000df
GPR08: 000000df c0eff4b0 c0eff4b0 00000004 24122424 00b960f0 00000000 c0e80000
GPR16: 000ac8c1 c07b8618 c098bddc c0e69000 0000000a c0ee0000 c0e73f20 c0f00000
GPR24: c100e4e8 c0ee0000 c0e77d60 c3128000 c068e4f8 c0e80000 00000000 c3128000
NIP [c068e734] dev_watchdog+0x23c/0x244
[ 3127.227680] LR [c068e734] dev_watchdog+0x23c/0x244
[ 3127.232427] Call Trace:
[ 3127.234857] [c0f01d40] [c068e734] dev_watchdog+0x23c/0x244 (unreliable)
[ 3127.241447] [c0f01d60] [c00805e8] call_timer_fn+0x40/0x118
[ 3127.246889] [c0f01d80] [c00808e8] expire_timers.isra.13+0xbc/0x114
[ 3127.253032] [c0f01db0] [c0080a94] run_timer_softirq+0x90/0xf0
[ 3127.258753] [c0f01e00] [c07b31b4] __do_softirq+0x114/0x2b0
[ 3127.264202] [c0f01e60] [c002a158] irq_exit+0xe8/0xec
[ 3127.269144] [c0f01e70] [c0008c98] timer_interrupt+0x34/0x4c
[ 3127.274684] [c0f01e80] [c000ec94] ret_from_except+0x0/0x18
[ 3127.280151] --- interrupt: 901 at cpm_idle+0x3c/0x70
[ 3127.280151]     LR = arch_cpu_idle+0x30/0x68
[ 3127.289300] [c0f01f40] [c0f058e4] cpu_idle_force_poll+0x0/0x4 (unreliable)
[ 3127.296146] [c0f01f50] [c00073e4] arch_cpu_idle+0x30/0x68
[ 3127.301509] [c0f01f60] [c005bce8] cpu_startup_entry+0x184/0x1bc
[ 3127.307392] [c0f01fb0] [c0a76a1c] start_kernel+0x3d4/0x3e8
[ 3127.312843] [c0f01ff0] [c00000b4] _start+0xb4/0xf8
[ 3127.317599] Instruction dump:
[ 3127.320557] 811f0284 4bffff78 39200001 7fe3fb78 99281966 4bfd9cd5 7c651b78 3c60c0a1
[ 3127.328359] 7fc6f378 7fe4fb78 3863357c 48125319 <0fe00000> 4bffffb8 7c0802a6 90010004
[ 3127.336327] ---[ end trace c31dfe4772ff0e8f ]---
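
Since the stall can take hours to appear, it may help to watch the
kernel log for the watchdog line automatically. The following is a
minimal sketch that only matches the message format shown in the log
above; the function name and the idea of feeding it captured log text
are assumptions, not part of the original test setup.

```python
import re

# Matches lines such as:
# [ 3127.143572] NETDEV WATCHDOG: eth0 (emac): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<dev>\S+) "
    r"\((?P<drv>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return (timestamp, device, driver, queue) for each watchdog hit."""
    return [
        (float(m["ts"]), m["dev"], m["drv"], int(m["queue"]))
        for m in WATCHDOG_RE.finditer(log_text)
    ]
```

Run against the captured console or dmesg text, this returns one tuple
per TX timeout, which is enough to timestamp when the stall occurred
during a long flood ping run.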

* Re: Network TX Stall on 440EP Processor
  2017-06-20 21:17 Network TX Stall on 440EP Processor Thomas Besemer
@ 2017-06-21 10:19 ` Michael Ellerman
  2017-06-21 21:51   ` Thomas Besemer
  2017-06-21 23:25 ` Benjamin Herrenschmidt
  2017-06-22  8:01 ` Denis Kirjanov
  2 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2017-06-21 10:19 UTC
  To: Thomas Besemer, linuxppc-dev

Hi Thomas,

Thomas Besemer <thomas.besemer@gmail.com> writes:
> I'm working on a project that is derived from the Yosemite
> PPC 440EP board.  It's a legacy project that was running the
> 2.6.24 Kernel, and network traffic was stalling due to transmission
> halting without an understandable error (in this error condition, the
> various status registers of network interface showed no issues), other
> than TX stalling due to Buffer Descriptor Ring becoming full.

I'm not really familiar with these boards, and I'm not a network guy
either, so hopefully someone else will have some ideas :)

This is the EMAC driver you're using, which is old but still used so
shouldn't have completely bit rotted.

I think the "Buffer Descriptor Ring becoming full" indicates the
hardware has stopped sending packets that the kernel has put in the
ring?

So did the driver get the ring handling wrong somehow and the device
thinks the ring is empty but we think it's full?

cheers

* Re: Network TX Stall on 440EP Processor
  2017-06-21 10:19 ` Michael Ellerman
@ 2017-06-21 21:51   ` Thomas Besemer
  0 siblings, 0 replies; 5+ messages in thread
From: Thomas Besemer @ 2017-06-21 21:51 UTC
  To: Michael Ellerman; +Cc: linuxppc-dev

Hi Michael -

>
> Thomas Besemer <thomas.besemer@gmail.com> writes:
> > I'm working on a project that is derived from the Yosemite
> > PPC 440EP board.  It's a legacy project that was running the
> > 2.6.24 Kernel, and network traffic was stalling due to transmission
> > halting without an understandable error (in this error condition, the
> > various status registers of network interface showed no issues), other
> > than TX stalling due to Buffer Descriptor Ring becoming full.
>
> I'm not really familiar with these boards, and I'm not a network guy
> either, so hopefully someone else will have some ideas :)
>
> This is the EMAC driver you're using, which is old but still used so
> shouldn't have completely bit rotted.
>
> I think the "Buffer Descriptor Ring becoming full" indicates the
> hardware has stopped sending packets that the kernel has put in the
> ring?
>
> So did the driver get the ring handling wrong somehow and the device
> thinks the ring is empty but we think it's full?
>

Thanks for the feedback.  I'm continuing to look into it, but I should add
to this discussion that when TX stalls, the Ready bit (bit 0) is set in the
TX Status/Control field of all the Buffer Descriptors.  This is what is
perplexing, as TX is enabled, and all BD's are marked as having
valid data.

I've looked to see if there are PLB errors, but cannot see any, and the
MAL/EMAC registers all seem valid.  It simply appears that it stops
sending data for no reason.
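
To make the observed state concrete, here is a toy model of a TX
descriptor ring, not the EMAC/MAL driver code. It assumes the Ready
flag is mask 0x8000 (bit 0 of a 16-bit field in IBM big-endian bit
numbering), that the driver sets Ready when it queues a frame, and that
the hardware clears it after sending; the class and function names are
hypothetical.

```python
READY = 0x8000  # assumed mask for the Ready bit (bit 0, big-endian numbering)


class TxRing:
    """Toy TX ring: the driver sets READY, the hardware clears it."""

    def __init__(self, size):
        self.ctrl = [0] * size  # status/control word of each descriptor
        self.head = 0           # next slot the driver fills
        self.tail = 0           # next slot the hardware sends

    def queue(self):
        """Driver side: claim a slot, or return False if the ring is full."""
        if self.ctrl[self.head] & READY:
            return False
        self.ctrl[self.head] |= READY
        self.head = (self.head + 1) % len(self.ctrl)
        return True

    def transmit(self):
        """Hardware side: send one descriptor and clear READY."""
        if not self.ctrl[self.tail] & READY:
            return False
        self.ctrl[self.tail] &= ~READY
        self.tail = (self.tail + 1) % len(self.ctrl)
        return True

    def all_ready(self):
        return all(word & READY for word in self.ctrl)


def simulate(size, packets, stall_after):
    """Queue packets; the hardware stops sending after stall_after of them."""
    ring = TxRing(size)
    rejected = 0
    for n in range(packets):
        if not ring.queue():
            rejected += 1
        if n < stall_after:
            ring.transmit()
    return ring, rejected
```

In this model, once the hardware side stops calling `transmit()`, the
ring fills within `size` further packets and every descriptor is left
with Ready set, which matches the state described above: nothing is
wrong with the descriptors themselves, the consumer has simply stopped.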

* Re: Network TX Stall on 440EP Processor
  2017-06-20 21:17 Network TX Stall on 440EP Processor Thomas Besemer
  2017-06-21 10:19 ` Michael Ellerman
@ 2017-06-21 23:25 ` Benjamin Herrenschmidt
  2017-06-22  8:01 ` Denis Kirjanov
  2 siblings, 0 replies; 5+ messages in thread
From: Benjamin Herrenschmidt @ 2017-06-21 23:25 UTC
  To: Thomas Besemer, linuxppc-dev

On Tue, 2017-06-20 at 14:17 -0700, Thomas Besemer wrote:
> I'm working on a project that is derived from the Yosemite
> PPC 440EP board.  It's a legacy project that was running the
> 2.6.24 Kernel, and network traffic was stalling due to transmission
> halting without an understandable error (in this error condition, the various
> status registers of network interface showed no issues), other 
> than TX stalling due to Buffer Descriptor Ring becoming full.

This is my emac driver?  I haven't looked at (or touched) that thing in
eons :-)

Cheers,
Ben.

* Re: Network TX Stall on 440EP Processor
  2017-06-20 21:17 Network TX Stall on 440EP Processor Thomas Besemer
  2017-06-21 10:19 ` Michael Ellerman
  2017-06-21 23:25 ` Benjamin Herrenschmidt
@ 2017-06-22  8:01 ` Denis Kirjanov
  2 siblings, 0 replies; 5+ messages in thread
From: Denis Kirjanov @ 2017-06-22  8:01 UTC
  To: Thomas Besemer; +Cc: linuxppc-dev

On 6/21/17, Thomas Besemer <thomas.besemer@gmail.com> wrote:
> I'm working on a project that is derived from the Yosemite
> PPC 440EP board.  It's a legacy project that was running the
> 2.6.24 Kernel, and network traffic was stalling due to transmission
> halting without an understandable error (in this error condition, the
> various status registers of network interface showed no issues), other
> than TX stalling due to Buffer Descriptor Ring becoming full.
>
> In order to see if the problem has been resolved, the Kernel
> has been updated to 4.9.13, compiled with gcc version 5.4.0
> (Buildroot 2017.02.2).  Although the frequency of the
> problem is decreased, it still does show up.
>
> The test case is the Linux Target running idle, no application
> code.  From a Linux host on a directly connected network, 30
> flood pings are started.  After a period of several minutes to
> perhaps hours, the transmit aspect of the network controller
> ceases to transmit packets (Buffer Descriptor ring becomes full).
> RX still works.  In the 2.6.24 Kernel, the problem happens
> within seconds, so it has improved with the new Kernel.
>
> Below is the output from the Kernel when this happens.
>
> Has anybody seen this problem before?  I can't find any
> errata on it, nor can I find any reports of it.
>
> The original problem is rooted in the Embedded Application
> running, and after a period of time of heavy network
> traffic, the TX side of network stalls.  The flood ping
> test is used simply to force the problem to happen.

The only thing you can do is look carefully at the ring management code.

It looks like calling emac_reset_work is not enough to properly
reset the TX queue on your device.
