public inbox for linux-omap@vger.kernel.org
* Problems in the DaVinci EMAC driver & AM35xx?
@ 2012-06-13 22:38 CF Adad
  2012-06-15 19:28 ` CF Adad
  0 siblings, 1 reply; 6+ messages in thread
From: CF Adad @ 2012-06-13 22:38 UTC (permalink / raw)
  To: linux-omap@vger.kernel.org

All,

I believe I may have uncovered a problem in the DaVinci EMAC driver as it is used on the AM35xx. I suspect this is likely related to the slab crashing I reported here (http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/). Your help and thoughts on this would be greatly appreciated.
*A QUICK BACKSTORY:*

We've been working with AM3517 modules (CM-T3517 and TAM3517) for about 6 months now, and in that time have tried many different versions of OS software. We started with the 2.6.37 kernel supported by TI's PSP and quickly migrated through 3.2, 3.3, 3.4, and as of yesterday, 3.5rc1. In that same time, we've tried several versions of both bootloaders as well. One consistent factor we've noticed is an odd DaVinci EMAC performance issue. Using just basic iperf ("iperf -s" on one side, "iperf -c <IP>" on the other), we saw poor performance whenever we connected AM3517 EMAC <=> AM3517 EMAC, but great performance when connecting the EMAC to another, non-EMAC Ethernet device. Here is a real example capture:

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
{{ DaVinci EMAC server, PC client }}


[  4] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44452
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.1 sec    114 MBytes  94.1 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44459
[  5]  0.0-10.1 sec    114 MBytes  94.0 Mbits/sec
[  4] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44467
[  4]  0.0-10.1 sec    113 MBytes  94.1 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44474
[  5]  0.0-10.1 sec    114 MBytes  94.1 Mbits/sec
[  4] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44481
[  4]  0.0-10.1 sec    114 MBytes  94.1 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44488
[  5]  0.0- 9.8 sec    110 MBytes  94.1 Mbits/sec

{{ same DaVinci EMAC server, still running, just changed to DaVinci EMAC client }}

[  5] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37325
[  5]  0.0-10.4 sec  73.9 MBytes  59.6 Mbits/sec
[  6] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37326
[  6]  0.0-10.8 sec  61.5 MBytes  47.8 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37327
[  5]  0.0-10.3 sec  64.6 MBytes  52.6 Mbits/sec
[  6] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37328
[  6]  0.0-10.0 sec  78.2 MBytes  65.5 Mbits/sec

See how the data rates drop?  In this capture they go from about 94 Mbps with the PC client down to between roughly 48 and 66 Mbps with the EMAC client. Despite that slowdown, we show only 1 dropped packet, and no errors or xruns are reported. We've long been frustrated by this, but not having time to invest in it, we were willing to accept the slower performance for the time being. As we upgraded versions of Linux, we hoped that perhaps it would simply get better. Unfortunately it did not.
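
To put a number on the drop, the capture can be averaged per client address. This is only a rough sketch: the function name iperf_mean_by_client is made up for illustration, and it assumes the iperf server output format shown above (a "connected with <IP>" line, then a summary line ending in "Mbits/sec" that carries the same [ID]).

```shell
#!/bin/sh
# Average the bandwidth an iperf server log reports, grouped by client IP.
# Hypothetical helper; reads the named files, or stdin when none are given.
iperf_mean_by_client() {
    awk '
        /connected with/ {
            # Map the connection ID (the number inside the brackets) to the client IP.
            id = $0; sub(/\].*/, "", id); gsub(/[^0-9]/, "", id)
            for (i = 1; i <= NF; i++) if ($i == "with") peer[id] = $(i + 1)
            next
        }
        /Mbits\/sec/ {
            # Summary line: the rate is the field just before "Mbits/sec".
            id = $0; sub(/\].*/, "", id); gsub(/[^0-9]/, "", id)
            ip = peer[id]
            sum[ip] += $(NF - 1); n[ip]++
        }
        END {
            # One line per client: IP, number of runs, mean rate.
            for (ip in sum) printf "%s %d %.1f\n", ip, n[ip], sum[ip] / n[ip]
        }
    ' "$@" | sort
}
```

Called as "iperf_mean_by_client server.log", it prints one line per client: address, number of runs, mean Mbits/sec. On the capture above that works out to roughly 94.1 for 192.168.2.40 (the PC) and 56.4 for 192.168.2.74 (the other EMAC).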

*NOW FOR THE "BUG"?:*

Several months ago, we started seeing that slab error that is being discussed on this list (http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/). The error was generic enough that we could not put our finger on it, and we could not figure out a way of reliably reproducing it. We immediately assumed the cause was something in our software, and spent the last several months constantly updating our BSP to include the latest bootloaders and Linux sources. However, that never made it go away. It was a very sporadic error. Some days the boards would crash constantly with it, and then mysteriously weeks would pass without incident. Right when we started to think our latest kernel update had killed it, it would come roaring back.

Last week was one of those weeks where the issues came roaring back. Since they coincided with us enabling the EMAC from within u-boot, we turned our attention to the EMAC immediately and started running constant iperf loop tests against it. For a while, it seemed it could create the slab error relatively easily, but then as fast as it came, it disappeared again.

Today I was running a combination test of iperf and the stress utility (http://weather.ou.edu/~apw/projects/stress/), and even with relatively mild stress loads I was seeing other memory allocation crashes being blamed on the EMAC. (For those that have read the other thread, my SMSC LAN911x port has been disabled in the bootloaders and kernel for these tests. So, it cannot play a role here.)  We saw several errors on two different platforms and across both the 3.4rc6 and 3.5rc1 kernels. All the different errors are reported below. 

I'm definitely not a Linux memory management expert, but it looks like the DaVinci EMAC driver, at least as it pertains to the AM35xx platform, may be malfunctioning. I can provide any additional information that you all would find helpful.

Our tests are simple:
For iperf: One side does "iperf -s" (or "iperf -s -D" if running in the background is desired), while the other side runs "iperf -c <IP of server>".  We are looping them in a script like:

1.  #!/bin/sh
2.  set -x
3.  while [ 1 ]
4.  do
5.     iperf -c <IP>
6.     sleep 1
7.  done

For stress:  We're running the command lines you see (such as "stress --cpu 8 --io 4 --vm 1 --vm-bytes 150M --hdd 2 --hdd-bytes 150M --timeout 60s").  We're also running those in a script just like above, with the stress line replacing line #5 obviously.
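
For anyone trying to reproduce this, a variant of the loop that records when each run started and how it exited may make it easier to line runs up against the kernel timestamps. This is a sketch, not what we actually ran: run_logged, LOG and SERVER_IP are names invented here for illustration.

```shell
#!/bin/sh
# Log file for the run history; override by setting LOG in the environment.
LOG=${LOG:-./looptest.log}

# run_logged CMD [ARGS...]
# Run one command, then append a line with its start time, exit status
# and command line to $LOG. Returns the command's own exit status.
run_logged() {
    start=$(date '+%Y-%m-%d %H:%M:%S')
    "$@"
    rc=$?
    printf '%s rc=%d cmd=%s\n' "$start" "$rc" "$*" >> "$LOG"
    return "$rc"
}

# Same loop as the script above, with SERVER_IP as a placeholder:
#   while :; do run_logged iperf -c "$SERVER_IP"; sleep 1; done
# For the stress runs, swap in the stress command line instead:
#   while :; do run_logged stress --cpu 8 --io 4 --vm 1 --vm-bytes 150M \
#       --hdd 2 --hdd-bytes 150M --timeout 60s; sleep 1; done
```

The log is most useful on the machine that does not crash (for example the iperf client side), since a panicking board may never flush its own file.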


THANKS IN ADVANCE!!!


*CRASHES AND DUMPS:*

This one happened on BOTH Linux 3.4rc6 and 3.5rc1. You can see the stress command for this particular error there. We eventually eased that back to:
"stress --cpu 8 --io 4 --vm 1 --vm-bytes 150M --hdd 2 --hdd-bytes 150M --timeout 60s".
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

+ /home/root/stress --cpu 24 --io 10 --vm 9 --vm-bytes 25M --hdd 2 --hdd-bytes 150M --timeout 60
stress: info: [21812] dispatching hogs: 24 cpu, 10 io, 9 vm, 2 hdd
[61010.621826] stress: page allocation failure: order:0, mode:0x120
[61010.628173] Backtrace:
[61010.630767] [<c0011d6c>] (dump_backtrace+0x0/0x114) from [<c039cb8c>] (dump_stack+0x20/0x24)
[61010.639648]  r6:cd9fc000 r5:00000000 r4:00000120 r3:c052fed8
[61010.645599] [<c039cb6c>] (dump_stack+0x0/0x24) from [<c009c304>] (warn_alloc_failed+0xd8/0x11c)
[61010.654754] [<c009c22c>] (warn_alloc_failed+0x0/0x11c) from [<c009e918>] (__alloc_pages_nodemask+0x508/0x678)
[61010.665161]  r3:cd9fdcf4 r2:00000000
[61010.668914]  r7:c051ef60 r6:00000000 r5:00000000 r4:00000120
[61010.674896] [<c009e410>] (__alloc_pages_nodemask+0x0/0x678) from [<c0309458>] (netdev_alloc_frag+0xa4/0xdc)
[61010.685119] [<c03093b4>] (netdev_alloc_frag+0x0/0xdc) from [<c030a428>] (__netdev_alloc_skb+0x78/0xd0)
[61010.694885]  r7:00000000 r6:7b4c2a43 r5:cfa50000 r4:00000700
[61010.700866] [<c030a3b0>] (__netdev_alloc_skb+0x0/0xd0) from [<c02a0770>] (emac_rx_alloc+0x28/0x64)
[61010.710266]  r6:7b4c2a43 r5:cd95e300 r4:cfa50000 r3:cfa50440
[61010.716217] [<c02a0748>] (emac_rx_alloc+0x0/0x64) from [<c02a1670>] (emac_rx_handler+0x74/0x11c)
[61010.725463]  r4:cfa50000 r3:059e8859
[61010.729217] [<c02a15fc>] (emac_rx_handler+0x0/0x11c) from [<c02a2968>] (__cpdma_chan_free+0xc8/0xe0)
[61010.738830]  r6:cfa6f5c0 r5:60000113 r4:cfa5bac0
[61010.743682] [<c02a28a0>] (__cpdma_chan_free+0x0/0xe0) from [<c02a2a4c>] (__cpdma_chan_process+0xcc/0x104)
[61010.753723] [<c02a2980>] (__cpdma_chan_process+0x0/0x104) from [<c02a35b8>] (cpdma_chan_process+0x4c/0x64)
[61010.763885]  r7:00000040 r6:00000040 r5:cfa6f5c0 r4:00000000
[61010.769836] [<c02a356c>] (cpdma_chan_process+0x0/0x64) from [<c02a1b44>] (emac_poll+0x9c/0x208)
[61010.778961]  r6:00000001 r5:00000001 r4:cfa5044c r3:00000001
[61010.784912] [<c02a1aa8>] (emac_poll+0x0/0x208) from [<c0314340>] (net_rx_action+0xb0/0x1a8)
[61010.793701]  r8:c0531130 r7:0000012c r6:00000040 r5:00000001 r4:cfa5044c
[61010.793701] r3:c02a1aa8
[61010.803314] [<c0314290>] (net_rx_action+0x0/0x1a8) from [<c0035778>] (__do_softirq+0xb0/0x1d8)
[61010.812347] [<c00356c8>] (__do_softirq+0x0/0x1d8) from [<c0035c7c>] (irq_exit+0x8c/0x94)
[61010.820861] [<c0035bf0>] (irq_exit+0x0/0x94) from [<c000f010>] (handle_IRQ+0x44/0x94)
[61010.829101]  r4:c05499b8 r3:c00764a4
[61010.832855] [<c000efcc>] (handle_IRQ+0x0/0x94) from [<c00085b4>] (omap3_intc_handle_irq+0x68/0x78)
[61010.842285]  r6:c055cbbc r5:cd9fdfb0 r4:fa200000 r3:00000044
[61010.848236] [<c000854c>] (omap3_intc_handle_irq+0x0/0x78) from [<c000e480>] (__irq_usr+0x40/0x60)
[61010.857543] Exception stack(0xcd9fdfb0 to 0xcd9fdff8)
[61010.862854] dfa0:                                     b6e474c0 00000000 00000001 00000000
[61010.871459] dfc0: 00000000 00000000 0000e538 0000e530 0000000e 0000e478 00000014 00000000
[61010.880035] dfe0: 0000e4a8 befbbb58 00008adc b6d4d71c 60000010 ffffffff
[61010.886993]  r7:0000e530 r6:ffffffff r5:60000010 r4:b6d4d71c
[61010.892913] Mem-info:
[61010.895324] Normal per-cpu:
[61010.898254] CPU    0: hi:   90, btch:  15 usd:  10
[61010.903289] active_anon:57090 inactive_anon:60 isolated_anon:0
[61010.903289]  active_file:853 inactive_file:324 isolated_file:0
[61010.903289]  unevictable:0 dirty:0 writeback:0 unstable:0
[61010.903289]  free:2251 slab_reclaimable:679 slab_unreclaimable:692
[61010.903289]  mapped:245 shmem:77 pagetables:350 bounce:0
[61010.933258] Normal free:9004kB min:2028kB low:2532kB high:3040kB active_anon:228360kB inactive_anon:240kB active_file:3412kB inacto
[61010.974212] lowmem_reserve[]: 0 0 0
[61010.977905] Normal: 1589*4kB 331*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9004kB
[61010.989044] 1254 total pagecache pages
[61011.003723] 65536 pages of RAM
[61011.006927] 2574 free pages
[61011.009857] 2234 reserved pages
[61011.013153] 1371 slab pages
[61011.016082] 4367 pages shared
[61011.019195] 0 pages swap cached
[61011.022613] ------------[ cut here ]------------
[61011.027557] WARNING: at drivers/net/ethernet/ti/davinci_emac.c:997 emac_rx_alloc+0x5c/0x64()
[61011.036437] Modules linked in:
[61011.039642] Backtrace:
[61011.042236] [<c0011d6c>] (dump_backtrace+0x0/0x114) from [<c039cb8c>] (dump_stack+0x20/0x24)
[61011.051147]  r6:c02a07a4 r5:00000009 r4:00000000 r3:c052fed8
[61011.057098] [<c039cb6c>] (dump_stack+0x0/0x24) from [<c002eb3c>] (warn_slowpath_common+0x5c/0x74)
[61011.066467] [<c002eae0>] (warn_slowpath_common+0x0/0x74) from [<c002eb80>] (warn_slowpath_null+0x2c/0x34)
[61011.076538]  r8:0000003c r7:00000000 r6:7b4c2a43 r5:cd95e300 r4:00000000
[61011.076538] r3:00000009
[61011.086151] [<c002eb54>] (warn_slowpath_null+0x0/0x34) from [<c02a07a4>] (emac_rx_alloc+0x5c/0x64)
[61011.095611] [<c02a0748>] (emac_rx_alloc+0x0/0x64) from [<c02a1670>] (emac_rx_handler+0x74/0x11c)
[61011.104858]  r4:cfa50000 r3:059e8859
[61011.108642] [<c02a15fc>] (emac_rx_handler+0x0/0x11c) from [<c02a2968>] (__cpdma_chan_free+0xc8/0xe0)
[61011.118255]  r6:cfa6f5c0 r5:60000113 r4:cfa5bac0
[61011.123138] [<c02a28a0>] (__cpdma_chan_free+0x0/0xe0) from [<c02a2a4c>] (__cpdma_chan_process+0xcc/0x104)
[61011.133239] [<c02a2980>] (__cpdma_chan_process+0x0/0x104) from [<c02a35b8>] (cpdma_chan_process+0x4c/0x64)
[61011.143402]  r7:00000040 r6:00000040 r5:cfa6f5c0 r4:00000000
[61011.149383] [<c02a356c>] (cpdma_chan_process+0x0/0x64) from [<c02a1b44>] (emac_poll+0x9c/0x208)
[61011.158538]  r6:00000001 r5:00000001 r4:cfa5044c r3:00000001
[61011.164520] [<c02a1aa8>] (emac_poll+0x0/0x208) from [<c0314340>] (net_rx_action+0xb0/0x1a8)
[61011.173339]  r8:c0531130 r7:0000012c r6:00000040 r5:00000001 r4:cfa5044c
[61011.173339] r3:c02a1aa8
[61011.182952] [<c0314290>] (net_rx_action+0x0/0x1a8) from [<c0035778>] (__do_softirq+0xb0/0x1d8)
[61011.192016] [<c00356c8>] (__do_softirq+0x0/0x1d8) from [<c0035c7c>] (irq_exit+0x8c/0x94)
[61011.200561] [<c0035bf0>] (irq_exit+0x0/0x94) from [<c000f010>] (handle_IRQ+0x44/0x94)
[61011.208801]  r4:c05499b8 r3:c00764a4
[61011.212585] [<c000efcc>] (handle_IRQ+0x0/0x94) from [<c00085b4>] (omap3_intc_handle_irq+0x68/0x78)
[61011.222015]  r6:c055cbbc r5:cd9fdfb0 r4:fa200000 r3:00000044
[61011.227996] [<c000854c>] (omap3_intc_handle_irq+0x0/0x78) from [<c000e480>] (__irq_usr+0x40/0x60)
[61011.237335] Exception stack(0xcd9fdfb0 to 0xcd9fdff8)
[61011.242645] dfa0:                                     b6e474c0 00000000 00000001 00000000
[61011.251281] dfc0: 00000000 00000000 0000e538 0000e530 0000000e 0000e478 00000014 00000000
[61011.259887] dfe0: 0000e4a8 befbbb58 00008adc b6d4d71c 60000010 ffffffff
[61011.266845]  r7:0000e530 r6:ffffffff r5:60000010 r4:b6d4d71c
[61011.272827] ---[ end trace c901cd47c92f77fb ]---
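
A note on reading the Mem-info block in that dump: the "Normal:" line lists free blocks per size, and the entries should sum to the total after the "=" sign. Here 1589*4kB + 331*8kB = 6356kB + 2648kB = 9004kB, which also matches "free:2251" pages at 4kB each, and every free block is 8kB or smaller. The small helper below just redoes that sum for any such line; buddy_free_kb is a made-up name, and it assumes the "<count>*<size>kB" layout seen in this log.

```shell
#!/bin/sh
# Sum the "<count>*<size>kB" entries of a per-zone free list line from a
# kernel Mem-info dump. Hypothetical helper; reads files or stdin.
buddy_free_kb() {
    awk '{
        total = 0; found = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\*[0-9]+kB$/) {
                star = index($i, "*")
                count = substr($i, 1, star - 1)
                size_kb = substr($i, star + 1) + 0   # "4kB" converts to 4
                total += count * size_kb
                found = 1
            }
        }
        # Only report lines that actually contained a free list.
        if (found) print total
    }' "$@"
}
```

Piping the saved console log through it, for example "grep 'Normal:' console.log | buddy_free_kb", gives one total in kB per dump.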

Then on our Development Platform (Technexion Twister board in this case), running the same 3.4rc6, this happened IMMEDIATELY after rebooting:
---------------------------------------------------------------------------------------------------------------------------------------------------------

[   40.027679] Unable to handle kernel NULL pointer dereference at virtual addc
[   40.036285] pgd = ce568000
[   40.039154] [0000000c] *pgd=8e4b9831, *pte=00000000, *ppte=00000000
[   40.045776] Internal error: Oops: 17 [#1] ARM
[   40.050354] Modules linked in:
[   40.053588] CPU: 0    Not tainted  (3.4.0-rc6 #4)
[   40.058532] PC is at tcp_rcv_established+0x34/0x65c
[   40.063690] LR is at tcp_v4_do_rcv+0xbc/0x1bc
[   40.068267] pc : [<c032b4d8>]    lr : [<c0331ee0>]    psr: 600f0013
[   40.068267] sp : cfb71c98  ip : cfb71cc8  fp : cfb71cc4
[   40.080322] r10: 00000001  r9 : 00002000  r8 : c052b256
[   40.085815] r7 : 00000000  r6 : ce62d0c0  r5 : 00000000  r4 : ce4b2500
[   40.092681] r3 : 00000000  r2 : 0000003a  r1 : c9011080  r0 : ce4b2500
[   40.099548] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   40.107025] Control: 10c5387d  Table: 8e568019  DAC: 00000015
[   40.113067] Process iperf (pid: 807, stack limit = 0xcfb702f0)
[   40.119201] Stack: (0xcfb71c98 to 0xcfb72000)
[   40.123779] 1c80:                                                       ce62dc00 000005c8
[   40.132385] 1ca0: c052b256 ce62d0c0 ce4b2500 00000000 00000000 c052b256 cfb71cfc cfb71cc8
[   40.140991] 1cc0: c0331ee0 c032b4b0 cfb71cec cfb71cd8 c0055a20 c037d5a4 00000000 c037d9a0
[   40.149597] 1ce0: ce62d0c0 ce4b2500 00000000 00000000 cfb71d24 cfb71d00 c02e2ba8 c0331e30
[   40.158172] 1d00: cfb2c0c0 000003b8 000003b8 ce4b281c ce4b2500 00000000 cfb71d84 cfb71d28
[   40.166778] 1d20: c0322244 c02e2b3c c028668c c02871f8 00000001 c0026964 cfb71d54 00000001
[   40.175384] 1d40: cfb71ecc ce42f3c0 00000000 00000000 c028565c 7fffffff 04000000 c05280d8
[   40.183990] 1d60: cfb71ecc 00000000 cf5e6200 00000000 cfb71ecc 00002000 cfb71dbc cfb71d88
[   40.192596] 1d80: c03401cc c0321fd0 00000000 00000000 cfb71d9c c0285718 00000001 00000000
[   40.201171] 1da0: cfb71dec 00000000 00000000 00000000 cfb71eb4 cfb71dc0 c02df95c c034018c
[   40.209777] 1dc0: 00000000 cfb71dd0 c0034a8c c0034128 00000000 00002000 cf5e6200 0000000a
[   40.218383] 1de0: 00000000 cfb71ecc c007602c cfb70000 00000044 c0034f18 c0503338 cfb70000
[   40.226989] 1e00: 00000044 00000000 cfb71e2c cfb71e18 00000000 00000001 ffffffff 00000000
[   40.235595] 1e20: 00000000 00000000 00000000 00000000 ce42f3c0 fa200000 00000000 00000000
[   40.244171] 1e40: cfb71e6c cfb71e50 cfb71dc8 c000f098 c02e1780 a0000013 ffffffff cfb71ea4
[   40.252777] 1e60: cfb71f8c cfb71e70 c000e3c0 c0008564 cf5e6200 c02e17f8 cfb71ec4 cfae30c0
[   40.261383] 1e80: 00002000 c02dfb04 cfb71ebc 00002000 cf5e6200 00000000 cfb71ee8 00000000
[   40.269989] 1ea0: cfb70000 0001c470 cfb71f8c cfb71eb8 c02e17b4 c02df8b0 cfb71f0c fffffff7
[   40.278594] 1ec0: 00000001 0001e470 00000000 cfb71ee8 00000080 cfb71ec4 00000001 00000000
[   40.287170] 1ee0: 00000000 cfb70000 00000044 c0034f18 c0503338 cfb70000 00000044 00000000
[   40.295776] 1f00: cfb71f24 cfb71f10 c0034f18 c0079244 c007603c c051ab2c cfb71f44 cfb71f28
[   40.304382] 1f20: c000f0d0 c0034ea0 00000044 fa200000 cfb71f68 c052d560 cfb71f64 cfb71f48
[   40.312988] 1f40: c00085cc c000f098 c000e938 80000013 ffffffff cfb71f9c 00000000 cfb71f68
[   40.321594] 1f60: c000e3c0 c0008564 0001c458 0001e478 0001c470 00000123 c000e9c4 00000000
[   40.330169] 1f80: cfb71fa4 cfb71f90 c02e182c c02e1728 00000000 00000000 00000000 cfb71fa8
[   40.338775] 1fa0: c000e780 c02e1810 0001c458 0001e478 00000000 0001c470 00002000 00000000
[   40.347381] 1fc0: 0001c458 0001e478 0001c470 00000123 00fea9f8 00000000 00000fb8 b64d6f9c
[   40.355987] 1fe0: 00000000 b64d6da0 b6e85fe4 b6e85ff4 80000010 00000000 00000000 00000000
[   40.364562] Backtrace:
[   40.367156] [<c032b4a4>] (tcp_rcv_established+0x0/0x65c) from [<c0331ee0>] (tcp_v4_do_rcv+0xbc/0x1bc)
[   40.376861]  r8:c052b256 r7:00000000 r6:00000000 r5:ce4b2500 r4:ce62d0c0
[   40.383941] [<c0331e24>] (tcp_v4_do_rcv+0x0/0x1bc) from [<c02e2ba8>] (release_sock+0x78/0xe0)
[   40.392883]  r7:00000000 r6:00000000 r5:ce4b2500 r4:ce62d0c0
[   40.398864] [<c02e2b30>] (release_sock+0x0/0xe0) from [<c0322244>] (tcp_recvmsg+0x280/0x864)
[   40.407745] [<c0321fc4>] (tcp_recvmsg+0x0/0x864) from [<c03401cc>] (inet_recvmsg+0x4c/0x60)
[   40.416564] [<c0340180>] (inet_recvmsg+0x0/0x60) from [<c02df95c>] (sock_recvmsg+0xb8/0xd8)
[   40.425323]  r6:00000000 r5:00000000 r4:00000000
[   40.430206] [<c02df8a4>] (sock_recvmsg+0x0/0xd8) from [<c02e17b4>] (sys_recvfrom+0x98/0xe8)
[   40.438995] [<c02e171c>] (sys_recvfrom+0x0/0xe8) from [<c02e182c>] (sys_recv+0x28/0x30)
[   40.447418] [<c02e1804>] (sys_recv+0x0/0x30) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[   40.456115] Code: e5901314 e1a04000 e7c0201f e5c023ac (e595200c)
[   40.462615] ---[ end trace b41eca898abf2dd4 ]---
[   40.467468] Kernel panic - not syncing: Fatal exception in interrupt


After a power-cycle reboot, as soon as I re-enabled the iperf server, which had a client already waiting to run, it threw the dreaded SLAB error, and faulted the EMAC for it:
---------------------------------------------------------------------------------------------------------------------------------------------------------

$ iperf -s -D
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
Running Iperf Server as a daemon
The Iperf daemon process ID : 814
root@sc-hd1u-tam3517:~# [   36.567047] ------------[ cut here ]------------
[   36.571929] kernel BUG at mm/slab.c:3175!
[   36.576141] Internal error: Oops - BUG: 0 [#1] ARM
[   36.581176] Modules linked in:
[   36.584381] CPU: 0    Not tainted  (3.4.0-rc6 #4)
[   36.589355] PC is at cache_alloc_refill+0x13c/0x534
[   36.594482] LR is at kmem_cache_alloc+0x10c/0x11c
[   36.599426] pc : [<c037a610>]    lr : [<c00c3b84>]    psr: 60000093
[   36.599456] sp : ce945a60  ip : 0000003e  fp : ce945aa4
[   36.611511] r10: 00000014  r9 : cf80d7c0  r8 : 0000000a
[   36.617004] r7 : 00100100  r6 : cf813480  r5 : 00200200  r4 : cf829400
[   36.623870] r3 : cfb0e040  r2 : 00000001  r1 : 00000014  r0 : 00000014
[   36.630706] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   36.638305] Control: 10c5387d  Table: 8e910019  DAC: 00000015
[   36.644348] Process iperf (pid: 815, stack limit = 0xce9442f0)
[   36.650482] Stack: (0xce945a60 to 0xce946000)
[   36.655059] 5a60: cf813490 00000020 00000020 00000000 00000020 cf813488 c03141f0 00000020
[   36.663665] 5a80: 60000013 cf80d7c0 00000020 c02e7bac c052b080 cf80d7c0 ce945ad4 ce945aa8
[   36.672271] 5aa0: c00c3b84 c037a4e0 cf9ee000 00000008 cf9ee000 00000020 cf80d7c0 00000634
[   36.680877] 5ac0: c02e812c 00000000 ce945afc ce945ad8 c02e7bac c00c3a84 cf9ee000 cfb33e40
[   36.689453] 5ae0: 039586f6 00000000 000005ea cfb33e40 ce945b14 ce945b00 c02e812c c02e7b7c
[   36.698059] 5b00: cf9ee480 cf9ee000 ce945b2c ce945b18 c02843d4 c02e810c 00009b52 cf9ee000
[   36.706665] 5b20: ce945b54 ce945b30 c02852d4 c02843b8 c0017b30 c02e8828 cfa2c940 cfa2c940
[   36.715270] 5b40: 60000013 cfa31440 ce945b7c ce945b58 c02865b4 c028526c cfa2c940 00000000
[   36.723876] 5b60: d26d0fc0 d08d0660 cf9ee000 ce944000 ce945b9c ce945b80 c028668c c0286510
[   36.732452] 5b80: 00000023 cfa31440 00000040 00000040 ce945bbc ce945ba0 c02871f8 c02865cc
[   36.741058] 5ba0: 00000001 cf9ee48c 00010001 00000001 ce945be4 ce945bc0 c02857a8 c02871b8
[   36.749664] 5bc0: c028570c cf9ee48c 00000001 00000040 0000012c c0501978 ce945c1c ce945be8
[   36.758270] 5be0: c02f1dbc c0285718 ce945be8 ffff7c46 ffffffff 00000001 00000003 0000000c
[   36.766876] 5c00: c053e390 c053e38c 00000b50 ce944000 ce945c64 ce945c20 c0034a1c c02f1d18
[   36.775451] 5c20: cfb0e180 ce8d5084 17762052 0000000a c053e380 00000101 ce945c68 60000013
[   36.784057] 5c40: ce929864 ce894d40 00000020 c052b256 00000b50 00000001 ce945c7c ce945c68
[   36.792663] 5c60: c0034ca0 c0034978 0000000a ce944000 ce945c94 ce945c80 c0034e1c c0034c54
[   36.801269] 5c80: ce9c9500 ce929864 ce945cc4 ce945c98 c032b9bc c0034d78 ce87e2c0 000005c8
[   36.809875] 5ca0: c053e380 ce894d40 ce9c9500 cea3c3c0 00000000 c052b256 ce945cfc ce945cc8
[   36.818481] 5cc0: c0331ee0 c032b4b0 ce945cec ce945cd8 c0034ca0 c0034978 00000008 ce944000
[   36.827056] 5ce0: ce894d40 ce9c9500 cea3c3c0 00000000 ce945d24 ce945d00 c02e2ba8 c0331e30
[   36.835662] 5d00: ce944000 cea66440 ce9c953c ce9c981c ce9c9500 00000000 ce945d84 ce945d28
[   36.844268] 5d20: c0322300 c02e2b3c c0026964 c0017fe0 ce945d54 ce945d40 c028565c 00000001
[   36.852874] 5d40: ce945ecc cea66440 00000000 000014b0 c0285794 7fffffff c028570c c05280d8
[   36.861480] 5d60: ce945ecc 00000000 cf5e36c0 00000000 ce945ecc 00002000 ce945dbc ce945d88
[   36.870056] 5d80: c03401cc c0321fd0 00000000 00000000 ce945d9c c0034a8c 00000100 00000000
[   36.878662] 5da0: 00000003 00000000 00000000 00000000 ce945eb4 ce945dc0 c02df95c c034018c
[   36.887268] 5dc0: 00000000 0000000a c053e380 00000100 00000000 00002000 cf5e36c0 c0034f18
[   36.895874] 5de0: 00000000 ce945ecc 00000044 00000000 ce945e14 ce945e00 c0034f18 c0079244
[   36.904449] 5e00: c007603c c051ab2c ce945e34 ce945e18 00000000 00000001 ffffffff 00000000
[   36.913055] 5e20: 00000000 00000000 00000000 00000000 cea66440 c000f098 00000000 00000000
[   36.921661] 5e40: ffffffff ce945e8c ce945dc8 ce945e58 c000e3c0 c0008564 cfba7ac0 c03ad080
[   36.930267] 5e60: c03ad080 cfba7ac0 ce945ebc ce945ec0 00000000 c02e17f8 00000000 ce944000
[   36.938873] 5e80: 0001c470 c02dfb04 ce945ebc 00002000 cf5e36c0 00000000 ce945ee8 00000000
[   36.947448] 5ea0: ce944000 0001c470 ce945f8c ce945eb8 c02e17b4 c02df8b0 c0034a8c fffffff7
[   36.956054] 5ec0: 00000001 0001e0b8 000003b8 ce945ee8 00000080 ce945ec4 00000001 00000000
[   36.964660] 5ee0: 00000000 c0034f18 c0503338 ce944000 00000044 00000000 ce945f1c ce945f08
[   36.973266] 5f00: c0034f18 c0079244 c007603c c051ab2c ce945f3c ce945f20 c000f0d0 c0034ea0
[   36.981872] 5f20: 00000044 fa200000 ce945f60 c052d560 ce945f5c ce945f40 c00085cc c000f098
[   36.990447] 5f40: c000e780 20000013 ffffffff ce945f94 00000000 ce945f60 c000e3c0 c0008564
[   36.999053] 5f60: 00002000 00000000 0001c458 0001e478 0001c470 00000123 c000e9c4 00000000
[   37.007659] 5f80: ce945fa4 ce945f90 c02e182c c02e1728 00000000 00000000 00000000 ce945fa8
[   37.016265] 5fa0: c000e780 c02e1810 0001c458 0001e478 00000000 0001c470 00002000 00000000
[   37.024871] 5fc0: 0001c458 0001e478 0001c470 00000123 036b4730 00000000 000010f8 b6472f9c
[   37.033447] 5fe0: 00000000 b6472da0 b6e21fe4 b6e21ff4 80000010 00000000 18631b25 4609c480
[   37.042053] Backtrace:
[   37.044647] [<c037a4d4>] (cache_alloc_refill+0x0/0x534) from [<c00c3b84>] (kmem_cache_alloc+0x10c/0x11c)
[   37.054626] [<c00c3a78>] (kmem_cache_alloc+0x0/0x11c) from [<c02e7bac>] (__alloc_skb+0x3c/0xfc)
[   37.063781] [<c02e7b70>] (__alloc_skb+0x0/0xfc) from [<c02e812c>] (__netdev_alloc_skb+0x2c/0x54)
[   37.073059] [<c02e8100>] (__netdev_alloc_skb+0x0/0x54) from [<c02843d4>] (emac_rx_alloc+0x28/0x64)
[   37.082458]  r4:cf9ee000 r3:cf9ee480
[   37.086242] [<c02843ac>] (emac_rx_alloc+0x0/0x64) from [<c02852d4>] (emac_rx_handler+0x74/0x11c)
[   37.095489]  r4:cf9ee000 r3:00009b52
[   37.099273] [<c0285260>] (emac_rx_handler+0x0/0x11c) from [<c02865b4>] (__cpdma_chan_free+0xb0/0xbc)
[   37.108856]  r6:cfa31440 r5:60000013 r4:cfa2c940
[   37.113739] [<c0286504>] (__cpdma_chan_free+0x0/0xbc) from [<c028668c>] (__cpdma_chan_process+0xcc/0x104)
[   37.123809] [<c02865c0>] (__cpdma_chan_process+0x0/0x104) from [<c02871f8>] (cpdma_chan_process+0x4c/0x64)
[   37.133941]  r7:00000040 r6:00000040 r5:cfa31440 r4:00000023
[   37.139923] [<c02871ac>] (cpdma_chan_process+0x0/0x64) from [<c02857a8>] (emac_poll+0x9c/0x208)
[   37.149078]  r6:00000001 r5:00010001 r4:cf9ee48c r3:00000001
[   37.155059] [<c028570c>] (emac_poll+0x0/0x208) from [<c02f1dbc>] (net_rx_action+0xb0/0x1a8)
[   37.163818]  r8:c0501978 r7:0000012c r6:00000040 r5:00000001 r4:cf9ee48c
[   37.170715] r3:c028570c
[   37.173461] [<c02f1d0c>] (net_rx_action+0x0/0x1a8) from [<c0034a1c>] (__do_softirq+0xb0/0x1d8)
[   37.182525] [<c003496c>] (__do_softirq+0x0/0x1d8) from [<c0034ca0>] (do_softirq+0x58/0x64)
[   37.191223] [<c0034c48>] (do_softirq+0x0/0x64) from [<c0034e1c>] (local_bh_enable+0xb0/0xc0)
[   37.200073]  r4:ce944000 r3:0000000a
[   37.203857] [<c0034d6c>] (local_bh_enable+0x0/0xc0) from [<c032b9bc>] (tcp_rcv_established+0x518/0x65c)
[   37.213745]  r5:ce929864 r4:ce9c9500
[   37.217529] [<c032b4a4>] (tcp_rcv_established+0x0/0x65c) from [<c0331ee0>] (tcp_v4_do_rcv+0xbc/0x1bc)
[   37.227203]  r8:c052b256 r7:00000000 r6:cea3c3c0 r5:ce9c9500 r4:ce894d40
[   37.234283] [<c0331e24>] (tcp_v4_do_rcv+0x0/0x1bc) from [<c02e2ba8>] (release_sock+0x78/0xe0)
[   37.243225]  r7:00000000 r6:cea3c3c0 r5:ce9c9500 r4:ce894d40
[   37.249206] [<c02e2b30>] (release_sock+0x0/0xe0) from [<c0322300>] (tcp_recvmsg+0x33c/0x864)
[   37.258087] [<c0321fc4>] (tcp_recvmsg+0x0/0x864) from [<c03401cc>] (inet_recvmsg+0x4c/0x60)
[   37.266906] [<c0340180>] (inet_recvmsg+0x0/0x60) from [<c02df95c>] (sock_recvmsg+0xb8/0xd8)
[   37.275665]  r6:00000000 r5:00000000 r4:00000000
[   37.280548] [<c02df8a4>] (sock_recvmsg+0x0/0xd8) from [<c02e17b4>] (sys_recvfrom+0x98/0xe8)
[   37.289337] [<c02e171c>] (sys_recvfrom+0x0/0xe8) from [<c02e182c>] (sys_recv+0x28/0x30)
[   37.297760] [<c02e1804>] (sys_recv+0x0/0x30) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[   37.306457] Code: e5930010 e5991018 e1500001 3a00000e (e7f001f2)
[   37.312927] ---[ end trace 3d0484470996ac0b ]---
[   37.317810] Kernel panic - not syncing: Fatal exception in interrupt
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Problems in the DaVinci EMAC driver & AM35xx?
  2012-06-13 22:38 Problems in the DaVinci EMAC driver & AM35xx? CF Adad
@ 2012-06-15 19:28 ` CF Adad
  2012-06-22 20:22   ` CF Adad
  0 siblings, 1 reply; 6+ messages in thread
From: CF Adad @ 2012-06-15 19:28 UTC (permalink / raw)
  To: linux-omap@vger.kernel.org

We continue to try to sort this out.  Ignoring the errors for a moment, has anyone else experienced the performance slowdown quoted below between two EMACs?


If not, could anyone with two AM3517-based platforms with the EMACs exposed please test this for us?  We've run several tests between all the hardware we have available, and all perform like this.  Connecting an EMAC to another EMAC and testing with iperf ("iperf -s" on one side, "iperf -c <IP>" on the other) shows the performance drop vs. connecting the same EMAC to nearly any other non-EMAC NIC and running the same test, even through the same infrastructure.


Our theory is that this could be a driver issue.  When only one side has it, performance is fine.  However, if both sides have the same issue, as in the case of EMAC to EMAC, the performance degrades significantly.

Thanks again for looking.

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
{{ DaVinci EMAC server, PC client }}


[  4] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44452
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.1 sec    114 MBytes  94.1 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44459
[  5]  0.0-10.1 sec    114 MBytes  94.0 Mbits/sec
[  4] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44467
[  4]  0.0-10.1 sec    113 MBytes  94.1 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44474
[  5]  0.0-10.1 sec    114 MBytes  94.1 Mbits/sec
[  4] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44481
[  4]  0.0-10.1 sec    114 MBytes  94.1 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.40 port 44488
[  5]  0.0- 9.8 sec    110 MBytes  94.1 Mbits/sec

{{ same DaVinci EMAC server, still running, just changed to DaVinci EMAC client }}

[  5] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37325
[  5]  0.0-10.4 sec  73.9 MBytes  59.6 Mbits/sec
[  6] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37326
[  6]  0.0-10.8 sec  61.5 MBytes  47.8 Mbits/sec
[  5] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37327
[  5]  0.0-10.3 sec  64.6 MBytes  52.6 Mbits/sec
[  6] local 192.168.2.192 port 5001 connected with 192.168.2.74 port 37328
[  6]  0.0-10.0 sec  78.2 MBytes  65.5 Mbits/sec

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Problems in the DaVinci EMAC driver & AM35xx?
  2012-06-15 19:28 ` CF Adad
@ 2012-06-22 20:22   ` CF Adad
  2012-06-22 20:50     ` CF Adad
  0 siblings, 1 reply; 6+ messages in thread
From: CF Adad @ 2012-06-22 20:22 UTC (permalink / raw)
  To: linux-omap@vger.kernel.org
  Cc: mgreer@animalcreek.com, afzal@ti.com, jp.francois@cynove.com,
	tony@atomide.com, santosh.shilimkar@ti.com, khilman@ti.com

All,

Sorry to spam this out to so many folks, but we are REALLY getting stymied by this bug at the moment and cannot believe we're the only ones seeing it.  Google searches for this particular crash ONLY show this thread, and so far, no one has responded.  The folks CC'd on here have been kind enough to take a look at the other issues and/or seem to be the experts on the DaVinci EMAC from other threads.  *Can anyone please help???*

As previously discussed, we're having some stability issues with the AM3517.  We initially thought they may be separate issues, hence the separate threads here, but now we're suspecting they could all be the same error or at least related.  We're presently testing with Technexion TAM3517s.  These processor boards can sometimes run under load for days, but will suddenly produce a crash of some sort.  Sometimes it hangs the kernel completely, sometimes not.  These issues have been *very* hard to track, but we believe we're starting to see a rough commonality in the errors, and they seem more prevalent now that we're enabling and using the EMAC heavily.  All are memory management errors, and now some are directly blaming the EMAC.  As noted previously, our EMAC performance has *always* been questionable, especially when connecting EMAC to EMAC.  So maybe we're onto something?

Our current tests are running on a Linux-omap build straight off the master that's a few days old now.  It's a 3.5-rc2, and we're running SLUB now rather than SLAB as we try to get more information on this mm issue.

Things did seem to get a bit worse after enabling the EMAC interface inside of u-boot, but I suspect that could just be coincidence.  Could that make a difference to Linux somehow?  I'll happily provide anything u-boot, or otherwise, related if that would be helpful.  We are doing _nothing_ special. We are configuring the EMAC just about like everyone else, including the CM-T3517, AM3517EVM, and AM3517-crane.  For now here is our Linux config:

CONFIG_SMSC911X=y
# CONFIG_SMSC911X_ARCH_HOOKS is not set
# CONFIG_NET_VENDOR_STMICRO is not set
CONFIG_NET_VENDOR_TI=y
CONFIG_TI_DAVINCI_EMAC=y
CONFIG_TI_DAVINCI_MDIO=y
CONFIG_TI_DAVINCI_CPDMA=y
# CONFIG_NET_VENDOR_WIZNET is not set
CONFIG_PHYLIB=y

In our board file we just do:


--------board file -------------------------------------------------------------------------------------------------------

#include <linux/davinci_emac.h>
#include "am35xx-emac.h"


...

am35xx_emac_init(AM35XX_DEFAULT_MDIO_FREQUENCY, 1);

...

-----------------------------------------------------------------------------------------------------------------------------

The *ONLY* tweak to the current "linux/arch/arm/mach-omap2/am35xx-emac.c" is a little patch to use the fused MAC address in the AM3517:


--------small patch to "linux/arch/arm/mach-omap2/am35xx-emac.c"----------------------------

void __init am35xx_emac_init(unsigned long mdio_bus_freq, u8 rmii_en)
{
    u32 v;
    int err;

#if 1
    /* use the TI-provided MAC address fused in the AM35xx */
    u32 regval, mac_lo, mac_hi;

    mac_lo = omap_ctrl_readl(AM35XX_CONTROL_FUSE_EMAC_LSB);
    mac_hi = omap_ctrl_readl(AM35XX_CONTROL_FUSE_EMAC_MSB);

    am35xx_emac_pdata.mac_addr[0] = (u_int8_t)((mac_hi & 0xFF0000) >> 16);
    am35xx_emac_pdata.mac_addr[1] = (u_int8_t)((mac_hi & 0xFF00) >> 8);
    am35xx_emac_pdata.mac_addr[2] = (u_int8_t)((mac_hi & 0xFF) >> 0);
    am35xx_emac_pdata.mac_addr[3] = (u_int8_t)((mac_lo & 0xFF0000) >> 16);
    am35xx_emac_pdata.mac_addr[4] = (u_int8_t)((mac_lo & 0xFF00) >> 8);
    am35xx_emac_pdata.mac_addr[5] = (u_int8_t)((mac_lo & 0xFF) >> 0);
#endif


...

-----------------------------------------------------------------------------------------------------------------------------
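
For anyone who wants to sanity-check the byte ordering outside the kernel, here is a minimal userspace sketch of the same unpacking. The function name unpack_fused_mac() is hypothetical, and the layout (three bytes from the MSB register, then three from the LSB register) is simply assumed from the patch above rather than taken from TI documentation:

```c
#include <stdint.h>

/*
 * Unpack a 6-byte MAC address from two register values that each carry
 * 24 significant bits. Assumed layout, mirroring the board-file patch:
 * the MSB register holds bytes 0..2, the LSB register holds bytes 3..5.
 * Any bits above bit 23 are ignored.
 */
static void unpack_fused_mac(uint32_t mac_hi, uint32_t mac_lo, uint8_t mac[6])
{
    mac[0] = (uint8_t)((mac_hi >> 16) & 0xFF);
    mac[1] = (uint8_t)((mac_hi >> 8) & 0xFF);
    mac[2] = (uint8_t)(mac_hi & 0xFF);
    mac[3] = (uint8_t)((mac_lo >> 16) & 0xFF);
    mac[4] = (uint8_t)((mac_lo >> 8) & 0xFF);
    mac[5] = (uint8_t)(mac_lo & 0xFF);
}
```

Feeding it two made-up register values and printing the six bytes is a quick way to confirm the address matches what the bootloader reports.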

There are *NO* changes to "linux/drivers/net/ethernet/ti/..." files at all.

The test I ran was simple:  I kicked off a "ping -s 8000 <IP> &" to a common laptop on several TAM-3517 platforms.  Then I also ran the 'stress' utility (http://weather.ou.edu/~apw/projects/stress/) to put very light, *non-Ethernet* load on the platform.  A day or so later, 2 of the 3 processors are running.  The 3rd crashed with the error below.  This is an identical error to one mentioned in the top post of this thread (http://article.gmane.org/gmane.linux.ports.arm.omap/78647):

[312631.542877] ------------[ cut here ]------------
[312631.547851] WARNING: at drivers/net/ethernet/ti/davinci_emac.c:997 emac_rx_alloc+0x5c/0x64()
[312631.556854] Modules linked in:
[312631.560211] [<c0013d60>] (unwind_backtrace+0x0/0x104) from [<c0394f34>] (dump_stack+0x20/0x24)
[312631.569396] [<c0394f34>] (dump_stack+0x20/0x24) from [<c002efa8>] (warn_slowpath_common+0x5c/0x)
[312631.578948] [<c002efa8>] (warn_slowpath_common+0x5c/0x74) from [<c002efec>] (warn_slowpath_null)
[312631.589233] [<c002efec>] (warn_slowpath_null+0x2c/0x34) from [<c0298bc4>] (emac_rx_alloc+0x5c/0)
[312631.598846] [<c0298bc4>] (emac_rx_alloc+0x5c/0x64) from [<c0299a90>] (emac_rx_handler+0x74/0x11)
[312631.608306] [<c0299a90>] (emac_rx_handler+0x74/0x11c) from [<c029ad88>] (__cpdma_chan_free+0xc8)
[312631.618103] [<c029ad88>] (__cpdma_chan_free+0xc8/0xe0) from [<c029ae6c>] (__cpdma_chan_process+)
[312631.628356] [<c029ae6c>] (__cpdma_chan_process+0xcc/0x104) from [<c029ba00>] (cpdma_chan_proces)
[312631.638732] [<c029ba00>] (cpdma_chan_process+0x4c/0x64) from [<c0299f64>] (emac_poll+0x9c/0x208)
[312631.648101] [<c0299f64>] (emac_poll+0x9c/0x208) from [<c030b228>] (net_rx_action+0xb0/0x1a8)
[312631.657073] [<c030b228>] (net_rx_action+0xb0/0x1a8) from [<c0035c0c>] (__do_softirq+0xb0/0x1d8)
[312631.666351] [<c0035c0c>] (__do_softirq+0xb0/0x1d8) from [<c0036110>] (irq_exit+0x8c/0x94)
[312631.675048] [<c0036110>] (irq_exit+0x8c/0x94) from [<c000f010>] (handle_IRQ+0x44/0x94)
[312631.683502] [<c000f010>] (handle_IRQ+0x44/0x94) from [<c00085b4>] (omap3_intc_handle_irq+0x68/0)
[312631.693145] [<c00085b4>] (omap3_intc_handle_irq+0x68/0x78) from [<c000e480>] (__irq_usr+0x40/0x)
[312631.702667] Exception stack(0xce029fb0 to 0xce029ff8)
[312631.708068] 9fa0:                                     00000000 00000000 b6e64020 b6e63458
[312631.716796] 9fc0: 00000000 00000001 b6e64020 0000e530 00000011 0000e478 00000001 00000000
[312631.725494] 9fe0: b6e63198 bef7ab68 b6d6c4f4 b6d6c504 20000010 ffffffff
[312631.732543] ---[ end trace bf1e7d78367d02a3 ]---
[312631.737579] stress: page allocation failure: order:0, mode:0x120
[312631.743988] [<c0013d60>] (unwind_backtrace+0x0/0x104) from [<c0394f34>] (dump_stack+0x20/0x24)
[312631.753173] [<c0394f34>] (dump_stack+0x20/0x24) from [<c009cec4>] (warn_alloc_failed+0xd8/0x11c)
[312631.762481] [<c009cec4>] (warn_alloc_failed+0xd8/0x11c) from [<c009f4d8>] (__alloc_pages_nodema)
[312631.773101] [<c009f4d8>] (__alloc_pages_nodemask+0x508/0x678) from [<c030031c>] (netdev_alloc_f)
[312631.783630] [<c030031c>] (netdev_alloc_frag+0xa4/0xdc) from [<c03012ec>] (__netdev_alloc_skb+0x)
[312631.793609] [<c03012ec>] (__netdev_alloc_skb+0x78/0xd0) from [<c0298b90>] (emac_rx_alloc+0x28/0)
[312631.803192] [<c0298b90>] (emac_rx_alloc+0x28/0x64) from [<c0299a90>] (emac_rx_handler+0x74/0x11)
[312631.812622] [<c0299a90>] (emac_rx_handler+0x74/0x11c) from [<c029ad88>] (__cpdma_chan_free+0xc8)
[312631.822387] [<c029ad88>] (__cpdma_chan_free+0xc8/0xe0) from [<c029ae6c>] (__cpdma_chan_process+)
[312631.832641] [<c029ae6c>] (__cpdma_chan_process+0xcc/0x104) from [<c029ba00>] (cpdma_chan_proces)
[312631.842987] [<c029ba00>] (cpdma_chan_process+0x4c/0x64) from [<c0299f64>] (emac_poll+0x9c/0x208)
[312631.852294] [<c0299f64>] (emac_poll+0x9c/0x208) from [<c030b228>] (net_rx_action+0xb0/0x1a8)
[312631.861267] [<c030b228>] (net_rx_action+0xb0/0x1a8) from [<c0035c0c>] (__do_softirq+0xb0/0x1d8)
[312631.870483] [<c0035c0c>] (__do_softirq+0xb0/0x1d8) from [<c0036110>] (irq_exit+0x8c/0x94)
[312631.879180] [<c0036110>] (irq_exit+0x8c/0x94) from [<c000f010>] (handle_IRQ+0x44/0x94)
[312631.887603] [<c000f010>] (handle_IRQ+0x44/0x94) from [<c00085b4>] (omap3_intc_handle_irq+0x68/0)
[312631.897186] [<c00085b4>] (omap3_intc_handle_irq+0x68/0x78) from [<c000e480>] (__irq_usr+0x40/0x)
[312631.906707] Exception stack(0xce029fb0 to 0xce029ff8)
[312631.912078] 9fa0:                                     00000000 00000000 b6e64020 b6e63458
[312631.920776] 9fc0: 00000000 00000001 b6e64020 0000e530 00000011 0000e478 00000001 00000000
[312631.929473] 9fe0: b6e63198 bef7ab68 b6d6c4f4 b6d6c504 20000010 ffffffff
[312631.936492] Mem-info:
[312631.938964] Normal per-cpu:
[312631.941986] CPU    0: hi:   90, btch:  15 usd:  13
[312631.947113] active_anon:41797 inactive_anon:30 isolated_anon:0
[312631.947113]  active_file:2215 inactive_file:12995 isolated_file:0
[312631.947113]  unevictable:0 dirty:11034 writeback:1 unstable:0
[312631.947113]  free:2352 slab_reclaimable:583 slab_unreclaimable:1014
[312631.947113]  mapped:982 shmem:83 pagetables:606 bounce:0
[312631.978271] Normal free:9408kB min:2028kB low:2532kB high:3040kB active_anon:167188kB inactive_o
[312632.019989] lowmem_reserve[]: 0 0 0
[312632.023742] Normal: 1488*4kB 432*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*20B
[312632.034973] 15295 total pagecache pages
[312632.049987] 65536 pages of RAM
[312632.053283] 2944 free pages
[312632.056304] 2306 reserved pages
[312632.059692] 1333 slab pages
[312632.062713] 3869 pages shared
[312632.065917] 0 pages swap cached


Any ideas what could be causing this?  It has now happened with both SLAB and SLUB and with both heavy and relatively light Ethernet loads on the EMAC.  Can anyone please help?
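
One detail in the log that may be worth decoding is the "mode:0x120" on the page allocation failure. The sketch below is a small userspace decoder; the bit values in the table are an assumption (as recalled from include/linux/gfp.h in 3.x-era kernels) and should be checked against the tree actually in use. The helper names gfp_describe() and gfp_may_sleep() are hypothetical:

```c
#include <stddef.h>
#include <stdio.h>

struct gfp_bit {
    unsigned int mask;
    const char *name;
};

/* ASSUMED bit values for a 3.x-era kernel; verify against include/linux/gfp.h. */
static const struct gfp_bit gfp_bits[] = {
    { 0x001u, "__GFP_DMA" },     { 0x002u, "__GFP_HIGHMEM" },
    { 0x004u, "__GFP_DMA32" },   { 0x008u, "__GFP_MOVABLE" },
    { 0x010u, "__GFP_WAIT" },    { 0x020u, "__GFP_HIGH" },
    { 0x040u, "__GFP_IO" },      { 0x080u, "__GFP_FS" },
    { 0x100u, "__GFP_COLD" },    { 0x200u, "__GFP_NOWARN" },
};

/* Write the names of the set bits into out, separated by '|'. */
static size_t gfp_describe(unsigned int mode, char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
        if (!(mode & gfp_bits[i].mask))
            continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? "|" : "", gfp_bits[i].name);
        if (n < 0 || (size_t)n >= outlen - used)
            break; /* output truncated; stop here */
        used += (size_t)n;
    }
    return used;
}

/* Non-zero if the allocation would be allowed to sleep and reclaim. */
static int gfp_may_sleep(unsigned int mode)
{
    return (mode & 0x010u) != 0; /* __GFP_WAIT, same assumption as above */
}
```

Under those assumed values, 0x120 has no __GFP_WAIT bit, which would mean the RX buffer allocation is atomic and cannot wait for memory to be reclaimed. If that reading is right, it would at least be consistent with the warning only appearing while stress is holding a lot of memory.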

Thanks in advance!


* Re: Problems in the DaVinci EMAC driver & AM35xx?
  2012-06-22 20:22   ` CF Adad
@ 2012-06-22 20:50     ` CF Adad
  2012-06-28 17:56       ` Kevin Hilman
  0 siblings, 1 reply; 6+ messages in thread
From: CF Adad @ 2012-06-22 20:50 UTC (permalink / raw)
  To: linux-omap@vger.kernel.org
  Cc: mgreer@animalcreek.com, afzal@ti.com, jp.francois@cynove.com,
	tony@atomide.com, santosh.shilimkar@ti.com, khilman@ti.com

A quick follow-up note:

The stress line that was running during the crash mentioned was: "stress --cpu 8 --io 8 --vm 1 --vm-bytes 150M --hdd 2 --timeout 60". The "--hdd 2" option was a leftover that was not supposed to be there.  So that may have created a chunk more disk work for the processor.  Regardless, stress did not crash.  Ping died with the EMAC error mentioned, but stress was still running fine when I stopped it.


It's been suggested that stress may be eating up too much memory and causing the EMAC driver to do this.  That's a fair point, as these "WARNING: at drivers/net/ethernet/ti/davinci_emac.c:997 emac_rx_alloc" errors only appear to show up with stress running.  I suspect that could be true, though no other drivers or other applications are throwing similar errors.  When I've cranked stress too high memory-wise in the past, I've seen it break out the oom-killer.  That's what I would expect.  Since I'm no longer seeing that, I suspect I'm playing within the bounds.  Regardless, should a userspace app sucking up too much memory cause a driver to crash?


I bring these questions here, as the crash's call stack shares so many similarities to the "SLAB crash" discussed "http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/", that I think they're related.  At the very least, the EMAC to EMAC performance issues we have always seen are troubling.  IMHO, clearly the EMAC driver is doing something a little questionable.

Thoughts?

Thanks again.

* Re: Problems in the DaVinci EMAC driver & AM35xx?
  2012-06-22 20:50     ` CF Adad
@ 2012-06-28 17:56       ` Kevin Hilman
  2012-06-28 22:36         ` CF Adad
  0 siblings, 1 reply; 6+ messages in thread
From: Kevin Hilman @ 2012-06-28 17:56 UTC (permalink / raw)
  To: CF Adad
  Cc: linux-omap@vger.kernel.org, mgreer@animalcreek.com, afzal@ti.com,
	jp.francois@cynove.com, tony@atomide.com,
	santosh.shilimkar@ti.com

CF Adad <cfadad@rocketmail.com> writes:

[...]

> I bring these questions here, as the crash's call stack shares so many
> similarities to the "SLAB crash" discussed
> "http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/", that I
> think they're related.  At the very least, the EMAC to EMAC
> performance issues we have always seen are troubling.  IMHO, clearly
> the EMAC driver is doing something a little questionable.

I don't know about the crash, but the EMAC performance issues are known.

Basically, the EMAC on the AM3xxx is not wakeup capable.  Meaning that
if you enter idle, an EMAC interrupt will not wake the system.  The
system will not wake until another wakeup-capable device (e.g. timer)
goes off.

To see if this is causing your problems, add 'nohlt' to your kernel
command line.  This will prevent the SoC from ever entering idle, so
your EMAC interrupt latency will drop significantly.

Kevin



* Re: Problems in the DaVinci EMAC driver & AM35xx?
  2012-06-28 17:56       ` Kevin Hilman
@ 2012-06-28 22:36         ` CF Adad
  0 siblings, 0 replies; 6+ messages in thread
From: CF Adad @ 2012-06-28 22:36 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: linux-omap@vger.kernel.org, mgreer@animalcreek.com, afzal@ti.com,
	jp.francois@cynove.com, tony@atomide.com,
	santosh.shilimkar@ti.com

Hi Kevin,

Thanks for the message!  I did not realize the issue with the wake-ups affected the EMAC, but I have been running my kernel with 'nohlt' since we transitioned to linux-omap-3.4rcX several weeks back. It was required to keep the kernel from crashing on boot.  Running with that option does not appear to have altered my EMAC performance really at all, but I suppose I could be missing something else somewhere...


To try to debug this myself, I did spend a bit of time diving into the driver.  I found several things that caused an eyebrow to raise, but I do not believe any of them are "smoking guns".  I believe there are a few places where skb's could leak or simply be dropped from the RX pool. So I put some prints there, but I've been testing for a few days and have never seen any of my prints hit.  I suspect they are holes in the code, but clearly not my problem.  I'm a novice with the EMAC and CPDMA drivers. I could not even find where the "rx_dropped" stat was being incremented, so obviously this was not a thorough review.

Below is a patch I've been using to test these areas with.  My main "discoveries", if valid, are:

1.) emac_rx_alloc() is redundant to netdev_alloc_skb_ip_align().  So, I swapped it out.

2.) skbs can be _leaked_ by "failed" cpdma_chan_submit()?
dma_map_single() is not checked for a valid return in cpdma_chan_submit(). So if it fails, the code continues to shove forward. I don't know what that would do. So my patch below calls dma_mapping_error() to check that return, and returns an error back to the calling function if a dma_map_single() failed. This lets the calling function, like emac_rx_handler(), free the skb.


3.) losing skb's from our rx pool?
IF I'm understanding this correctly... I noticed that up front, in emac_dev_open(), EMAC_DEF_RX_NUM_DESC skb's are allocated and fed to cpdma.  However, any errors in that loop do not appear to stop things from continuing on.  I suspect this is a bug, albeit one that really never gets hit.  So I put a little code in there to not just bust out of the loop, but rather clean up any dangling skbs and return an error code.
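
On point 1, the size arithmetic behind the swap can be sketched in userspace. This is an idealized model only: it ignores the extra rounding and headroom the real skb allocator adds, NET_IP_ALIGN is assumed to be 2 (some architectures define it as 0), and the struct and function names are hypothetical. It assumes netdev_alloc_skb_ip_align() adds NET_IP_ALIGN to the requested length and reserves it, which is why the patch below also drops NET_IP_ALIGN from rx_buf_size:

```c
#include <stddef.h>

#define NET_IP_ALIGN 2 /* ASSUMED value for this sketch */

/* Idealized view of a receive buffer; not a real sk_buff. */
struct buf_model {
    size_t allocated; /* bytes requested from the allocator */
    size_t headroom;  /* bytes reserved in front of the data */
    size_t tailroom;  /* bytes left for the received frame */
};

/* Old path: rx_buf_size already includes NET_IP_ALIGN, then
 * emac_rx_alloc() calls netdev_alloc_skb() and skb_reserve(). */
static struct buf_model old_rx_alloc(size_t max_frame)
{
    size_t rx_buf_size = max_frame + NET_IP_ALIGN;
    struct buf_model b = { rx_buf_size, NET_IP_ALIGN,
                           rx_buf_size - NET_IP_ALIGN };
    return b;
}

/* New path: rx_buf_size is the bare frame size and the ip_align
 * helper is assumed to add and reserve NET_IP_ALIGN itself. */
static struct buf_model new_rx_alloc(size_t max_frame)
{
    size_t rx_buf_size = max_frame;
    size_t allocated = rx_buf_size + NET_IP_ALIGN;
    struct buf_model b = { allocated, NET_IP_ALIGN,
                           allocated - NET_IP_ALIGN };
    return b;
}
```

In this model both paths request the same number of bytes and leave the same room for the frame, so the swap should be behavior-neutral as long as rx_buf_size is adjusted at the same time.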


Once the driver is up and running, it looks to me like the emac_rx_handler() is supposed to basically replace every skb it processes with a new one.  Sounds reasonable. However, I think there are gaps in the logic that could allow this replacement to fail, presumably leaving the cpdma with fewer skbs than desired. In essence, if the replacement fails, that skb is _never_ resubmitted to cpdma, so it operates forever with 1 less. The patch below just has dev_err() messages in there that I put in place to trap that if it occurs. My plan to fix this, if my prints showed it was worth fixing, would be to insert a value into priv that told me how many skbs are presently submitted to cpdma. So, in that initial loop, I'd increment that value up to the limit. Every time emac_rx_handler() was called, I would initially decrement the value to record the one I was pushing up the stack. Then, instead of just allocating 1 skb to replace that packet, I would loop against that value (how many skbs *should* be in cpdma) and insert enough to get my buffer count back up to where it should be. From a logic flow perspective, if I failed to either alloc or submit a replacement skb on that particular emac_rx_handler() call, it would be noted and next time around I would try to insert 2 to make up for the one lost previously.
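
A rough userspace sketch of that counter-based refill idea follows. Nothing here exists in davinci_emac.c: struct rx_pool, rx_pool_consume(), rx_pool_refill() and the fake_* helpers are all hypothetical, and the callbacks merely stand in for skb allocation, cpdma_chan_submit() and dev_kfree_skb_any():

```c
#include <stddef.h>

/* Hypothetical bookkeeping that would live in the driver's priv data. */
struct rx_pool {
    int target; /* buffers that should be queued to the DMA engine */
    int queued; /* buffers currently queued */
};

typedef void *(*rx_alloc_fn)(void *ctx);
typedef int (*rx_submit_fn)(void *ctx, void *buf);
typedef void (*rx_free_fn)(void *ctx, void *buf);

/* Record that one completed buffer was handed up the stack. */
static void rx_pool_consume(struct rx_pool *pool)
{
    if (pool->queued > 0)
        pool->queued--;
}

/* Top the pool back up to its target; stop at the first failure so the
 * shortfall is retried on the next call. Returns buffers queued now. */
static int rx_pool_refill(struct rx_pool *pool, rx_alloc_fn alloc,
                          rx_submit_fn submit, rx_free_fn release, void *ctx)
{
    int added = 0;

    while (pool->queued < pool->target) {
        void *buf = alloc(ctx);
        if (!buf)
            break;
        if (submit(ctx, buf) < 0) {
            release(ctx, buf); /* do not leak a buffer that was not queued */
            break;
        }
        pool->queued++;
        added++;
    }
    return added;
}

/* Fake environment for exercising the logic outside the kernel. */
struct fake_env {
    int fail_allocs; /* how many upcoming allocations should fail */
    int live;        /* buffers handed out and not yet released */
};

static char fake_buffer; /* shared dummy buffer; contents are never used */

static void *fake_alloc(void *ctx)
{
    struct fake_env *env = ctx;

    if (env->fail_allocs > 0) {
        env->fail_allocs--;
        return NULL;
    }
    env->live++;
    return &fake_buffer;
}

static int fake_submit(void *ctx, void *buf)
{
    (void)ctx;
    (void)buf;
    return 0; /* always succeeds in this sketch */
}

static void fake_free(void *ctx, void *buf)
{
    struct fake_env *env = ctx;

    (void)buf;
    env->live--;
}
```

With a target of 4, one failed allocation leaves the pool at 3 after that call, and the following call queues 2 buffers to catch up, which is the "insert 2 next time" behavior described above. Whether the extra accounting is worth it depends on whether the lost-skb prints ever fire.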


Thoughts on this? Regardless, as I noted at the top, none of these tweaks or prints are showing. So I'm sure these are not core issues at the moment.  I want to track down where the rx_dropped stat is being set so I can have a look there. Still, I suspect my performance issue is something DMA-related. I'm just not well-versed enough in all that to track it down.


Thanks again for all your assistance!

------------------------------------------------------------------------------------------------------------

--- linux/drivers/net/ethernet/ti/davinci_emac.c    2012-06-28 17:11:52.155508605 -0400
+++ linux-omap_3.5rc4/drivers/net/ethernet/ti/davinci_emac.c    2012-06-26 22:39:57.000000000 -0400
@@ -991,6 +991,7 @@
     return IRQ_HANDLED;
 }
 
+/* SAME AS: netdev_alloc_skb_ip_align()
 static struct sk_buff *emac_rx_alloc(struct emac_priv *priv)
 {
     struct sk_buff *skb = netdev_alloc_skb(priv->ndev, priv->rx_buf_size);
@@ -999,6 +1000,7 @@
     skb_reserve(skb, NET_IP_ALIGN);
     return skb;
 }
+*/
 
 static void emac_rx_handler(void *token, int len, int status)
 {
@@ -1028,10 +1030,12 @@
     ndev->stats.rx_packets++;
 
     /* alloc a new packet for receive */
-    skb = emac_rx_alloc(priv);
-    if (!skb) {
+//    skb = emac_rx_alloc(priv);
+    skb = netdev_alloc_skb_ip_align(priv->ndev, priv->rx_buf_size);
+    if (unlikely(!skb)) {
         if (netif_msg_rx_err(priv) && net_ratelimit())
             dev_err(emac_dev, "failed rx buffer alloc\n");
+dev_err(emac_dev, "(1) lost rx skb?\n");
         return;
     }
 
@@ -1041,7 +1045,10 @@
 
     WARN_ON(ret == -ENOMEM);
     if (unlikely(ret < 0))
+    {
         dev_kfree_skb_any(skb);
+dev_err(emac_dev, "(2) lost rx skb?\n");
+    }
 }
 
 static void emac_tx_handler(void *token, int len, int status)
@@ -1053,7 +1060,7 @@
     atomic_dec(&priv->cur_tx);
 
     if (unlikely(netif_queue_stopped(ndev)))
-        netif_start_queue(ndev);
+        netif_start_queue(ndev);    /* maybe consider using 'netif_wake_queue(ndev);' here instead? */
     ndev->stats.tx_packets++;
     ndev->stats.tx_bytes += len;
     dev_kfree_skb_any(skb);
@@ -1105,6 +1112,7 @@
     return NETDEV_TX_OK;
 
 fail_tx:
+dev_err(emac_dev, "(3) lost TX skb, queue stopped - needs restarted (tx_handler)?\n");
     ndev->stats.tx_dropped++;
     netif_stop_queue(ndev);
     return NETDEV_TX_BUSY;
@@ -1540,7 +1548,8 @@
         ndev->dev_addr[cnt] = priv->mac_addr[cnt];
 
     /* Configuration items */
-    priv->rx_buf_size = EMAC_DEF_MAX_FRAME_SIZE + NET_IP_ALIGN;
+//    priv->rx_buf_size = EMAC_DEF_MAX_FRAME_SIZE + NET_IP_ALIGN;
+    priv->rx_buf_size = EMAC_DEF_MAX_FRAME_SIZE;
 
     priv->mac_hash1 = 0;
     priv->mac_hash2 = 0;
@@ -1548,15 +1557,29 @@
     emac_write(EMAC_MACHASH2, 0);
 
     for (i = 0; i < EMAC_DEF_RX_NUM_DESC; i++) {
-        struct sk_buff *skb = emac_rx_alloc(priv);
+//        struct sk_buff *skb = emac_rx_alloc(priv);
+        struct sk_buff *skb = netdev_alloc_skb_ip_align(priv->ndev, priv->rx_buf_size);
 
-        if (!skb)
+        if (unlikely(!skb))
+        {
+            ret = -ENOMEM;
             break;
+        }
 
         ret = cpdma_chan_submit(priv->rxchan, skb, skb->data,
                     skb_tailroom(skb), GFP_KERNEL);
         if (WARN_ON(ret < 0))
+        {
+            dev_kfree_skb_any(skb);
+            ret = -ENOMEM;
             break;
+        }
+    }
+    
+    if (unlikely(ret < 0))
+    {
+        dev_err(emac_dev, "failed prestaging rx skbs in cpdma");
+        return ret;
     }
 
     /* Request IRQ */
@@ -1579,6 +1602,7 @@
 
         coal.rx_coalesce_usecs = (priv->coal_intvl << 4);
         emac_set_coalesce(ndev, &coal);
+dev_info(emac_dev, "interrupt pacing enabled");
     }
 
     cpdma_ctlr_start(priv->dma);
--- linux/drivers/net/ethernet/ti/davinci_cpdma.c    2012-06-28 17:11:52.155508605 -0400
+++ linux-omap_3.5rc4/drivers/net/ethernet/ti/davinci_cpdma.c    2012-06-26 17:15:31.000000000 -0400
@@ -679,6 +679,12 @@
     }
 
     buffer = dma_map_single(ctlr->dev, data, len, chan->dir);
+    if (dma_mapping_error(ctlr->dev, buffer)) {
+        dev_err(ctlr->dev, "CPDMA: dma_map_single failed!");
+        ret = -EINVAL;
+        goto unlock_ret;
+    }
+    
     mode = CPDMA_DESC_OWNER | CPDMA_DESC_SOP | CPDMA_DESC_EOP;
 
     desc_write(desc, hw_next,   0);