* Kernel oops while routing
@ 2001-11-26 16:54 Ricardo Scop
2001-11-29 22:27 ` Ricardo Scop
0 siblings, 1 reply; 6+ messages in thread
From: Ricardo Scop @ 2001-11-26 16:54 UTC (permalink / raw)
To: linuxppc-embedded
Hi,
I'm doing some performance tests with a proprietary Linuxppc-based box
configured as a routing system. The processor is an MPC8255 @ 133 MHz
(33 MHz on the bus) and the Linux revision is 2.4.15pre8, rsync'ed from
the MVista linuxppc_2_4 repository.
We are using two other Linux workstations to exercise the router, both
running Netpipe 2.4, one as a client application, the other as a server. Each
one is connected to a different fast ethernet port of our router box
(100 Mbps, full-duplex mode) using crossover cables.
We're achieving throughputs around 40 Mbps with this setup, which is enough
for our purposes.
But when we try a 30 MByte block in Netpipe, the kernel in our Linux box
crashes big time (trace below, including ksymoops decode). We also tried
Linux 2.4.4, but then the throughput drops to around 13 Mbps.
My questions are:
Has anyone observed this kind of crash?
Is there any workaround?
Any pointers or suggestions will be greatly appreciated.
Regards,
~Ricardo
R SCOP Consulting.
Crash trace:
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Oops: kernel access of bad area, sig: 11
NIP: C00A9E78 XER: 00000000 LR: C00A9E54 SP: C1FADC00 REGS: c1fadb50 TRAP: 0300
MSR: 00001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK = c1fac000[3] 'ksoftirqd_CPU0' Last syscall: -1
last math c1f30000 last altivec 00000000
GPR00: 00000000 C1FADC00 C1FAC000 00000001 00009032 C1FADCE0 00000004 0000001F
GPR08: C0154680 00000025 40601801 08000000 C1FADD98 1001F5F0 01FDF000 00000000
GPR16: 00000001 007FFF00 FFFFFFFF 01FD816C 00001032 01FADCD0 00000000 C0003F80
GPR24: C0004F90 00000400 C1FA5200 C0180000 000005FA 00000020 C0178E20 C0FC9594
Call backtrace:
00000000 C00A558C C00A5234 C0004EEC C0004FD8 C0003F80 C00E9FB4
C00B5B14 C00B5F74 C00BCA94 C00AED38 C0016548 C0005034 C0003F80
C0016548 C0016C20 C0006464
Warning (Oops_read): Code line not seen, dumping what data is available
>>???; c00a9e78 <alloc_skb+d4/204> <=====
Trace; 00000000 Before first symbol
Trace; c00a558c <fcc_enet_rx+e4/220>
Trace; c00a5234 <fcc_enet_interrupt+3c/2b0>
Trace; c0004eec <ppc_irq_dispatch_handler+190/234>
Trace; c0004fd8 <do_IRQ+48/bc>
Trace; c0003f80 <ret_from_intercept+0/8>
Trace; c00e9fb4 <ip_conntrack_in+248/318>
Trace; c00b5b14 <nf_iterate+64/e4>
Trace; c00b5f74 <nf_hook_slow+100/1cc>
Trace; c00bca94 <ip_rcv+450/4b0>
Trace; c00aed38 <net_rx_action+2b0/3e0>
Trace; c0016548 <do_softirq+88/100>
Trace; c0005034 <do_IRQ+a4/bc>
Trace; c0003f80 <ret_from_intercept+0/8>
Trace; c0016548 <do_softirq+88/100>
Trace; c0016c20 <ksoftirqd+84/a8>
Trace; c0006464 <kernel_thread+34/40>
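Each decoded entry in the trace above has the form address <symbol+offset/size>, with all values in hexadecimal. As a minimal sketch, assuming exactly that layout, the helpers below split one entry and recover the start address of the function. The names trace_entry, parse_trace and func_start are hypothetical and are not part of ksymoops.

```c
#include <assert.h>
#include <stdio.h>

/* One decoded backtrace entry, e.g. "c00a558c <fcc_enet_rx+e4/220>". */
struct trace_entry {
    unsigned long addr;  /* address found on the stack */
    char sym[64];        /* enclosing function name */
    unsigned long off;   /* offset of addr inside the function */
    unsigned long size;  /* total size of the function in bytes */
};

/* Parse the part of a ksymoops line that follows "Trace; ".
 * Returns 0 on success, -1 if the text does not match the layout. */
int parse_trace(const char *s, struct trace_entry *e)
{
    assert(s != NULL && e != NULL);
    int n = sscanf(s, "%lx <%63[^+]+%lx/%lx>",
                   &e->addr, e->sym, &e->off, &e->size);
    return n == 4 ? 0 : -1;
}

/* Start address of the function that contains the entry. */
unsigned long func_start(const struct trace_entry *e)
{
    return e->addr - e->off;
}
```

For the entry c00a558c <fcc_enet_rx+e4/220>, func_start gives c00a54a8: the address lies 0xe4 bytes into a function that is 0x220 bytes long.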
** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/
* Re: Kernel oops while routing
2001-11-26 16:54 Kernel oops while routing Ricardo Scop
@ 2001-11-29 22:27 ` Ricardo Scop
2001-11-29 22:42 ` Dan Malek
0 siblings, 1 reply; 6+ messages in thread
From: Ricardo Scop @ 2001-11-29 22:27 UTC (permalink / raw)
To: linuxppc-embedded; +Cc: andy_lowe
Hi everybody,
Ok, time for an update on the problem I reported earlier in this thread.
First of all, many thanks to Andy Lowe for his patch. The problem went away
with it, but with some performance penalties that I would like to discuss.
Even before applying the patch, we found out that the crash was caused by a
memory leak in the system. When the memory was exhausted, the kernel crashed.
Furthermore, we discovered that turning CONFIG...MMAP off considerably
decreased the leak rate, at least in our test setup.
Then, applying the patch stopped the leak. But, just as I mentioned before,
the performance of our routing test decreased a lot. Best throughput rates
dropped to 15 Mbps, against 46 Mbps before patching.
I'm kind of lost with these performance variations. As far as I could see,
the patch did not insert much processing overhead, so...
Tips, comments, pointers on what to look for... everything will be
appreciated.
[]'s,
Scop
mailto:scop@digitel.com.br
------------------------------------------------------------------
"What's money? A man is a success if he gets up in the morning and goes to
bed at night and in between does what he wants to do." ~Bob Dylan
* Re: Kernel oops while routing
2001-11-29 22:27 ` Ricardo Scop
@ 2001-11-29 22:42 ` Dan Malek
2001-12-05 3:24 ` Re[2]: " Ricardo Scop
0 siblings, 1 reply; 6+ messages in thread
From: Dan Malek @ 2001-11-29 22:42 UTC (permalink / raw)
To: scop; +Cc: linuxppc-embedded, andy_lowe
Ricardo Scop wrote:
> I'm kind of lost with this performance variations. As far as I could
> see, the patch did not insert much processing overhead, so...
Perhaps if someone would post the patch for the rest of us to see we
could be of some assistance.
-- Dan
* Re[2]: Kernel oops while routing
2001-11-29 22:42 ` Dan Malek
@ 2001-12-05 3:24 ` Ricardo Scop
2001-12-05 17:56 ` Dan Malek
0 siblings, 1 reply; 6+ messages in thread
From: Ricardo Scop @ 2001-12-05 3:24 UTC (permalink / raw)
To: Dan Malek; +Cc: linuxppc-embedded, andy_lowe
[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]
Dan,
I apologize for the delay. We were conducting some more tests so as to not
raise any more false alarms :-) about kernel crashes, memory leaks and/or
performance problems in the linuxppc port to our 8255 hardware platform.
So, after a _careful_ test period, these are our findings:
1 - Andy's patch (which is attached) works well and does _not_ impose any
performance penalty in our tests (we were having PHY negotiation problems
there, again :-/ ).
2 - We _did_ have a memory leak which was causing a kernel crash after a
while, and it _was_ solved by Andy's patch (thanks, Andy!). I believe the
bug is still in linuxppc_2_4, _2_4_devel and _2_5. It goes like this:
- in fcc_enet_start_xmit, after setting up another bd and incrementing bdp,
the next bd's tx-ready bit is tested in order to stop the xmit queue if it
is set, ok? But, sometimes, the CPM may already have cleared this bit _and_
the corresponding interrupt has not been serviced yet (because we're in a
spin_lock_irq); so, netif_stop_queue is not called in this case, nor is
tx_full set;
- next, the interrupt is serviced, but then cur_tx equals dirty_tx _and_
tx_full is not set, so no sk_buffers are freed!
- next time fcc_enet_start_xmit is called, the tx-ready bit is still cleared
and the next bd is used, but the corresponding sk_buffer wasn't freed, and
its pointer is now lost;
- cep->lock can't help with this problem, because the CPM is not bothered
by that 8-).
AFAIK, Andy's solution is a good one. So, we're offering this patch to the
public list (with Andy's blessing :-). I can provide any other details about
our tests, if required.
Thanks,
Ricardo Scop
mailto:scop@vanet.com.br
R SCOP Consulting
[-- Attachment #2: patch-2.41.16-pre1-fcc_enet --]
[-- Type: application/octet-stream, Size: 2359 bytes --]
Index: arch/ppc/8260_io/fcc_enet.c
===================================================================
RCS file: /var/cvs/kernel/arch/ppc/8260_io/fcc_enet.c,v
retrieving revision 1.1.1.1.4.2
diff -u -r1.1.1.1.4.2 fcc_enet.c
--- arch/ppc/8260_io/fcc_enet.c 4 Sep 2001 16:37:18 -0000 1.1.1.1.4.2
+++ arch/ppc/8260_io/fcc_enet.c 27 Nov 2001 18:42:43 -0000
@@ -300,7 +300,7 @@
     volatile fcc_t *fccp;
     volatile fcc_enet_t *ep;
     struct net_device_stats stats;
-    uint tx_full;
+    uint tx_free;
     spinlock_t lock;

 #ifdef CONFIG_USE_MDIO
@@ -360,9 +360,9 @@
     bdp = cep->cur_tx;

 #ifndef final_version
-    if (bdp->cbd_sc & BD_ENET_TX_READY) {
+    if (!cep->tx_free || (bdp->cbd_sc & BD_ENET_TX_READY)) {
         /* Ooops. All transmit buffers are full. Bail out.
-         * This should not happen, since cep->tx_full should be set.
+         * This should not happen, since the tx queue should be stopped.
          */
         printk("%s: tx queue full!.\n", dev->name);
         return 1;
@@ -407,10 +407,8 @@
     else
         bdp++;

-    if (bdp->cbd_sc & BD_ENET_TX_READY) {
+    if (!--cep->tx_free)
         netif_stop_queue(dev);
-        cep->tx_full = 1;
-    }

     cep->cur_tx = (cbd_t *)bdp;

@@ -431,8 +429,8 @@
 {
     int i;
     cbd_t *bdp;
-    printk(" Ring data dump: cur_tx %p%s cur_rx %p.\n",
-           cep->cur_tx, cep->tx_full ? " (full)" : "",
+    printk(" Ring data dump: cur_tx %p tx_free %d cur_rx %p.\n",
+           cep->cur_tx, cep->tx_free,
            cep->cur_rx);
     bdp = cep->tx_bd_base;
     printk(" Tx @base %p :\n", bdp);
@@ -450,7 +448,7 @@
                bdp->cbd_bufaddr);
     }
 #endif
-    if (!cep->tx_full)
+    if (cep->tx_free)
         netif_wake_queue(dev);
 }

@@ -492,7 +490,7 @@
     spin_lock(&cep->lock);
     bdp = cep->dirty_tx;
     while ((bdp->cbd_sc&BD_ENET_TX_READY)==0) {
-        if ((bdp==cep->cur_tx) && (cep->tx_full == 0))
+        if (cep->tx_free == TX_RING_SIZE)
             break;

         if (bdp->cbd_sc & BD_ENET_TX_HB) /* No heartbeat */
@@ -546,8 +544,7 @@
         /* Since we have freed up a buffer, the ring is no longer
          * full.
          */
-        if (cep->tx_full) {
-            cep->tx_full = 0;
+        if (!cep->tx_free++) {
             if (netif_queue_stopped(dev)) {
                 netif_wake_queue(dev);
             }
@@ -1529,6 +1526,7 @@
 #endif

     cep->dirty_tx = cep->cur_tx = cep->tx_bd_base;
+    cep->tx_free = TX_RING_SIZE;
     cep->cur_rx = cep->rx_bd_base;

     ep->fen_genfcc.fcc_rstate = (CPMFCR_GBL | CPMFCR_EB) << 24;
* Re: Kernel oops while routing
2001-12-05 3:24 ` Re[2]: " Ricardo Scop
@ 2001-12-05 17:56 ` Dan Malek
0 siblings, 0 replies; 6+ messages in thread
From: Dan Malek @ 2001-12-05 17:56 UTC (permalink / raw)
To: scop; +Cc: linuxppc-embedded, andy_lowe
Ricardo Scop wrote:
> So, we're offering this patch to the public list (with Andy's
> blessing :-). I can provide any other details about our tests, if
> required.
Cool. Thanks. This could also be a problem with all Ethernet drivers
(SCC, FCC, and FEC) on both 8xx and 8260. I'll look at the others.
-- Dan
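The race described in Ricardo's 2001-12-05 message, and the counter-based accounting in the attached patch, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the ring has two descriptors, buffers are represented by a has_skb flag, and the CPM is simulated by clearing the ready flag by hand. All names other than cur_tx, dirty_tx, tx_full and tx_free are hypothetical, and none of this is the driver code itself.

```c
#include <assert.h>
#include <stdbool.h>

#define RING 2  /* hypothetical tiny ring so "full" is reached quickly */

struct bd { bool ready; bool has_skb; };

struct ring {
    struct bd bd[RING];
    int cur_tx, dirty_tx;
    bool tx_full;  /* flag used by the original logic */
    int tx_free;   /* counter used by the patched logic */
    int leaked;    /* buffers whose pointer was overwritten */
};

/* Attach a buffer to the current descriptor and hand it to the "CPM". */
void queue_buffer(struct ring *r)
{
    assert(r->cur_tx >= 0 && r->cur_tx < RING);
    struct bd *b = &r->bd[r->cur_tx];
    if (b->has_skb)
        r->leaked++;  /* old buffer was never freed: its pointer is lost */
    b->has_skb = true;
    b->ready = true;
    r->cur_tx = (r->cur_tx + 1) % RING;
}

/* Original logic: decide "full" by reading the next descriptor's ready bit. */
void xmit_old(struct ring *r, bool cpm_races)
{
    int oldest = r->dirty_tx;
    queue_buffer(r);
    if (cpm_races)
        r->bd[oldest].ready = false;  /* CPM finishes before the check */
    if (r->bd[r->cur_tx].ready)
        r->tx_full = true;
}

void irq_old(struct ring *r)
{
    int i = r->dirty_tx;
    while (!r->bd[i].ready) {
        if (i == r->cur_tx && !r->tx_full)
            break;  /* ring looks empty, nothing is freed */
        r->bd[i].has_skb = false;
        r->tx_full = false;
        i = (i + 1) % RING;
    }
    r->dirty_tx = i;
}

/* Patched logic: count free descriptors in software only. */
void xmit_new(struct ring *r, bool cpm_races)
{
    int oldest = r->dirty_tx;
    if (r->tx_free == 0)
        return;  /* queue would be stopped here */
    queue_buffer(r);
    if (cpm_races)
        r->bd[oldest].ready = false;
    r->tx_free--;
}

void irq_new(struct ring *r)
{
    int i = r->dirty_tx;
    while (!r->bd[i].ready) {
        if (r->tx_free == RING)
            break;  /* every descriptor is already accounted as free */
        r->bd[i].has_skb = false;
        r->tx_free++;
        i = (i + 1) % RING;
    }
    r->dirty_tx = i;
}

/* Fill the ring, let the CPM win the race, service the interrupt,
 * then transmit once more. Returns the number of leaked buffers. */
int run_scenario(bool patched)
{
    struct ring r = { .tx_free = RING };
    if (patched) {
        xmit_new(&r, false); xmit_new(&r, true); irq_new(&r); xmit_new(&r, false);
    } else {
        xmit_old(&r, false); xmit_old(&r, true); irq_old(&r); xmit_old(&r, false);
    }
    return r.leaked;
}
```

In this model run_scenario(false) loses one buffer, because the interrupt handler sees cur_tx equal to dirty_tx with tx_full clear and frees nothing, while run_scenario(true) loses none, since the free count does not depend on when the CPM clears the ready bit.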
* RE: Kernel oops while routing
@ 2001-12-05 18:01 Jean-Denis Boyer
2001-12-05 22:41 ` Re[2]: " Ricardo Scop
0 siblings, 1 reply; 6+ messages in thread
From: Jean-Denis Boyer @ 2001-12-05 18:01 UTC (permalink / raw)
To: 'Ricardo Scop'; +Cc: linuxppc-embedded, Dan Malek, andy_lowe
> - in fcc_enet_start_xmit, after setting up another bd and
> incrementing bdp, the next bd's tx-ready bit is tested in order
> to stop the xmit queue if it is set, ok? But, sometimes, the CPM
> may already have cleared this bit _and_ the corresponding
> interrupt has not been serviced yet (because we're in a
> spin_lock_irq); so, netif_stop_queue is not called in this case,
> nor is tx_full set;
>
> - next, the interrupt is serviced, but then cur_tx equals
> dirty_tx _and_ tx_full is not set, so no sk_buffers are freed!
Yes! I totally agree with you: checking the ready bit in the buffer
descriptor is not reliable, even if the interrupts are masked, since the
CPM doesn't suspend its processing.
I have done many tests between two of our custom boards, which use an 8260
and a single FCC. I could indeed see a memory leak.
IMHO, there is an easier patch that would modify only one line of code,
without changing the 'tx_full' logic. In function fcc_enet_start_xmit,
instead of checking the ready bit (which is bad), we could simply check
whether cur_tx has reached dirty_tx, and if so call netif_stop_queue.
Does it make sense?
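A minimal sketch of the check suggested above, in the same spirit as a user-space model rather than driver code: the ring is declared full when cur_tx catches up with dirty_tx, so the decision no longer depends on a bit the CPM can clear at any moment. The two-entry ring, the has_skb flags and the function names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define RING 2  /* hypothetical tiny ring */

struct txring {
    bool ready[RING];    /* ready bit, cleared by the "CPM" */
    bool has_skb[RING];  /* a buffer is still attached */
    int cur_tx, dirty_tx;
    bool tx_full;
    int leaked;
};

/* Transmit path: compare ring indices instead of reading the ready bit. */
void xmit(struct txring *r)
{
    assert(!r->tx_full);  /* callers must not transmit on a stopped queue */
    if (r->has_skb[r->cur_tx])
        r->leaked++;      /* would overwrite a buffer that was never freed */
    r->has_skb[r->cur_tx] = true;
    r->ready[r->cur_tx] = true;
    r->cur_tx = (r->cur_tx + 1) % RING;
    if (r->cur_tx == r->dirty_tx)
        r->tx_full = true;  /* the driver would call netif_stop_queue here */
}

/* Interrupt path: unchanged tx_full logic. */
void tx_irq(struct txring *r)
{
    int i = r->dirty_tx;
    while (!r->ready[i]) {
        if (i == r->cur_tx && !r->tx_full)
            break;
        r->has_skb[i] = false;
        r->tx_full = false;
        i = (i + 1) % RING;
    }
    r->dirty_tx = i;
}

/* Fill the ring, let the CPM complete the oldest descriptor, service the
 * interrupt and transmit again. Returns the number of leaked buffers. */
int leaked_after_race(void)
{
    struct txring r = { 0 };
    xmit(&r);
    xmit(&r);            /* cur_tx wraps onto dirty_tx: tx_full is set */
    r.ready[0] = false;  /* CPM clears the bit, at whatever moment */
    tx_irq(&r);
    xmit(&r);
    return r.leaked;
}
```

In this model leaked_after_race() returns 0: because tx_full was set from the index comparison, the interrupt handler frees the completed descriptor before it is reused.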
BTW, I worked hard last week on debugging the fcc_enet driver. It was not
correctly handling some transmission errors, resulting in the transmitter
stopping completely, without restarting. This is related to an erratum
(CPM37) from Motorola about the 8260, concerning the way of restarting the
transmitter. If someone is interested, I can release a patch for that.
--------------------------------------------
Jean-Denis Boyer, B.Eng., Technical Leader
Mediatrix Telecom Inc.
4229 Garlock Street
Sherbrooke (Québec)
J1L 2C8 CANADA
(819)829-8749 x241
--------------------------------------------
* Re[2]: Kernel oops while routing
2001-12-05 18:01 Jean-Denis Boyer
@ 2001-12-05 22:41 ` Ricardo Scop
0 siblings, 0 replies; 6+ messages in thread
From: Ricardo Scop @ 2001-12-05 22:41 UTC (permalink / raw)
To: Jean-Denis Boyer; +Cc: linuxppc-embedded, Dan Malek, andy_lowe
Jean-Denis,
Wednesday, December 05, 2001, 3:01:28 PM, you wrote:
JDB> IMHO, I could suggest an easier patch, that would result in modifying only
JDB> one line of code, without changing the 'tx_full' logic. In function
JDB> fcc_enet_start_xmit, instead of checking the ready bit (which is bad), we
JDB> could only check if cur_tx has reached dirty_tx, and then call
JDB> netif_stop_queue. Does it make sense?
Makes sense to me. I'll try it out.
JDB> BTW, I worked hard last week in debugging the fcc_enet driver. It was not
JDB> handling correctly some transmission errors, resulting in the transmitter
JDB> completely stopping, without restarting. This is related to an errata
JDB> (CPM37) from Motorola about the 8260, concerning the way of restarting the
JDB> transmitter. If someone is interested, I can release a patch for that.
I'm interested!
Ricardo Scop
mailto:scop@vanet.com.br
R SCOP Consulting