* looking for help interpreting softlockup/stack trace
@ 2008-08-05 18:25 Chris Friesen
2008-08-05 19:11 ` Kaz Kylheku
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Chris Friesen @ 2008-08-05 18:25 UTC (permalink / raw)
To: linux-mips, ralf
I've run into an interesting issue with an Octeon-based board, where it
just seems to hang. I suspect we're hitting some kind of locking bug,
and I'm trying to track it down. If it matters, the kernel is quite old
(heavily patched 2.6.14) and I've got no chance of upgrading it. (The
usual embedded scenario.)
I've added some scheduler instrumentation, as well as adding a stack
dump to the output of the softlockup code.
In the trace below, is "epc" the program counter at the time of the
timer interrupt? How does "ra" fit into this, given that the function
whose address it contains isn't seen in the stack trace until quite a
ways down?
Any insights are greatly appreciated...
Chris
cpu9: jiffies: 4295366417, hrtime: 552015654907, 500 ms without calling
schedule() since scheduler requested
cpu1: jiffies: 4295366418, hrtime: 552016450325, 500 ms without calling
schedule() since scheduler requested
cpu3: jiffies: 4295366442, hrtime: 552034584838, 500 ms without calling
schedule() since scheduler requested
cpu9: jiffies: 4295367258, hrtime: 552646827014, 841 ms between
scheduler_tick() calls
BUG: soft lockup detected on CPU#0!
Cpu 0
$ 0 : 0000000000000000 0000000000000000 ffffffff80000000 ffffffff00000000
$ 4 : a8000000db3ce808 0000000000000084 a8000000d34cb1ea a8000000d34cb1ea
$ 8 : 0000000045000084 0000000000000002 0000000000000001 ffffffff8165e1e0
$12 : 0000000000000000 ffffffff812dedac 0000500000008000 0000100000000000
$16 : a8000000d3a40b68 a8000000db3ce580 a8000000d3a40b00 ffffffff8151e448
$20 : a8000000db3ce808 ffffffff81593808 a8000000d3a40b00 0000000000000000
$24 : 0000000000000030 c000000001ae1820
$28 : ffffffff81590000 ffffffff81593680 ffffffff81593810 c000000001b4ab9c
Hi : 0000000000000000
Lo : 0000000000000000
epc : ffffffff8151e44c _read_lock+0x4/0x20 Tainted: P
ra : c000000001b4ab9c $LBB378+0x0/0x18 [vnb]
Status: 1000ffe3 KX SX UX KERNEL EXL IE
Cause : 40808000
PrId : 000d0301
Call Trace:
[<ffffffff811a251c>] softlockup_tick+0x12c/0x158
[<ffffffff811126dc>] timer_interrupt+0xd4/0x4c8
[<ffffffff81112890>] timer_interrupt+0x288/0x4c8
[<ffffffff81101d50>] octeon_main_timer_interrupt+0x78/0x90
[<ffffffff811a2c24>] handle_IRQ_event+0x45c/0x1528
[<ffffffff811a3dc4>] __do_IRQ+0xd4/0x230
[<ffffffff8151e448>] _read_lock+0x0/0x20
[<ffffffff8110c4a8>] do_IRQ+0x440/0x1520
[<ffffffff8139263c>] neigh_lookup+0xb4/0x128
[<ffffffff8151e448>] _read_lock+0x0/0x20
[<ffffffff8110a430>] ret_from_irq+0x0/0x10
[<ffffffff8114fbdc>] __wake_up_common+0x6c/0xb8
[<ffffffff812dedac>] memset_partial+0x48/0x60
[<ffffffff8151e448>] _read_lock+0x0/0x20
[<c000000001ae1820>] $Ltext0+0x0/0x4 [vnb]
[<ffffffff8151e618>] _read_unlock+0x0/0x20
[<c000000001b4ab9c>] $LBB378+0x0/0x18 [vnb]
[<ffffffff8151e44c>] _read_lock+0x4/0x20
[<ffffffff81407fcc>] arp_rcv+0x18c/0x258
[<ffffffff8138b84c>] netif_dispatch_skb+0x18c/0x2d0
[<c000000000861208>] $LBE370+0x0/0x10 [cavium_ethernet]
[<c000000001ae2acc>] $L396+0x0/0x1c [vnb]
[<c000000001b47ba8>] $LCFI35+0x8c/0x9c [vnb]
[<c000000001adb36c>] $L1193+0x2c/0x48 [vnb]
[<c000000001b3dc5c>] $L1047+0x28/0xd0 [vnb]
[<c000000001b3d6f0>] $LBB344+0x2c/0x34 [vnb]
[<ffffffff811476a8>] activate_task+0x98/0x108
[<c000000001adb4cc>] $L1215+0x2c/0x30 [vnb]
[<c000000001adb7c0>] $L1273+0x0/0x18 [vnb]
[<c000000001ae18a8>] $L28+0x24/0x4c [vnb]
[<ffffffff8138bc90>] netif_receive_skb+0x300/0x730
[<ffffffff8138130c>] __alloc_skb+0x84/0x1a8
[<c000000000861970>] $L620+0x20/0x28 [cavium_ethernet]
[<ffffffff811734e8>] process_timeout+0x0/0xb08
[<ffffffff81111cc0>] do_gettimeofday+0xa8/0x220
[<ffffffff81169b2c>] tasklet_action+0x684/0x1950
[<ffffffff81187b9c>] hrtimer_run_queues+0x84/0x1a0
[<c0000000008613d8>] cvm_oct_tasklet_rx+0x0/0x4 [cavium_ethernet]
[<ffffffff81171ce4>] run_timer_softirq+0x20c/0xce0
[<c0000000008613d8>] cvm_oct_tasklet_rx+0x0/0x4 [cavium_ethernet]
[<ffffffff81112890>] timer_interrupt+0x288/0x4c8
[<ffffffff81167afc>] __do_softirq+0x56c/0x1818
[<ffffffff81168e40>] do_softirq+0x98/0xb8
[<ffffffff8110c874>] do_IRQ+0x80c/0x1520
[<ffffffff8110dd98>] kernel_thread+0xa0/0xbf8
[<ffffffff8110a430>] ret_from_irq+0x0/0x10
[<ffffffff8151bef4>] schedule+0xd4c/0x1a98
[<ffffffff81109ec0>] r4k_wait+0x0/0x10
[<ffffffff813859b0>] datagram_poll+0x0/0x140
[<ffffffff8110d974>] cpu_idle+0x44/0x60
[<ffffffff81109ec8>] r4k_wait+0x8/0x10
[<ffffffff81611e70>] start_kernel+0x460/0x4f8
[<ffffffff81611e68>] start_kernel+0x458/0x4f8
[<ffffffff81611000>] kernel_entry+0x0/0x4c
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: looking for help interpreting softlockup/stack trace
2008-08-05 18:25 looking for help interpreting softlockup/stack trace Chris Friesen
@ 2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 21:38 ` Chris Friesen
2008-08-05 19:16 ` Ralf Baechle
2008-08-05 20:28 ` Chad Reese
2 siblings, 2 replies; 12+ messages in thread
From: Kaz Kylheku @ 2008-08-05 19:11 UTC (permalink / raw)
To: Chris Friesen, linux-mips, ralf
Chris Friesem wrote:
> I've run into an interesting issue with an Octeon-based
> board, where it
> just seems to hang. I suspect we're hitting some kind of
> locking bug,
> and I'm trying to track it down. If it matters, the kernel
> is quite old
> (heavily patched 2.6.14) and I've got no chance of upgrading
> it. (The
> usual embedded scenario.)
>
> I've added some scheduler instrumentation, as well as adding a stack
> dump to the output of the softlockup code.
>
> In the trace below, is "epc" the program counter at the time of the
> timer interrupt? How does "ra" fit into this, given that the function
> whose address it contains isn't seen in the stack trace until quite a
> ways down?
Hi Chris
The contents of RA may or may not be useful. RA is used to hold the
return address when some branch instructions are executed. The contents
of RA may provide some kind of clue about what the
processor was doing previously. Often, RA can give you the parent
function, but not always.
Here it looks it does give the parent function; it is apparently
inside a kernel module called "vnb", which called _read_lock
and got stuck there.
Mind you, these Linux MIPS stack traces are not completely
trustworthy; the routine just walks the stack one word at a time
and prints out anything that looks like the address of code.
So the stack trace may include stale stack data from previous
call chains that have already returned. It also might not
include the full call chain, because the return address is not
always stored on the stack. For instance, here it looks like
RA is pointing into a kernel module. But you don't actually
see this in the fake stack trace. Nowhere in the stack trace
do you see _read_lock being called by c000000001b4ab9c.
I would proceed by inspecting this vnb module and its use of locks.
>
> Any insights are greatly appreciated...
>
> Chris
>
> cpu9: jiffies: 4295366417, hrtime: 552015654907, 500 ms
> without calling
> schedule() since scheduler requested
> cpu1: jiffies: 4295366418, hrtime: 552016450325, 500 ms
> without calling
> schedule() since scheduler requested
> cpu3: jiffies: 4295366442, hrtime: 552034584838, 500 ms
> without calling
> schedule() since scheduler requested
> cpu9: jiffies: 4295367258, hrtime: 552646827014, 841 ms between
> scheduler_tick() calls BUG: soft lockup detected on CPU#0!
> Cpu 0
> $ 0 : 0000000000000000 0000000000000000 ffffffff80000000
> ffffffff00000000 $ 4 : a8000000db3ce808 0000000000000084
> a8000000d34cb1ea a8000000d34cb1ea $ 8 : 0000000045000084
> 0000000000000002 0000000000000001 ffffffff8165e1e0 $12 :
> 0000000000000000 ffffffff812dedac 0000500000008000 0000100000000000
> $16 : a8000000d3a40b68 a8000000db3ce580 a8000000d3a40b00
> ffffffff8151e448 $20 : a8000000db3ce808 ffffffff81593808
> a8000000d3a40b00 0000000000000000 $24 : 0000000000000030
> c000000001ae1820 $28 : ffffffff81590000 ffffffff81593680
> ffffffff81593810 c000000001b4ab9c Hi : 0000000000000000
> Lo : 0000000000000000
> epc : ffffffff8151e44c _read_lock+0x4/0x20 Tainted: P
> ra : c000000001b4ab9c $LBB378+0x0/0x18 [vnb]
> Status: 1000ffe3 KX SX UX KERNEL EXL IE
> Cause : 40808000
> PrId : 000d0301
> Call Trace:
> [<ffffffff811a251c>] softlockup_tick+0x12c/0x158
> [<ffffffff811126dc>] timer_interrupt+0xd4/0x4c8
> [<ffffffff81112890>] timer_interrupt+0x288/0x4c8
> [<ffffffff81101d50>] octeon_main_timer_interrupt+0x78/0x90
> [<ffffffff811a2c24>] handle_IRQ_event+0x45c/0x1528
> [<ffffffff811a3dc4>] __do_IRQ+0xd4/0x230
> [<ffffffff8151e448>] _read_lock+0x0/0x20
This read lock is almost certainly a red herring; it's
stuff left over on the stack that hasn't been overwritten
by new activation chains.
> [<ffffffff8110c4a8>] do_IRQ+0x440/0x1520
This is the timer interrupt going off, which ends up
detecting the lockup. So the problem is above here.
> [<ffffffff8139263c>] neigh_lookup+0xb4/0x128
This may also be garbage. Basically, the stack
trace is not that useful, except that it provides
some additional circumstantial evidence implicating
the vnb module.
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: looking for help interpreting softlockup/stack trace
2008-08-05 19:11 ` Kaz Kylheku
@ 2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 21:38 ` Chris Friesen
1 sibling, 0 replies; 12+ messages in thread
From: Kaz Kylheku @ 2008-08-05 19:11 UTC (permalink / raw)
To: Chris Friesen, linux-mips, ralf
Chris Friesem wrote:
> I've run into an interesting issue with an Octeon-based
> board, where it
> just seems to hang. I suspect we're hitting some kind of
> locking bug,
> and I'm trying to track it down. If it matters, the kernel
> is quite old
> (heavily patched 2.6.14) and I've got no chance of upgrading
> it. (The
> usual embedded scenario.)
>
> I've added some scheduler instrumentation, as well as adding a stack
> dump to the output of the softlockup code.
>
> In the trace below, is "epc" the program counter at the time of the
> timer interrupt? How does "ra" fit into this, given that the function
> whose address it contains isn't seen in the stack trace until quite a
> ways down?
Hi Chris
The contents of RA may or may not be useful. RA is used to hold the
return address when some branch instructions are executed. The contents
of RA may provide some kind of clue about what the
processor was doing previously. Often, RA can give you the parent
function, but not always.
Here it looks it does give the parent function; it is apparently
inside a kernel module called "vnb", which called _read_lock
and got stuck there.
Mind you, these Linux MIPS stack traces are not completely
trustworthy; the routine just walks the stack one word at a time
and prints out anything that looks like the address of code.
So the stack trace may include stale stack data from previous
call chains that have already returned. It also might not
include the full call chain, because the return address is not
always stored on the stack. For instance, here it looks like
RA is pointing into a kernel module. But you don't actually
see this in the fake stack trace. Nowhere in the stack trace
do you see _read_lock being called by c000000001b4ab9c.
I would proceed by inspecting this vnb module and its use of locks.
>
> Any insights are greatly appreciated...
>
> Chris
>
> cpu9: jiffies: 4295366417, hrtime: 552015654907, 500 ms
> without calling
> schedule() since scheduler requested
> cpu1: jiffies: 4295366418, hrtime: 552016450325, 500 ms
> without calling
> schedule() since scheduler requested
> cpu3: jiffies: 4295366442, hrtime: 552034584838, 500 ms
> without calling
> schedule() since scheduler requested
> cpu9: jiffies: 4295367258, hrtime: 552646827014, 841 ms between
> scheduler_tick() calls BUG: soft lockup detected on CPU#0!
> Cpu 0
> $ 0 : 0000000000000000 0000000000000000 ffffffff80000000
> ffffffff00000000 $ 4 : a8000000db3ce808 0000000000000084
> a8000000d34cb1ea a8000000d34cb1ea $ 8 : 0000000045000084
> 0000000000000002 0000000000000001 ffffffff8165e1e0 $12 :
> 0000000000000000 ffffffff812dedac 0000500000008000 0000100000000000
> $16 : a8000000d3a40b68 a8000000db3ce580 a8000000d3a40b00
> ffffffff8151e448 $20 : a8000000db3ce808 ffffffff81593808
> a8000000d3a40b00 0000000000000000 $24 : 0000000000000030
> c000000001ae1820 $28 : ffffffff81590000 ffffffff81593680
> ffffffff81593810 c000000001b4ab9c Hi : 0000000000000000
> Lo : 0000000000000000
> epc : ffffffff8151e44c _read_lock+0x4/0x20 Tainted: P
> ra : c000000001b4ab9c $LBB378+0x0/0x18 [vnb]
> Status: 1000ffe3 KX SX UX KERNEL EXL IE
> Cause : 40808000
> PrId : 000d0301
> Call Trace:
> [<ffffffff811a251c>] softlockup_tick+0x12c/0x158
> [<ffffffff811126dc>] timer_interrupt+0xd4/0x4c8
> [<ffffffff81112890>] timer_interrupt+0x288/0x4c8
> [<ffffffff81101d50>] octeon_main_timer_interrupt+0x78/0x90
> [<ffffffff811a2c24>] handle_IRQ_event+0x45c/0x1528
> [<ffffffff811a3dc4>] __do_IRQ+0xd4/0x230
> [<ffffffff8151e448>] _read_lock+0x0/0x20
This read lock is almost certainly a red herring; it's
stuff left over on the stack that hasn't been overwritten
by new activation chains.
> [<ffffffff8110c4a8>] do_IRQ+0x440/0x1520
This is the timer interrupt going off, which ends up
detecting the lockup. So the problem is above here.
> [<ffffffff8139263c>] neigh_lookup+0xb4/0x128
This may also be garbage. Basically, the stack
trace is not that useful, except that it provides
some additional circumstantial evidence implicating
the vnb module.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-05 18:25 looking for help interpreting softlockup/stack trace Chris Friesen
2008-08-05 19:11 ` Kaz Kylheku
@ 2008-08-05 19:16 ` Ralf Baechle
2008-08-06 18:35 ` Chris Friesen
2008-08-06 18:57 ` David Daney
2008-08-05 20:28 ` Chad Reese
2 siblings, 2 replies; 12+ messages in thread
From: Ralf Baechle @ 2008-08-05 19:16 UTC (permalink / raw)
To: Chris Friesen; +Cc: linux-mips
On Tue, Aug 05, 2008 at 12:25:02PM -0600, Chris Friesen wrote:
> I've run into an interesting issue with an Octeon-based board, where it
> just seems to hang. I suspect we're hitting some kind of locking bug,
> and I'm trying to track it down. If it matters, the kernel is quite old
> (heavily patched 2.6.14) and I've got no chance of upgrading it. (The
> usual embedded scenario.)
>
> I've added some scheduler instrumentation, as well as adding a stack
> dump to the output of the softlockup code.
>
> In the trace below, is "epc" the program counter at the time of the
> timer interrupt? How does "ra" fit into this, given that the function
> whose address it contains isn't seen in the stack trace until quite a
> ways down?
$LBB378 is an internal symbol. The value of RA may not be very informative
if it was overwritten by a random subroutine call.
> Any insights are greatly appreciated...
You may also try lockdep; it gives much more detailed information though
it's more heavyweight.
Ralf
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-05 18:25 looking for help interpreting softlockup/stack trace Chris Friesen
2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 19:16 ` Ralf Baechle
@ 2008-08-05 20:28 ` Chad Reese
2008-08-05 20:46 ` Chris Friesen
2 siblings, 1 reply; 12+ messages in thread
From: Chad Reese @ 2008-08-05 20:28 UTC (permalink / raw)
To: Chris Friesen; +Cc: linux-mips, ralf
I assume this is happening under high network load. Some early Octeon
ethernet drivers had a problem where they could starve the rest of the
system processing incoming packets. The message you are getting is the
kernel warning you that userspace hasn't been given any processing time.
What is probably happening is the cavium_ethernet driver is spending all
its time in a receive tasklet getting packets and then dropping them.
This has been fixed in a later cavium SDK, but it looks like you are
running an ancient kernel. A newer kernel is available on the cavium
support site.
The cavium ethernet driver doesn't use the standard NAPI interface since
it doesn't support multicore receive for a single port.
Chad
Chris Friesen wrote:
> I've run into an interesting issue with an Octeon-based board, where it
> just seems to hang. I suspect we're hitting some kind of locking bug,
> and I'm trying to track it down. If it matters, the kernel is quite old
> (heavily patched 2.6.14) and I've got no chance of upgrading it. (The
> usual embedded scenario.)
>
> I've added some scheduler instrumentation, as well as adding a stack
> dump to the output of the softlockup code.
>
> In the trace below, is "epc" the program counter at the time of the
> timer interrupt? How does "ra" fit into this, given that the function
> whose address it contains isn't seen in the stack trace until quite a
> ways down?
>
> Any insights are greatly appreciated...
>
> Chris
>
> cpu9: jiffies: 4295366417, hrtime: 552015654907, 500 ms without calling
> schedule() since scheduler requested
> cpu1: jiffies: 4295366418, hrtime: 552016450325, 500 ms without calling
> schedule() since scheduler requested
> cpu3: jiffies: 4295366442, hrtime: 552034584838, 500 ms without calling
> schedule() since scheduler requested
> cpu9: jiffies: 4295367258, hrtime: 552646827014, 841 ms between
> scheduler_tick() calls
> BUG: soft lockup detected on CPU#0!
> Cpu 0
> $ 0 : 0000000000000000 0000000000000000 ffffffff80000000 ffffffff00000000
> $ 4 : a8000000db3ce808 0000000000000084 a8000000d34cb1ea a8000000d34cb1ea
> $ 8 : 0000000045000084 0000000000000002 0000000000000001 ffffffff8165e1e0
> $12 : 0000000000000000 ffffffff812dedac 0000500000008000 0000100000000000
> $16 : a8000000d3a40b68 a8000000db3ce580 a8000000d3a40b00 ffffffff8151e448
> $20 : a8000000db3ce808 ffffffff81593808 a8000000d3a40b00 0000000000000000
> $24 : 0000000000000030 c000000001ae1820
> $28 : ffffffff81590000 ffffffff81593680 ffffffff81593810 c000000001b4ab9c
> Hi : 0000000000000000
> Lo : 0000000000000000
> epc : ffffffff8151e44c _read_lock+0x4/0x20 Tainted: P
> ra : c000000001b4ab9c $LBB378+0x0/0x18 [vnb]
> Status: 1000ffe3 KX SX UX KERNEL EXL IE
> Cause : 40808000
> PrId : 000d0301
> Call Trace:
> [<ffffffff811a251c>] softlockup_tick+0x12c/0x158
> [<ffffffff811126dc>] timer_interrupt+0xd4/0x4c8
> [<ffffffff81112890>] timer_interrupt+0x288/0x4c8
> [<ffffffff81101d50>] octeon_main_timer_interrupt+0x78/0x90
> [<ffffffff811a2c24>] handle_IRQ_event+0x45c/0x1528
> [<ffffffff811a3dc4>] __do_IRQ+0xd4/0x230
> [<ffffffff8151e448>] _read_lock+0x0/0x20
> [<ffffffff8110c4a8>] do_IRQ+0x440/0x1520
> [<ffffffff8139263c>] neigh_lookup+0xb4/0x128
> [<ffffffff8151e448>] _read_lock+0x0/0x20
> [<ffffffff8110a430>] ret_from_irq+0x0/0x10
> [<ffffffff8114fbdc>] __wake_up_common+0x6c/0xb8
> [<ffffffff812dedac>] memset_partial+0x48/0x60
> [<ffffffff8151e448>] _read_lock+0x0/0x20
> [<c000000001ae1820>] $Ltext0+0x0/0x4 [vnb]
> [<ffffffff8151e618>] _read_unlock+0x0/0x20
> [<c000000001b4ab9c>] $LBB378+0x0/0x18 [vnb]
> [<ffffffff8151e44c>] _read_lock+0x4/0x20
> [<ffffffff81407fcc>] arp_rcv+0x18c/0x258
> [<ffffffff8138b84c>] netif_dispatch_skb+0x18c/0x2d0
> [<c000000000861208>] $LBE370+0x0/0x10 [cavium_ethernet]
> [<c000000001ae2acc>] $L396+0x0/0x1c [vnb]
> [<c000000001b47ba8>] $LCFI35+0x8c/0x9c [vnb]
> [<c000000001adb36c>] $L1193+0x2c/0x48 [vnb]
> [<c000000001b3dc5c>] $L1047+0x28/0xd0 [vnb]
> [<c000000001b3d6f0>] $LBB344+0x2c/0x34 [vnb]
> [<ffffffff811476a8>] activate_task+0x98/0x108
> [<c000000001adb4cc>] $L1215+0x2c/0x30 [vnb]
> [<c000000001adb7c0>] $L1273+0x0/0x18 [vnb]
> [<c000000001ae18a8>] $L28+0x24/0x4c [vnb]
> [<ffffffff8138bc90>] netif_receive_skb+0x300/0x730
> [<ffffffff8138130c>] __alloc_skb+0x84/0x1a8
> [<c000000000861970>] $L620+0x20/0x28 [cavium_ethernet]
> [<ffffffff811734e8>] process_timeout+0x0/0xb08
> [<ffffffff81111cc0>] do_gettimeofday+0xa8/0x220
> [<ffffffff81169b2c>] tasklet_action+0x684/0x1950
> [<ffffffff81187b9c>] hrtimer_run_queues+0x84/0x1a0
> [<c0000000008613d8>] cvm_oct_tasklet_rx+0x0/0x4 [cavium_ethernet]
> [<ffffffff81171ce4>] run_timer_softirq+0x20c/0xce0
> [<c0000000008613d8>] cvm_oct_tasklet_rx+0x0/0x4 [cavium_ethernet]
> [<ffffffff81112890>] timer_interrupt+0x288/0x4c8
> [<ffffffff81167afc>] __do_softirq+0x56c/0x1818
> [<ffffffff81168e40>] do_softirq+0x98/0xb8
> [<ffffffff8110c874>] do_IRQ+0x80c/0x1520
> [<ffffffff8110dd98>] kernel_thread+0xa0/0xbf8
> [<ffffffff8110a430>] ret_from_irq+0x0/0x10
> [<ffffffff8151bef4>] schedule+0xd4c/0x1a98
> [<ffffffff81109ec0>] r4k_wait+0x0/0x10
> [<ffffffff813859b0>] datagram_poll+0x0/0x140
> [<ffffffff8110d974>] cpu_idle+0x44/0x60
> [<ffffffff81109ec8>] r4k_wait+0x8/0x10
> [<ffffffff81611e70>] start_kernel+0x460/0x4f8
> [<ffffffff81611e68>] start_kernel+0x458/0x4f8
> [<ffffffff81611000>] kernel_entry+0x0/0x4c
>
--
Chad Reese <kreese@caviumnetworks.com>
Cavium Networks
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-05 20:28 ` Chad Reese
@ 2008-08-05 20:46 ` Chris Friesen
0 siblings, 0 replies; 12+ messages in thread
From: Chris Friesen @ 2008-08-05 20:46 UTC (permalink / raw)
To: Chad Reese; +Cc: linux-mips, ralf
Chad Reese wrote:
> I assume this is happening under high network load. Some early Octeon
> ethernet drivers had a problem where they could starve the rest of the
> system processing incoming packets. The message you are getting is the
> kernel warning you that userspace hasn't been given any processing time.
> What is probably happening is the cavium_ethernet driver is spending all
> its time in a receive tasklet getting packets and then dropping them.
> This has been fixed in a later cavium SDK, but it looks like you are
> running an ancient kernel. A newer kernel is available on the cavium
> support site.
It seems unlikely it's due to cpu load. It's only receiving around
30000 packets/sec total, and before the problem shows up "top" shows
some cores at around 25% and others totally idle. The only symptom of
the problem is that the system simply stops responding.
You wouldn't happen to know which versions fixed the problem you
described above, would you? Upgrading the kernel isn't an option, but
it wouldn't be the first time we've had to backport things.
> The cavium ethernet driver doesn't use the standard NAPI interface since
> it doesn't support multicore receive for a single port.
Ah...I was wondering about that.
Chris
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 19:11 ` Kaz Kylheku
@ 2008-08-05 21:38 ` Chris Friesen
1 sibling, 0 replies; 12+ messages in thread
From: Chris Friesen @ 2008-08-05 21:38 UTC (permalink / raw)
To: Kaz Kylheku; +Cc: linux-mips, ralf
Kaz Kylheku wrote:
> Chris Friesem wrote:
>>In the trace below, is "epc" the program counter at the time of the
>>timer interrupt? How does "ra" fit into this, given that the function
>>whose address it contains isn't seen in the stack trace until quite a
>>ways down?
> Mind you, these Linux MIPS stack traces are not completely
> trustworthy; the routine just walks the stack one word at a time
> and prints out anything that looks like the address of code.
> So the stack trace may include stale stack data from previous
> call chains that have already returned. It also might not
> include the full call chain, because the return address is not
> always stored on the stack. For instance, here it looks like
> RA is pointing into a kernel module. But you don't actually
> see this in the fake stack trace. Nowhere in the stack trace
> do you see _read_lock being called by c000000001b4ab9c.
Right...but it does look like it calls _read_unlock().
> I would proceed by inspecting this vnb module and its use of locks.
Yeah...I'll do that.
>>scheduler_tick() calls BUG: soft lockup detected on CPU#0!
One thing I've noticed, all the softlockups are on cpu0 even though
other cpus are also complaining about scheduling delays. Is there
something special about cpu0?
>>Call Trace:
>> [<ffffffff811a251c>] softlockup_tick+0x12c/0x158
>> [<ffffffff811126dc>] timer_interrupt+0xd4/0x4c8
>> [<ffffffff81112890>] timer_interrupt+0x288/0x4c8
>> [<ffffffff81101d50>] octeon_main_timer_interrupt+0x78/0x90
>> [<ffffffff811a2c24>] handle_IRQ_event+0x45c/0x1528
>> [<ffffffff811a3dc4>] __do_IRQ+0xd4/0x230
>> [<ffffffff8151e448>] _read_lock+0x0/0x20
>
>
> This read lock is almost certainly a red herring; it's
> stuff left over on the stack that hasn't been overwritten
> by new activation chains.
Wouldn't it be part of the do_IRQ call below?
>> [<ffffffff8110c4a8>] do_IRQ+0x440/0x1520
>
>
> This is the timer interrupt going off, which ends up
> detecting the lockup. So the problem is above here.
I assume you mean "above" in terms of chronological order, since in
terms of the call trace the problem would be below.
>> [<ffffffff8139263c>] neigh_lookup+0xb4/0x128
>
>
> This may also be garbage. Basically, the stack
> trace is not that useful, except that it provides
> some additional circumstantial evidence implicating
> the vnb module.
It's unfortunate that the call trace isn't accurate.
Thanks for the help,
Chris
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-05 19:16 ` Ralf Baechle
@ 2008-08-06 18:35 ` Chris Friesen
2008-08-06 18:57 ` David Daney
1 sibling, 0 replies; 12+ messages in thread
From: Chris Friesen @ 2008-08-06 18:35 UTC (permalink / raw)
To: Ralf Baechle; +Cc: linux-mips
Ralf Baechle wrote:
> $LBB378 is an internal symbol. The value of RA may not be very informative
> if it was overwritten by a random subroutine call.
What determines which function names get compiled into the object file,
and which ones get these numerical symbols?
> You may also try lockdep; it gives much more detailed information though
> it's more heavyweight.
I would love to be able to use lockdep...however I'm stuck on an old kernel.
Chris
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-05 19:16 ` Ralf Baechle
2008-08-06 18:35 ` Chris Friesen
@ 2008-08-06 18:57 ` David Daney
2008-08-07 13:43 ` Ralf Baechle
1 sibling, 1 reply; 12+ messages in thread
From: David Daney @ 2008-08-06 18:57 UTC (permalink / raw)
To: Ralf Baechle; +Cc: Chris Friesen, linux-mips
Ralf Baechle wrote:
> On Tue, Aug 05, 2008 at 12:25:02PM -0600, Chris Friesen wrote:
>
>> I've run into an interesting issue with an Octeon-based board, where it
>> just seems to hang. I suspect we're hitting some kind of locking bug,
>> and I'm trying to track it down. If it matters, the kernel is quite old
>> (heavily patched 2.6.14) and I've got no chance of upgrading it. (The
>> usual embedded scenario.)
>>
>> I've added some scheduler instrumentation, as well as adding a stack
>> dump to the output of the softlockup code.
>>
>> In the trace below, is "epc" the program counter at the time of the
>> timer interrupt? How does "ra" fit into this, given that the function
>> whose address it contains isn't seen in the stack trace until quite a
>> ways down?
>
> $LBB378 is an internal symbol. The value of RA may not be very informative
> if it was overwritten by a random subroutine call.
>
I have thought about eliminating these internal labels when the module's symbols are read. Would this make any sense?
David Daney
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-06 18:57 ` David Daney
@ 2008-08-07 13:43 ` Ralf Baechle
2008-08-07 14:19 ` Thiemo Seufer
2008-08-07 21:57 ` David Daney
0 siblings, 2 replies; 12+ messages in thread
From: Ralf Baechle @ 2008-08-07 13:43 UTC (permalink / raw)
To: David Daney; +Cc: Chris Friesen, linux-mips, Thiemo Seufer
On Wed, Aug 06, 2008 at 11:57:32AM -0700, David Daney wrote:
>>> In the trace below, is "epc" the program counter at the time of the
>>> timer interrupt? How does "ra" fit into this, given that the
>>> function whose address it contains isn't seen in the stack trace
>>> until quite a ways down?
>>
>> $LBB378 is an internal symbol. The value of RA may not be very informative
>> if it was overwritten by a random subroutine call.
>>
>
> I have thought about eliminating these internal labels when the module's symbols are read. Would this make any sense?
I think so. Maybe that could even be done when the module is linked. I
don't think there are ever any relocations against these local symbols.
Thiemo?
Ralf
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-07 13:43 ` Ralf Baechle
@ 2008-08-07 14:19 ` Thiemo Seufer
2008-08-07 21:57 ` David Daney
1 sibling, 0 replies; 12+ messages in thread
From: Thiemo Seufer @ 2008-08-07 14:19 UTC (permalink / raw)
To: Ralf Baechle; +Cc: David Daney, Chris Friesen, linux-mips
Ralf Baechle wrote:
> On Wed, Aug 06, 2008 at 11:57:32AM -0700, David Daney wrote:
>
> >>> In the trace below, is "epc" the program counter at the time of the
> >>> timer interrupt? How does "ra" fit into this, given that the
> >>> function whose address it contains isn't seen in the stack trace
> >>> until quite a ways down?
> >>
> >> $LBB378 is an internal symbol. The value of RA may not be very informative
> >> if it was overwritten by a random subroutine call.
> >>
> >
> > I have thought about eliminating these internal labels when the module's symbols are read. Would this make any sense?
>
> I think so. Maybe that could even be done when the module is linked. I
> don't think there are ever any relocations against these local symbols.
> Thiemo?
AFAIR all such symbols are branch labels or debug info labels, the module
loader should never use them. (But I don't know if kgdb uses them.)
Thiemo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: looking for help interpreting softlockup/stack trace
2008-08-07 13:43 ` Ralf Baechle
2008-08-07 14:19 ` Thiemo Seufer
@ 2008-08-07 21:57 ` David Daney
1 sibling, 0 replies; 12+ messages in thread
From: David Daney @ 2008-08-07 21:57 UTC (permalink / raw)
To: Ralf Baechle; +Cc: Chris Friesen, linux-mips, Thiemo Seufer
Ralf Baechle wrote:
> On Wed, Aug 06, 2008 at 11:57:32AM -0700, David Daney wrote:
>
>>>> In the trace below, is "epc" the program counter at the time of the
>>>> timer interrupt? How does "ra" fit into this, given that the
>>>> function whose address it contains isn't seen in the stack trace
>>>> until quite a ways down?
>>> $LBB378 is an internal symbol. The value of RA may not be very informative
>>> if it was overwritten by a random subroutine call.
>>>
>> I have thought about eliminating these internal labels when the module's symbols are read. Would this make any sense?
>
> I think so. Maybe that could even be done when the module is linked. I
> don't think there are ever any relocations against these local symbols.
> Thiemo?
>
I take it back. Most of the $LC... symbols (typically string constants) are needed for relocations. Currently I cannot find in my builds any $LB... symbols. I wonder if gcc-4.3 eliminates these.
I think the reason they show up in stack traces is that they are data pointers that have been stored on the stack that are misinterpreted as function return addresses.
I wonder what would happen if we compiled all the code with -funwind-tables and had a small DWARF2 unwinder ala GCC's C++ exception handling mechanism. That would allow exact stack traces with no runtime overhead.
David Daney
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-08-07 21:58 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-05 18:25 looking for help interpreting softlockup/stack trace Chris Friesen
2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 19:11 ` Kaz Kylheku
2008-08-05 21:38 ` Chris Friesen
2008-08-05 19:16 ` Ralf Baechle
2008-08-06 18:35 ` Chris Friesen
2008-08-06 18:57 ` David Daney
2008-08-07 13:43 ` Ralf Baechle
2008-08-07 14:19 ` Thiemo Seufer
2008-08-07 21:57 ` David Daney
2008-08-05 20:28 ` Chad Reese
2008-08-05 20:46 ` Chris Friesen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox