From mboxrd@z Thu Jan 1 00:00:00 1970 Received: with ECARTIS (v1.0.0; list linux-mips); Tue, 05 Aug 2008 19:25:25 +0100 (BST) Received: from zrtps0kp.nortel.com ([47.140.192.56]:18062 "EHLO zrtps0kp.nortel.com") by ftp.linux-mips.org with ESMTP id S20024104AbYHESZR (ORCPT ); Tue, 5 Aug 2008 19:25:17 +0100 Received: from zcarhxs1.corp.nortel.com (zcarhxs1.corp.nortel.com [47.129.230.89]) by zrtps0kp.nortel.com (Switch-2.2.6/Switch-2.2.0) with ESMTP id m75IP8X09787; Tue, 5 Aug 2008 18:25:09 GMT Received: from [47.130.64.132] ([47.130.64.132] RDNS failed) by zcarhxs1.corp.nortel.com with Microsoft SMTPSVC(6.0.3790.3959); Tue, 5 Aug 2008 14:25:06 -0400 Message-ID: <48989AFE.5000500@nortel.com> Date: Tue, 05 Aug 2008 12:25:02 -0600 From: "Chris Friesen" User-Agent: Mozilla Thunderbird 1.0.2-6 (X11/20050513) X-Accept-Language: en-us, en MIME-Version: 1.0 To: linux-mips@linux-mips.org, ralf@linux-mips.org Subject: looking for help interpreting softlockup/stack trace Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 05 Aug 2008 18:25:06.0463 (UTC) FILETIME=[95D5D2F0:01C8F728] Return-Path: X-Envelope-To: <"|/home/ecartis/ecartis -s linux-mips"> (uid 0) X-Orcpt: rfc822;linux-mips@linux-mips.org Original-Recipient: rfc822;linux-mips@linux-mips.org X-archive-position: 20116 X-ecartis-version: Ecartis v1.0.0 Sender: linux-mips-bounce@linux-mips.org Errors-to: linux-mips-bounce@linux-mips.org X-original-sender: cfriesen@nortel.com Precedence: bulk X-list: linux-mips I've run into an interesting issue with an Octeon-based board, where it just seems to hang. I suspect we're hitting some kind of locking bug, and I'm trying to track it down. If it matters, the kernel is quite old (heavily patched 2.6.14) and I've got no chance of upgrading it. (The usual embedded scenario.) I've added some scheduler instrumentation, as well as adding a stack dump to the output of the softlockup code. In the trace below, is "epc" the program counter at the time of the timer interrupt? How does "ra" fit into this, given that the function whose address it contains isn't seen in the stack trace until quite a ways down? Any insights are greatly appreciated... Chris cpu9: jiffies: 4295366417, hrtime: 552015654907, 500 ms without calling schedule() since scheduler requested cpu1: jiffies: 4295366418, hrtime: 552016450325, 500 ms without calling schedule() since scheduler requested cpu3: jiffies: 4295366442, hrtime: 552034584838, 500 ms without calling schedule() since scheduler requested cpu9: jiffies: 4295367258, hrtime: 552646827014, 841 ms between scheduler_tick() calls BUG: soft lockup detected on CPU#0! Cpu 0 $ 0 : 0000000000000000 0000000000000000 ffffffff80000000 ffffffff00000000 $ 4 : a8000000db3ce808 0000000000000084 a8000000d34cb1ea a8000000d34cb1ea $ 8 : 0000000045000084 0000000000000002 0000000000000001 ffffffff8165e1e0 $12 : 0000000000000000 ffffffff812dedac 0000500000008000 0000100000000000 $16 : a8000000d3a40b68 a8000000db3ce580 a8000000d3a40b00 ffffffff8151e448 $20 : a8000000db3ce808 ffffffff81593808 a8000000d3a40b00 0000000000000000 $24 : 0000000000000030 c000000001ae1820 $28 : ffffffff81590000 ffffffff81593680 ffffffff81593810 c000000001b4ab9c Hi : 0000000000000000 Lo : 0000000000000000 epc : ffffffff8151e44c _read_lock+0x4/0x20 Tainted: P ra : c000000001b4ab9c $LBB378+0x0/0x18 [vnb] Status: 1000ffe3 KX SX UX KERNEL EXL IE Cause : 40808000 PrId : 000d0301 Call Trace: [] softlockup_tick+0x12c/0x158 [] timer_interrupt+0xd4/0x4c8 [] timer_interrupt+0x288/0x4c8 [] octeon_main_timer_interrupt+0x78/0x90 [] handle_IRQ_event+0x45c/0x1528 [] __do_IRQ+0xd4/0x230 [] _read_lock+0x0/0x20 [] do_IRQ+0x440/0x1520 [] neigh_lookup+0xb4/0x128 [] _read_lock+0x0/0x20 [] ret_from_irq+0x0/0x10 [] __wake_up_common+0x6c/0xb8 [] memset_partial+0x48/0x60 [] _read_lock+0x0/0x20 [] $Ltext0+0x0/0x4 [vnb] [] _read_unlock+0x0/0x20 [] $LBB378+0x0/0x18 [vnb] [] _read_lock+0x4/0x20 [] arp_rcv+0x18c/0x258 [] netif_dispatch_skb+0x18c/0x2d0 [] $LBE370+0x0/0x10 [cavium_ethernet] [] $L396+0x0/0x1c [vnb] [] $LCFI35+0x8c/0x9c [vnb] [] $L1193+0x2c/0x48 [vnb] [] $L1047+0x28/0xd0 [vnb] [] $LBB344+0x2c/0x34 [vnb] [] activate_task+0x98/0x108 [] $L1215+0x2c/0x30 [vnb] [] $L1273+0x0/0x18 [vnb] [] $L28+0x24/0x4c [vnb] [] netif_receive_skb+0x300/0x730 [] __alloc_skb+0x84/0x1a8 [] $L620+0x20/0x28 [cavium_ethernet] [] process_timeout+0x0/0xb08 [] do_gettimeofday+0xa8/0x220 [] tasklet_action+0x684/0x1950 [] hrtimer_run_queues+0x84/0x1a0 [] cvm_oct_tasklet_rx+0x0/0x4 [cavium_ethernet] [] run_timer_softirq+0x20c/0xce0 [] cvm_oct_tasklet_rx+0x0/0x4 [cavium_ethernet] [] timer_interrupt+0x288/0x4c8 [] __do_softirq+0x56c/0x1818 [] do_softirq+0x98/0xb8 [] do_IRQ+0x80c/0x1520 [] kernel_thread+0xa0/0xbf8 [] ret_from_irq+0x0/0x10 [] schedule+0xd4c/0x1a98 [] r4k_wait+0x0/0x10 [] datagram_poll+0x0/0x140 [] cpu_idle+0x44/0x60 [] r4k_wait+0x8/0x10 [] start_kernel+0x460/0x4f8 [] start_kernel+0x458/0x4f8 [] kernel_entry+0x0/0x4c