From mboxrd@z Thu Jan 1 00:00:00 1970 Received: with ECARTIS (v1.0.0; list linux-mips); Sat, 19 Jun 2010 21:13:51 +0200 (CEST) Received: from mail1.adax.com ([208.201.231.104]:19018 "EHLO mail1.adax.com" rhost-flags-OK-OK-OK-OK) by eddie.linux-mips.org with ESMTP id S1491874Ab0FSTNr (ORCPT ); Sat, 19 Jun 2010 21:13:47 +0200 Received: from static-151-204-189-187.pskn.east.verizon.net (static-151-204-189-187.pskn.east.verizon.net [151.204.189.187]) by mail1.adax.com (Postfix) with ESMTP id D8723120A8B; Sat, 19 Jun 2010 12:13:39 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by static-151-204-189-187.pskn.east.verizon.net (Postfix) with ESMTP id E29F5201E7; Sat, 19 Jun 2010 15:13:38 -0400 (EDT) X-Virus-Scanned: Debian amavisd-new at static-151-204-189-187.pskn.east.verizon.net Received: from static-151-204-189-187.pskn.east.verizon.net ([127.0.0.1]) by localhost (static-151-204-189-187.pskn.east.verizon.net [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id JWXZtC+No4IT; Sat, 19 Jun 2010 15:13:31 -0400 (EDT) Received: from [192.168.1.76] (jr001327.mtl-nj.adax [192.168.1.76]) by static-151-204-189-187.pskn.east.verizon.net (Postfix) with ESMTPS id B800F200C0; Sat, 19 Jun 2010 15:13:30 -0400 (EDT) Message-ID: <4C1D16F0.2090102@adax.com> Date: Sat, 19 Jun 2010 15:13:52 -0400 From: Jan Rovins Organization: Adax Inc. User-Agent: Thunderbird 2.0.0.24 (Windows/20100228) MIME-Version: 1.0 To: David Daney CC: "Kevin D. Kissell" , linux-mips@linux-mips.org Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon References: <4C1A8D86.60005@adax.com> <4C1A9319.1020202@paralogos.com> <4C1A98EC.1030708@caviumnetworks.com> In-Reply-To: <4C1A98EC.1030708@caviumnetworks.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 27221 X-ecartis-version: Ecartis v1.0.0 Sender: linux-mips-bounce@linux-mips.org Errors-to: linux-mips-bounce@linux-mips.org X-original-sender: janr@adax.com Precedence: bulk X-list: linux-mips Return-Path: X-Keywords: X-UID: 13609 David Daney wrote: > On 06/17/2010 02:26 PM, Kevin D. Kissell wrote: >> NMI is just an input pin, so you'd really need to know what it's >> connected to in the system you're working on. > > In this case, the NMI is likely being asserted by the watchdog. So if > you are stuck in a loop with interrupts disabled, the register dump > might help you figure out where things are stuck. But as you say > below, knowing the value of the ErrorEPC register is critical. Thank you David & Kevin for the detailed information. Yes, in my case it's the watchdog, when I turn the watchdog off, the machine just hangs, with no NMI dump. Ok, I added the code to Print out the ErrorEPC, and got: ErrorEpc 0xc0000000023c5004 This address is not in vmlinux, but is the address of a loaded module. So, I poked around in /sys/module/ until I found one that had that address range: cat /sys/module/linux_bcm_core/sections/.text :0xc000000001c4e000 And then did an objdump on this module. Since the module dump did not contain the actual addresses that it was running from, I doctored up the offsets by using the .text address from /sys/module/ of where the module actually loaded. objdump.cavium -d --adjust-vma 0xc000000001c4e000 linux-bcm-core.ko Just want to check if all this sounds correct so far? is my objdump valid with the .text offset? I got a hit on the ErrorEPC value in my dump: c0000000023c5004: 08000000 j c000000000000000 Does this mean that the lockup happened at the jump, or after the jump? I am also a little confused about the jump location, I am used to seeing but this has . is that valid? I have a feeling that is referring to a symbol in a different module, since sal_dma_alloc is the first symbol of the module that I am looking at. and that module is tightly coupled to 2 other modules. Is the c000000000000000 the actual address of the jump? can I then just look it up the same way that I found the ErrorEPC address in a module, or do I have to work backwards from to find the offset into the previous module that it is referring to? Jan > David Daney > >> Typically, it's tied to >> some kind of memory bus time-out, but it could be other things. >> Depending on what it's hooked up to, knowing what code was executing >> when it came in may be completely useless. *If* it's hooked up to a bus >> time-out, *and* the instruction that caused the time-out was a load, >> *and* the time-out and NMI occurred *after* the processor got to the >> instruction that consumes the load value (pretty likely if the first two >> conditions are met), *then* looking at disassembled kernel code >> (mips-linux-objdump --disassemble vmlinux) at the ErrorEPC address, >> *not* the address in EPC, which will have latched the address of the >> last recoverable exception (which NMI is not, strictly speaking). That >> instruction should be the consumer of the bad load, so one of its input >> registers should be the target of that load. If it's a two-input >> instruction, e.g. add r1,r2,r3, then it could be either r2 or r3, and >> you have to work your way backwords up the code flow to find out where >> the r2 and r3 values came from, respectively. *Usually* it's possible to >> identify the load, thus the register used as a base address, and see >> that the base address register was trashed, at which point you can start >> forming hypotheses as to how that could have happened. >> >> Of course, in the dump below, we don't see ErrorEPC. I've never been >> able to figure out why so many kernel register dumps skip that register, >> especially for NMI reporting. But unless you're able to reproduce this >> with a kernel that you build yourself, so that you can fix the >> instrumentation, it's going to be tough. So "Plan B" would be to make >> sure that any removable memory DIMMs have been properly seated, and >> double-check that the actual memory capacity corresponds to whatever >> boot parameters are being passed to the kernel. In otherwords, if you >> can't debug the kernel, pray that it's a hardware or operator error. ;o) >> >> Regards, >> >> Kevin K. >> >> Jan Rovins wrote: >>> Hi, I need some tips on how to go about deciphering the following NMI >>> dump. >>> >>> This is from a 2.6.21.7 kernel that came with the Cavium Networks >>> 1.8.1 toolchain. >>> Is there any way to get some kind of back trace from this, or just >>> find out which function it was in? >>> >>> I have been playing around with objdump -x vmlinux but I cant zero in >>> on anything this way. >>> >>> Thanks in advance, >>> >>> Jan >>> *** NMI Watchdog interrupt on Core 0x6 *** >>> $0 0x0000000000000000 at 0x000000001010cce0 >>> v0 0x000000000000003d v1 0x000000000000024a >>> a0 0xffffffff807d7b70 a1 0x0000000000000000 >>> a2 0x000000000000024a a3 0x0000000000000000 >>> a4 0xffffffff807d7b60 a5 0x0000000000000080 >>> a6 0x0000000000000001 a7 0xa800000411c62578 >>> t0 0x0000000000000001 t1 0xa80000048ef3e880 >>> t2 0xffffffff82d40000 t3 0xa80000041f48c000 >>> s0 0xc0000000000d9640 s1 0xc000000000088028 >>> s2 0x0000000000000000 s3 0x0000000000000180 >>> s4 0x0000000000000000 s5 0x0000000000000000 >>> s6 0xb7a89c196f513832 s7 0x0000000000000000 >>> t8 0xffffffff807d0000 t9 0xffffffff807d0000 >>> k0 0x0000000000000000 k1 0x00000000104dbcbf >>> gp 0xa80000041f48c000 sp 0xa80000041f48fcf0 >>> s8 0x0000000000000000 ra 0xc0000000023c5004 >>> epc 0xffffffff802b10b8 >>> status 0x000000001058cce4 cause 0x0000000040008c08 >>> sum0 0x0000002100000000 en0 0x0000009300008000 >>> Code around epc >>> 0xffffffff802b10a8 000000002406ffff >>> 0xffffffff802b10ac 0000000064a5ffff >>> 0xffffffff802b10b0 0000000010a60005 >>> 0xffffffff802b10b4 0000000000000000 >>> 0xffffffff802b10b8 0000000080620000 >>> 0xffffffff802b10bc 000000001440fffb >>> 0xffffffff802b10c0 0000000064630001 >>> 0xffffffff802b10c4 000000006463ffff >>> 0xffffffff802b10c8 0000000003e00008 >>> >>> >> >> >> >