From: Jan Rovins <janr@adax.com>
To: David Daney <ddaney@caviumnetworks.com>
Cc: "Kevin D. Kissell" <kevink@paralogos.com>, linux-mips@linux-mips.org
Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon
Date: Sat, 19 Jun 2010 15:13:52 -0400
Message-ID: <4C1D16F0.2090102@adax.com>
In-Reply-To: <4C1A98EC.1030708@caviumnetworks.com>
David Daney wrote:
> On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
>> NMI is just an input pin, so you'd really need to know what it's
>> connected to in the system you're working on.
>
> In this case, the NMI is likely being asserted by the watchdog. So if
> you are stuck in a loop with interrupts disabled, the register dump
> might help you figure out where things are stuck. But as you say
> below, knowing the value of the ErrorEPC register is critical.
Thank you David & Kevin for the detailed information.
Yes, in my case it's the watchdog. When I turn the watchdog off, the
machine just hangs, with no NMI dump.
OK, I added the code to print out the ErrorEPC, and got:
ErrorEpc 0xc0000000023c5004
This address is not in vmlinux, but is the address of a loaded module.
So, I poked around in /sys/module/ until I found one that had that
address range:
cat /sys/module/linux_bcm_core/sections/.text
0xc000000001c4e000
And then did an objdump on this module. Since the module dump did not
contain the actual addresses that it was running from, I doctored up the
offsets by using the .text address from /sys/module/ of where the module
actually loaded.
objdump.cavium -d --adjust-vma 0xc000000001c4e000 linux-bcm-core.ko
I just want to check whether all this sounds correct so far. Is my
objdump valid with the .text offset?
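
As a sanity check on the arithmetic above, here is a minimal sketch in Python. It only subtracts the .text load address reported under /sys/module/ from the ErrorEPC value, giving the offset you would expect to see in an unadjusted disassembly. The helper name module_offset is hypothetical, and the sketch assumes the faulting address really lies inside this module's .text section (the section size is not given in this message, so that is not verified here).

```python
# Values taken from this message.
ERROR_EPC = 0xC0000000023C5004   # ErrorEPC printed by the NMI handler
TEXT_BASE = 0xC000000001C4E000   # /sys/module/linux_bcm_core/sections/.text


def module_offset(addr, text_base):
    """Return the offset of addr from the module's .text load address.

    Hypothetical helper: it assumes addr is inside .text, which can only
    be confirmed by also knowing the size of the section.
    """
    if addr < text_base:
        raise ValueError("address is below the .text load address")
    return addr - text_base


offset = module_offset(ERROR_EPC, TEXT_BASE)
print(hex(offset))
```

With --adjust-vma the disassembler adds the same base back onto every address, so the adjusted and unadjusted listings should differ by exactly TEXT_BASE.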
I got a hit on the ErrorEPC value in my dump:
c0000000023c5004: 08000000 j c000000000000000 <sal_dma_alloc-0x1c4e000>
Does this mean that the lockup happened at the jump, or after the jump?
I am also a little confused about the jump location. I am used to
seeing <symbol+offset>, but this has <symbol-offset>. Is that valid?
I have a feeling that it is referring to a symbol in a different module,
since sal_dma_alloc is the first symbol of the module that I am looking
at, and that module is tightly coupled to 2 other modules.
Is c000000000000000 the actual address of the jump? Can I then
just look it up the same way that I found the ErrorEPC address in a
module, or do I have to work backwards from <sal_dma_alloc-0x1c4e000> to
find the offset into the previous module that it is referring to?
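
For reference, the printed jump target can be reproduced from the instruction word alone. The sketch below follows the MIPS J encoding: the low 26 bits of the word, shifted left by 2, replace the low 28 bits of the address of the delay-slot instruction. The function name j_target is hypothetical, and the address of sal_dma_alloc is assumed to equal the .text load address, as the objdump output suggests. One possible reading (an assumption, not confirmed in this message) is that the all-zero target field is simply a relocation that has not been applied in the .ko file; objdump -dr would list relocations next to the instructions.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

PC = 0xC0000000023C5004             # address of the j instruction
INSTR = 0x08000000                  # instruction word from the disassembly
SAL_DMA_ALLOC = 0xC000000001C4E000  # assumed: first symbol, at the .text base


def j_target(pc, instr):
    """Compute the target of a MIPS J instruction located at pc."""
    if (instr >> 26) != 0x02:
        raise ValueError("not a J instruction")
    index = instr & 0x03FFFFFF               # 26-bit instr_index field
    delay_slot = (pc + 4) & MASK64           # address of the delay slot
    region = delay_slot & ~0x0FFFFFFF & MASK64   # keep the upper bits only
    return region | (index << 2)


target = j_target(PC, INSTR)
print(hex(target))                  # address objdump prints after "j"
print(hex(SAL_DMA_ALLOC - target))  # distance objdump prints as a negative offset
```

Since the disassembler only knows the symbols of the file it was given, it labels the target relative to the nearest one it has, which would explain the <symbol-offset> form.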
Jan
> David Daney
>
>> Typically, it's tied to
>> some kind of memory bus time-out, but it could be other things.
>> Depending on what it's hooked up to, knowing what code was executing
>> when it came in may be completely useless. *If* it's hooked up to a bus
>> time-out, *and* the instruction that caused the time-out was a load,
>> *and* the time-out and NMI occurred *after* the processor got to the
>> instruction that consumes the load value (pretty likely if the first two
>> conditions are met), *then* it is worth looking at disassembled kernel
>> code (mips-linux-objdump --disassemble vmlinux) at the ErrorEPC address,
>> *not* the address in EPC, which will have latched the address of the
>> last recoverable exception (which NMI is not, strictly speaking). That
>> instruction should be the consumer of the bad load, so one of its input
>> registers should be the target of that load. If it's a two-input
>> instruction, e.g. add r1,r2,r3, then it could be either r2 or r3, and
>> you have to work your way backwards up the code flow to find out where
>> the r2 and r3 values came from, respectively. *Usually* it's possible to
>> identify the load, thus the register used as a base address, and see
>> that the base address register was trashed, at which point you can start
>> forming hypotheses as to how that could have happened.
>>
>> Of course, in the dump below, we don't see ErrorEPC. I've never been
>> able to figure out why so many kernel register dumps skip that register,
>> especially for NMI reporting. But unless you're able to reproduce this
>> with a kernel that you build yourself, so that you can fix the
>> instrumentation, it's going to be tough. So "Plan B" would be to make
>> sure that any removable memory DIMMs have been properly seated, and
>> double-check that the actual memory capacity corresponds to whatever
>> boot parameters are being passed to the kernel. In other words, if you
>> can't debug the kernel, pray that it's a hardware or operator error. ;o)
>>
>> Regards,
>>
>> Kevin K.
>>
>> Jan Rovins wrote:
>>> Hi, I need some tips on how to go about deciphering the following NMI
>>> dump.
>>>
>>> This is from a 2.6.21.7 kernel that came with the Cavium Networks
>>> 1.8.1 toolchain.
>>> Is there any way to get some kind of back trace from this, or just
>>> find out which function it was in?
>>>
>>> I have been playing around with objdump -x vmlinux but I can't zero in
>>> on anything this way.
>>>
>>> Thanks in advance,
>>>
>>> Jan
>>> *** NMI Watchdog interrupt on Core 0x6 ***
>>> $0 0x0000000000000000 at 0x000000001010cce0
>>> v0 0x000000000000003d v1 0x000000000000024a
>>> a0 0xffffffff807d7b70 a1 0x0000000000000000
>>> a2 0x000000000000024a a3 0x0000000000000000
>>> a4 0xffffffff807d7b60 a5 0x0000000000000080
>>> a6 0x0000000000000001 a7 0xa800000411c62578
>>> t0 0x0000000000000001 t1 0xa80000048ef3e880
>>> t2 0xffffffff82d40000 t3 0xa80000041f48c000
>>> s0 0xc0000000000d9640 s1 0xc000000000088028
>>> s2 0x0000000000000000 s3 0x0000000000000180
>>> s4 0x0000000000000000 s5 0x0000000000000000
>>> s6 0xb7a89c196f513832 s7 0x0000000000000000
>>> t8 0xffffffff807d0000 t9 0xffffffff807d0000
>>> k0 0x0000000000000000 k1 0x00000000104dbcbf
>>> gp 0xa80000041f48c000 sp 0xa80000041f48fcf0
>>> s8 0x0000000000000000 ra 0xc0000000023c5004
>>> epc 0xffffffff802b10b8
>>> status 0x000000001058cce4 cause 0x0000000040008c08
>>> sum0 0x0000002100000000 en0 0x0000009300008000
>>> Code around epc
>>> 0xffffffff802b10a8 000000002406ffff
>>> 0xffffffff802b10ac 0000000064a5ffff
>>> 0xffffffff802b10b0 0000000010a60005
>>> 0xffffffff802b10b4 0000000000000000
>>> 0xffffffff802b10b8 0000000080620000
>>> 0xffffffff802b10bc 000000001440fffb
>>> 0xffffffff802b10c0 0000000064630001
>>> 0xffffffff802b10c4 000000006463ffff
>>> 0xffffffff802b10c8 0000000003e00008
>>>
>>>
>>
>>
>>
>
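
The words listed under "Code around epc" in the quoted dump can also be decoded by hand when no matching vmlinux is available. The sketch below is a minimal illustration in Python: it splits each word into the standard MIPS I-type fields (opcode, rs, rt, 16-bit signed immediate). The helper decode_itype and the small opcode and register tables are hypothetical and cover only the words shown here; the register names follow the ones used in the dump. As noted in the quoted reply, EPC is not necessarily where the NMI arrived, so this identifies the code at EPC and nothing more.

```python
# Only the opcodes and registers that occur in the quoted dump.
OPCODES = {0x04: "beq", 0x05: "bne", 0x09: "addiu", 0x19: "daddiu", 0x20: "lb"}
REGS = {0: "zero", 2: "v0", 3: "v1", 5: "a1", 6: "a2"}

# (address, word) pairs copied from the dump; the instruction is the low 32 bits.
DUMP = [
    (0xFFFFFFFF802B10A8, 0x000000002406FFFF),
    (0xFFFFFFFF802B10AC, 0x0000000064A5FFFF),
    (0xFFFFFFFF802B10B0, 0x0000000010A60005),
    (0xFFFFFFFF802B10B8, 0x0000000080620000),
    (0xFFFFFFFF802B10BC, 0x000000001440FFFB),
    (0xFFFFFFFF802B10C0, 0x0000000064630001),
]


def decode_itype(word):
    """Split a MIPS I-type instruction word into (mnemonic, rs, rt, imm)."""
    word &= 0xFFFFFFFF
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:            # sign-extend the 16-bit immediate
        imm -= 0x10000
    return OPCODES.get(opcode, "op%#x" % opcode), rs, rt, imm


for addr, word in DUMP:
    name, rs, rt, imm = decode_itype(word)
    rs_name = REGS.get(rs, "$%d" % rs)
    rt_name = REGS.get(rt, "$%d" % rt)
    print("%#x: %s rs=%s rt=%s imm=%d" % (addr, name, rs_name, rt_name, imm))
```

Read this way, the word at EPC is a byte load through v1 into v0, followed by a backward branch while v0 is non-zero, which looks like a byte-scanning loop in the kernel proper rather than in the module.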
2010-06-17 21:03 Help with decoding a NMI Watchdog interrupt on an Octeon Jan Rovins
2010-06-17 21:25 ` David Daney
2010-06-17 21:26 ` Kevin D. Kissell
2010-06-17 21:51 ` David Daney
2010-06-19 19:13 ` Jan Rovins [this message]
2010-06-21 5:55 ` Jan Rovins
2010-06-21 5:55 ` Jan Rovins
2010-06-21 16:22 ` David Daney