Re: Help with decoding a NMI Watchdog interrupt on an Octeon

Linux MIPS Architecture development

From: Jan Rovins <janr@adax.com>
To: David Daney <ddaney@caviumnetworks.com>
Cc: "Kevin D. Kissell" <kevink@paralogos.com>, linux-mips@linux-mips.org
Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon
Date: Sat, 19 Jun 2010 15:13:52 -0400
Message-ID: <4C1D16F0.2090102@adax.com>
In-Reply-To: <4C1A98EC.1030708@caviumnetworks.com>

David Daney wrote:
> On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
>> NMI is just an input pin, so you'd really need to know what it's
>> connected to in the system you're working on.
>
> In this case, the NMI is likely being asserted by the watchdog.  So if 
> you are stuck in a loop with interrupts disabled, the register dump 
> might help you figure out where things are stuck.  But as you say 
> below, knowing the value of the ErrorEPC register is critical.
Thank you David & Kevin for the detailed information.

Yes, in my case it's the watchdog. When I turn the watchdog off, the 
machine just hangs, with no NMI dump.

OK, I added code to print out ErrorEPC, and got:
ErrorEpc        0xc0000000023c5004
This address is not in vmlinux; it falls inside a loaded module.

So I poked around in /sys/module/ until I found a module whose .text 
address is just below that value:
cat /sys/module/linux_bcm_core/sections/.text
0xc000000001c4e000
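
The sysfs search described above could be automated along these lines.
This is only a sketch, not what was actually run in this thread: the
helper names read_text_bases and find_module are hypothetical, and it
assumes every loaded module exposes its .text load address in
/sys/module/<name>/sections/.text (reading these files usually needs
root). sysfs does not give the section size, so the sketch only picks
the module with the highest .text base at or below the address; the
result should still be checked against the module size.

```python
import glob
import os


def read_text_bases(sysfs_root="/sys/module"):
    """Return {module_name: .text load address} for modules that expose one."""
    bases = {}
    for path in glob.glob(os.path.join(sysfs_root, "*", "sections", ".text")):
        name = path.split(os.sep)[-3]  # .../<name>/sections/.text
        try:
            with open(path) as f:
                value = int(f.read().strip(), 16)
        except (OSError, ValueError):
            continue  # unreadable (not root) or unexpected contents
        if value:  # skip zeroed-out addresses
            bases[name] = value
    return bases


def find_module(addr, bases):
    """Pick the module whose .text base is the closest one at or below addr."""
    candidates = {name: base for name, base in bases.items() if base <= addr}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    error_epc = 0xC0000000023C5004  # value printed by the NMI handler
    print(find_module(error_epc, read_text_bases()))
```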

Then I ran objdump on this module. Since the disassembly of the .ko does 
not contain the addresses the module was actually running from, I 
adjusted the offsets using the .text address from /sys/module/, which is 
where the module actually loaded:
objdump.cavium -d --adjust-vma 0xc000000001c4e000 linux-bcm-core.ko
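
The address arithmetic behind the --adjust-vma step can be checked by
hand. The sketch below assumes that .text in the .ko starts at offset 0,
which is typical for a relocatable object but is not confirmed in this
thread; the function names are hypothetical. Other sections of the .ko
would be shifted by the same amount, so only the .text addresses are
meaningful after the adjustment.

```python
ERROR_EPC = 0xC0000000023C5004  # ErrorEPC printed by the NMI handler
TEXT_BASE = 0xC000000001C4E000  # /sys/module/linux_bcm_core/sections/.text


def text_offset(addr, text_base):
    """Offset of a run-time address from the start of the module's .text."""
    return addr - text_base


def runtime_address(offset, text_base):
    """Inverse mapping: the shift that --adjust-vma applies to an offset."""
    return text_base + offset


offset = text_offset(ERROR_EPC, TEXT_BASE)
print(hex(offset))  # 0x777004, the offset to look for in an unadjusted dump
```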

I just want to check whether all this sounds correct so far. Is my 
objdump output valid with the .text offset applied?

I got a hit on the ErrorEPC value in my dump:
c0000000023c5004:       08000000        j       c000000000000000 <sal_dma_alloc-0x1c4e000>
Does this mean that the lockup happened at the jump, or after the jump?
I am also a little confused about the jump target. I am used to 
seeing <symbol+offset>, but this has <symbol-offset>. Is that valid?
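
For reference, the target that objdump prints for a MIPS J instruction
comes from the 26-bit index field combined with the upper bits of the
delay-slot address. The sketch below reproduces that calculation for the
word at ErrorEPC; the function name decode_j is hypothetical, and
TEXT_BASE is assumed to be the address of sal_dma_alloc because it is
described as the first symbol of the module.

```python
def decode_j(pc, word):
    """Return the target address of a MIPS J instruction located at pc."""
    opcode = word >> 26
    if opcode != 0b000010:  # J is major opcode 2
        raise ValueError("not a J instruction")
    index = word & 0x03FFFFFF       # 26-bit instr_index field
    delay_slot = pc + 4             # upper bits come from the delay slot
    return (delay_slot & ~0x0FFFFFFF) | (index << 2)


ERROR_EPC = 0xC0000000023C5004
TEXT_BASE = 0xC000000001C4E000      # assumed address of sal_dma_alloc

target = decode_j(ERROR_EPC, 0x08000000)
print(hex(target))                  # 0xc000000000000000
print(hex(target - TEXT_BASE))      # -0x1c4e000
```

With an index field of zero the target is simply the base of the 256 MB
region containing the jump, which lies 0x1c4e000 below the first symbol
objdump knows about; that would explain the <symbol-offset> form. In a
relocatable .ko an all-zero index field is often a placeholder that a
relocation fills in at load time, so it may be worth comparing against
objdump -dr, which prints relocation entries next to the disassembly.
Whether that is what happens here is not confirmed in this message.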

I have a feeling that it refers to a symbol in a different module, 
since sal_dma_alloc is the first symbol of the module that I am looking 
at, and that module is tightly coupled to two other modules.

Is c000000000000000 the actual target address of the jump? Can I then 
just look it up the same way that I found the ErrorEPC address in a 
module, or do I have to work backwards from <sal_dma_alloc-0x1c4e000> to 
find the offset into the previous module that it refers to?

Jan

> David Daney
>
>> Typically, it's tied to
>> some kind of memory bus time-out, but it could be other things.
>> Depending on what it's hooked up to, knowing what code was executing
>> when it came in may be completely useless. *If* it's hooked up to a bus
>> time-out, *and* the instruction that caused the time-out was a load,
>> *and* the time-out and NMI occurred *after* the processor got to the
>> instruction that consumes the load value (pretty likely if the first two
>> conditions are met), *then* it helps to look at disassembled kernel code
>> (mips-linux-objdump --disassemble vmlinux) at the ErrorEPC address,
>> *not* the address in EPC, which will have latched the address of the
>> last recoverable exception (which NMI is not, strictly speaking). That
>> instruction should be the consumer of the bad load, so one of its input
>> registers should be the target of that load. If it's a two-input
>> instruction, e.g. add r1,r2,r3, then it could be either r2 or r3, and
>> you have to work your way backwards up the code flow to find out where
>> the r2 and r3 values came from, respectively. *Usually* it's possible to
>> identify the load, thus the register used as a base address, and see
>> that the base address register was trashed, at which point you can start
>> forming hypotheses as to how that could have happened.
>>
>> Of course, in the dump below, we don't see ErrorEPC. I've never been
>> able to figure out why so many kernel register dumps skip that register,
>> especially for NMI reporting. But unless you're able to reproduce this
>> with a kernel that you build yourself, so that you can fix the
>> instrumentation, it's going to be tough. So "Plan B" would be to make
>> sure that any removable memory DIMMs have been properly seated, and
>> double-check that the actual memory capacity corresponds to whatever
>> boot parameters are being passed to the kernel. In other words, if you
>> can't debug the kernel, pray that it's a hardware or operator error. ;o)
>>
>> Regards,
>>
>> Kevin K.
>>
>> Jan Rovins wrote:
>>> Hi, I need some tips on how to go about deciphering the following NMI
>>> dump.
>>>
>>> This is from a 2.6.21.7 kernel that came with the Cavium Networks
>>> 1.8.1 toolchain.
>>> Is there any way to get some kind of back trace from this, or just
>>> find out which function it was in?
>>>
>>> I have been playing around with objdump -x vmlinux but I can't zero in
>>> on anything this way.
>>>
>>> Thanks in advance,
>>>
>>> Jan
>>> *** NMI Watchdog interrupt on Core 0x6 ***
>>> $0 0x0000000000000000 at 0x000000001010cce0
>>> v0 0x000000000000003d v1 0x000000000000024a
>>> a0 0xffffffff807d7b70 a1 0x0000000000000000
>>> a2 0x000000000000024a a3 0x0000000000000000
>>> a4 0xffffffff807d7b60 a5 0x0000000000000080
>>> a6 0x0000000000000001 a7 0xa800000411c62578
>>> t0 0x0000000000000001 t1 0xa80000048ef3e880
>>> t2 0xffffffff82d40000 t3 0xa80000041f48c000
>>> s0 0xc0000000000d9640 s1 0xc000000000088028
>>> s2 0x0000000000000000 s3 0x0000000000000180
>>> s4 0x0000000000000000 s5 0x0000000000000000
>>> s6 0xb7a89c196f513832 s7 0x0000000000000000
>>> t8 0xffffffff807d0000 t9 0xffffffff807d0000
>>> k0 0x0000000000000000 k1 0x00000000104dbcbf
>>> gp 0xa80000041f48c000 sp 0xa80000041f48fcf0
>>> s8 0x0000000000000000 ra 0xc0000000023c5004
>>> epc 0xffffffff802b10b8
>>> status 0x000000001058cce4 cause 0x0000000040008c08
>>> sum0 0x0000002100000000 en0 0x0000009300008000
>>> Code around epc
>>> 0xffffffff802b10a8 000000002406ffff
>>> 0xffffffff802b10ac 0000000064a5ffff
>>> 0xffffffff802b10b0 0000000010a60005
>>> 0xffffffff802b10b4 0000000000000000
>>> 0xffffffff802b10b8 0000000080620000
>>> 0xffffffff802b10bc 000000001440fffb
>>> 0xffffffff802b10c0 0000000064630001
>>> 0xffffffff802b10c4 000000006463ffff
>>> 0xffffffff802b10c8 0000000003e00008
>>>
>>>
>>
>>
>>
>
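
The "Code around epc" words in the quoted dump can also be split into
their instruction fields by hand when no disassembler is available. The
sketch below only extracts the standard MIPS I-type fields (opcode, rs,
rt, 16-bit immediate); it is not a disassembler, and the function name
itype_fields is hypothetical. The dump prints each 32-bit instruction
padded to 64 bits, so only the low 32 bits are used.

```python
def itype_fields(word):
    """Split a MIPS instruction word into I-type fields."""
    word &= 0xFFFFFFFF                  # dump pads each word to 64 bits
    imm = word & 0xFFFF
    if imm & 0x8000:                    # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": imm,
    }


# Word at epc (0xffffffff802b10b8) and the word after it, from the dump.
for raw in (0x0000000080620000, 0x000000001440FFFB):
    print(hex(raw & 0xFFFFFFFF), itype_fields(raw))
```

For the word at epc this gives opcode 0x20 (LB, load byte) with base
register 3 (v1), destination register 2 (v0) and offset 0. The following
word has opcode 5 (BNE) comparing register 2 (v0) with register 0, with
a branch offset of -5 instructions.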
