Help with decoding a NMI Watchdog interrupt on an Octeon


* Help with decoding a NMI Watchdog interrupt on an Octeon
@ 2010-06-17 21:03 Jan Rovins
  2010-06-17 21:25 ` David Daney
  2010-06-17 21:26 ` Kevin D. Kissell
  0 siblings, 2 replies; 8+ messages in thread
From: Jan Rovins @ 2010-06-17 21:03 UTC (permalink / raw)
  To: linux-mips

Hi, I need some tips on how to go about deciphering the following NMI dump.

This is from a 2.6.21.7 kernel that came with the Cavium Networks 1.8.1 
toolchain.
Is there any way to get some kind of backtrace from this, or at least find 
out which function it was in?

I have been playing around with objdump -x vmlinux, but I can't zero in 
on anything this way.

Thanks in advance,

 Jan
*** NMI Watchdog interrupt on Core 0x6 ***
        $0      0x0000000000000000      at      0x000000001010cce0
        v0      0x000000000000003d      v1      0x000000000000024a
        a0      0xffffffff807d7b70      a1      0x0000000000000000
        a2      0x000000000000024a      a3      0x0000000000000000
        a4      0xffffffff807d7b60      a5      0x0000000000000080
        a6      0x0000000000000001      a7      0xa800000411c62578
        t0      0x0000000000000001      t1      0xa80000048ef3e880
        t2      0xffffffff82d40000      t3      0xa80000041f48c000
        s0      0xc0000000000d9640      s1      0xc000000000088028
        s2      0x0000000000000000      s3      0x0000000000000180
        s4      0x0000000000000000      s5      0x0000000000000000
        s6      0xb7a89c196f513832      s7      0x0000000000000000
        t8      0xffffffff807d0000      t9      0xffffffff807d0000
        k0      0x0000000000000000      k1      0x00000000104dbcbf
        gp      0xa80000041f48c000      sp      0xa80000041f48fcf0
        s8      0x0000000000000000      ra      0xc0000000023c5004
        epc     0xffffffff802b10b8
        status  0x000000001058cce4      cause   0x0000000040008c08
        sum0    0x0000002100000000      en0     0x0000009300008000
Code around epc
        0xffffffff802b10a8      000000002406ffff
        0xffffffff802b10ac      0000000064a5ffff
        0xffffffff802b10b0      0000000010a60005
        0xffffffff802b10b4      0000000000000000
        0xffffffff802b10b8      0000000080620000
        0xffffffff802b10bc      000000001440fffb
        0xffffffff802b10c0      0000000064630001
        0xffffffff802b10c4      000000006463ffff
        0xffffffff802b10c8      0000000003e00008
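
The "Code around epc" words can be read without a full disassembler by decoding the few MIPS encodings that occur in them. The following is only a sketch under stated assumptions: the `decode` helper and its small opcode table are hypothetical names introduced here, it handles just the instruction types present in this dump, and it uses the n64 register names that the dump itself prints (a4..a7).

```python
# Minimal decoder for the handful of MIPS64 encodings seen in the dump.
# Register names follow the n64 ABI, matching the dump (a4..a7, t0..t3).
REGS = ["zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
        "a4", "a5", "a6", "a7", "t0", "t1", "t2", "t3",
        "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
        "t8", "t9", "k0", "k1", "gp", "sp", "s8", "ra"]


def sext16(x):
    """Sign-extend a 16-bit immediate."""
    return x - 0x10000 if x & 0x8000 else x


def decode(pc, word):
    """Decode one 32-bit instruction word located at address pc."""
    op = word >> 26
    rs = (word >> 21) & 31
    rt = (word >> 16) & 31
    imm = sext16(word & 0xffff)
    if word == 0:
        return "nop"
    if op == 0 and (word & 0x3f) == 0x08:      # SPECIAL / JR
        return f"jr {REGS[rs]}"
    if op in (0x09, 0x19):                     # ADDIU / DADDIU
        name = "addiu" if op == 0x09 else "daddiu"
        return f"{name} {REGS[rt]},{REGS[rs]},{imm}"
    if op in (0x04, 0x05):                     # BEQ / BNE
        name = "beq" if op == 0x04 else "bne"
        # Branch offsets are relative to the delay slot (pc + 4).
        target = (pc + 4 + (imm << 2)) & 0xffffffffffffffff
        return f"{name} {REGS[rs]},{REGS[rt]},0x{target:x}"
    if op == 0x20:                             # LB
        return f"lb {REGS[rt]},{imm}({REGS[rs]})"
    return f".word 0x{word:08x}"               # anything not handled here


# Low 32 bits of each word printed under "Code around epc".
WORDS = [0x2406ffff, 0x64a5ffff, 0x10a60005, 0x00000000, 0x80620000,
         0x1440fffb, 0x64630001, 0x6463ffff, 0x03e00008]
BASE = 0xffffffff802b10a8

for i, w in enumerate(WORDS):
    pc = BASE + 4 * i
    print(f"{pc:016x}: {decode(pc, w)}")
```

Read this way, the fragment looks like a bounded byte-scanning loop: it loads a byte through v1 and stops on a zero byte or when the a1 counter runs out. That would be consistent with a string-length style helper, but confirming which function it is still requires the vmlinux disassembly, and as the replies in this thread point out, this EPC value may not be where the NMI actually hit.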


* Re: Help with decoding a NMI Watchdog interrupt on an Octeon
  2010-06-17 21:03 Help with decoding a NMI Watchdog interrupt on an Octeon Jan Rovins
@ 2010-06-17 21:25 ` David Daney
  2010-06-17 21:26 ` Kevin D. Kissell
  1 sibling, 0 replies; 8+ messages in thread
From: David Daney @ 2010-06-17 21:25 UTC (permalink / raw)
  To: Jan Rovins; +Cc: linux-mips

On 06/17/2010 02:03 PM, Jan Rovins wrote:
> epc 0xffffffff802b10b8

You may want to verify that epc value is being loaded from C0_ErrorEPC 
rather than C0_EPC.  SDK-1.8.1 gets this wrong.

Look in watchdog.c:octeon_watchdog_nmi_stage3.

Once you have it printing the ErrorEPC value, the trace actually tells 
you what was happening when the NMI fired.

objdump -d vmlinux will give you a disassembly of the kernel, and away 
you go.
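
Besides reading the disassembly, an address can be mapped to its containing function with a sorted symbol table such as the output of `nm -n vmlinux` or a System.map file. The sketch below assumes that three-column "address type name" format; `parse_symbols` and `lookup` are hypothetical helper names, and the symbol names and addresses in `SAMPLE` are invented placeholders, not values from the real kernel.

```python
import bisect


def parse_symbols(text):
    """Parse '<hex addr> <type> <name>' lines into a sorted list of (addr, name)."""
    syms = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            syms.append((int(parts[0], 16), parts[2]))
    syms.sort()
    return syms


def lookup(syms, addr):
    """Return (name, offset) of the last symbol at or below addr, else None."""
    addrs = [a for a, _ in syms]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = syms[i]
    return name, addr - base


# Placeholder symbol table: names and addresses are made up for illustration.
SAMPLE = """\
ffffffff802b1000 T example_func_a
ffffffff802b1090 T example_func_b
ffffffff802b10d0 T example_func_c
"""

syms = parse_symbols(SAMPLE)
name, off = lookup(syms, 0xffffffff802b10b8)
print(f"{name}+0x{off:x}")
```

With the real symbol table in place of `SAMPLE`, the same lookup gives the `<symbol+offset>` form that objdump prints next to each address.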

David Daney


* Re: Help with decoding a NMI Watchdog interrupt on an Octeon
  2010-06-17 21:03 Help with decoding a NMI Watchdog interrupt on an Octeon Jan Rovins
  2010-06-17 21:25 ` David Daney
@ 2010-06-17 21:26 ` Kevin D. Kissell
  2010-06-17 21:51   ` David Daney
  1 sibling, 1 reply; 8+ messages in thread
From: Kevin D. Kissell @ 2010-06-17 21:26 UTC (permalink / raw)
  To: Jan Rovins; +Cc: linux-mips

NMI is just an input pin, so you'd really need to know what it's 
connected to in the system you're working on.  Typically, it's tied to 
some kind of memory bus time-out, but it could be other things.  
Depending on what it's hooked up to, knowing what code was executing 
when it came in may be completely useless.

*If* it's hooked up to a bus time-out, *and* the instruction that caused 
the time-out was a load, *and* the time-out and NMI occurred *after* the 
processor got to the instruction that consumes the load value (pretty 
likely if the first two conditions are met), *then* looking at the 
disassembled kernel code (mips-linux-objdump --disassemble vmlinux) at 
the ErrorEPC address should show you the instruction that stalled.  Use 
ErrorEPC, *not* the address in EPC, which will have latched the address 
of the last recoverable exception (which NMI is not, strictly speaking).

That instruction should be the consumer of the bad load, so one of its 
input registers should be the target of that load.  If it's a two-input 
instruction, e.g. add r1,r2,r3, then it could be either r2 or r3, and 
you have to work your way backwards up the code flow to find out where 
the r2 and r3 values came from.  *Usually* it's possible to identify the 
load, and thus the register used as a base address, and see that the 
base address register was trashed, at which point you can start forming 
hypotheses as to how that could have happened.

Of course, in the dump below, we don't see ErrorEPC.  I've never been 
able to figure out why so many kernel register dumps skip that register, 
especially for NMI reporting.  But unless you're able to reproduce this 
with a kernel that you build yourself, so that you can fix the 
instrumentation, it's going to be tough.  So "Plan B" would be to make 
sure that any removable memory DIMMs have been properly seated, and 
double-check that the actual memory capacity corresponds to whatever 
boot parameters are being passed to the kernel.  In other words, if you 
can't debug the kernel, pray that it's a hardware or operator error. ;o)

          Regards,

          Kevin K.


* Re: Help with decoding a NMI Watchdog interrupt on an Octeon
  2010-06-17 21:26 ` Kevin D. Kissell
@ 2010-06-17 21:51   ` David Daney
  2010-06-19 19:13     ` Jan Rovins
  0 siblings, 1 reply; 8+ messages in thread
From: David Daney @ 2010-06-17 21:51 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: Jan Rovins, linux-mips

On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
> NMI is just an input pin, so you'd really need to know what it's
> connected to in the system you're working on.

In this case, the NMI is likely being asserted by the watchdog.  So if 
you are stuck in a loop with interrupts disabled, the register dump 
might help you figure out where things are stuck.  But as you say below, 
knowing the value of the ErrorEPC register is critical.

David Daney


* Re: Help with decoding a NMI Watchdog interrupt on an Octeon
  2010-06-17 21:51   ` David Daney
@ 2010-06-19 19:13     ` Jan Rovins
  2010-06-21  5:55         ` Jan Rovins
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Rovins @ 2010-06-19 19:13 UTC (permalink / raw)
  To: David Daney; +Cc: Kevin D. Kissell, linux-mips

David Daney wrote:
> On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
>> NMI is just an input pin, so you'd really need to know what it's
>> connected to in the system you're working on.
>
> In this case, the NMI is likely being asserted by the watchdog.  So if 
> you are stuck in a loop with interrupts disabled, the register dump 
> might help you figure out where things are stuck.  But as you say 
> below, knowing the value of the ErrorEPC register is critical.

Thank you, David and Kevin, for the detailed information.

Yes, in my case it's the watchdog: when I turn the watchdog off, the 
machine just hangs, with no NMI dump.

OK, I added the code to print out the ErrorEPC, and got:
ErrorEpc        0xc0000000023c5004
This address is not in vmlinux, but is the address of a loaded module.
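
The split between vmlinux and module addresses can be read off the top bits of a 64-bit MIPS virtual address. The sketch below classifies an address into the architectural segments; `mips64_segment` is a hypothetical helper name, and the observation that kernel text sits in ckseg0 while modules land in xkseg is an assumption about this particular kernel configuration, inferred from the addresses in the dump.

```python
def mips64_segment(addr):
    """Name the MIPS64 address segment that a 64-bit virtual address falls in."""
    if addr >= 0xffffffff80000000:
        # 32-bit compatibility segments, selected by bits 30..29.
        return ["ckseg0", "ckseg1", "cksseg", "ckseg3"][(addr >> 29) & 3]
    # Otherwise the top two bits pick the 64-bit segment. The reserved gap
    # above the implemented part of xkseg is not distinguished here.
    return ["xuseg", "xsseg", "xkphys", "xkseg"][addr >> 62]


for label, addr in [("epc", 0xffffffff802b10b8),
                    ("ErrorEPC", 0xc0000000023c5004),
                    ("sp", 0xa80000041f48fcf0)]:
    print(f"{label:9s} 0x{addr:016x} {mips64_segment(addr)}")
```

Under that assumption, an address in ckseg0 is worth looking up in vmlinux, while an xkseg address such as the ErrorEPC value points at mapped memory, where loaded modules live.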

So I poked around in /sys/module/ until I found a module whose address 
range contained it:
cat /sys/module/linux_bcm_core/sections/.text
0xc000000001c4e000

I then ran objdump on this module. Since the module's disassembly does 
not show the addresses it was actually running at, I adjusted the 
addresses using the .text address from /sys/module/, which is where the 
module was actually loaded:
objdump.cavium -d --adjust-vma 0xc000000001c4e000  linux-bcm-core.ko
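
The same arithmetic can be done directly: subtract the module's .text load address from ErrorEPC to get an offset into .text, which can then be matched against an unadjusted `objdump -d` listing. A minimal sketch, assuming the .text base addresses have already been collected from /sys/module/*/sections/.text into a dictionary; `locate` is a hypothetical helper, and it does not check section sizes, so an address beyond the end of a module would still be attributed to it.

```python
def locate(bases, addr):
    """Return (module, offset) for the highest .text base not above addr.

    bases maps module name -> .text load address. Returns None when addr
    lies below every base. Section sizes are not checked.
    """
    best = None
    for name, base in bases.items():
        if base <= addr and (best is None or base > best[1]):
            best = (name, base)
    if best is None:
        return None
    return best[0], addr - best[1]


# The one base address reported in this thread.
bases = {"linux_bcm_core": 0xc000000001c4e000}
error_epc = 0xc0000000023c5004

name, off = locate(bases, error_epc)
print(f"{name} .text+0x{off:x}")
```

Because every section of a relocatable .ko starts at address 0 in the file, a single `--adjust-vma` value shifts all sections by the same amount; the adjusted listing should therefore be trusted only for .text, not for sections such as .init.text that are loaded elsewhere.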

I just want to check whether all this sounds correct so far. Is my objdump 
output valid with the .text offset?

I got a hit on the ErrorEPC value in my dump:
c0000000023c5004:       08000000        j       c000000000000000 <sal_dma_alloc-0x1c4e000>
Does this mean that the lockup happened at the jump, or after the jump?
I am also a little confused about the jump location. I am used to 
seeing <symbol+offset>, but this has <symbol-offset>. Is that valid?

I have a feeling that it refers to a symbol in a different module, 
since sal_dma_alloc is the first symbol of the module that I am looking 
at, and that module is tightly coupled to 2 other modules.

Is c000000000000000 the actual address that the jump goes to? Can I then 
just look it up the same way that I found the ErrorEPC address in a 
module, or do I have to work backwards from <sal_dma_alloc-0x1c4e000> to 
find the offset into the previous module that it is referring to?

Jan


* RE: Help with decoding a NMI Watchdog interrupt on an Octeon
@ 2010-06-21  5:55         ` Jan Rovins
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Rovins @ 2010-06-21  5:55 UTC (permalink / raw)
  To: 'Jan Rovins', 'David Daney'
  Cc: 'Kevin D. Kissell', linux-mips

Some additions & corrections to the previous:

> I got a hit on the ErrorEPC value in my dump:
> c0000000023c5004:       08000000        j       c000000000000000 <sal_dma_alloc-0x1c4e000>


This line of code was inside a function called _default_assert, which, on
an assertion failure, did a printk() and went into an intentional infinite
loop; that explains the NMI dump. The only thing that puzzles me now is
that the assert-failure printk was rarely displayed. Could that be because it
was called while interrupts were turned off? I suppose that would stop it
from showing up in /var/log/messages.

The assembly still does not make sense to me (this is my first time with MIPS
assembly), but after examining the C code I think I understand what's going on
here.


* RE: Help with decoding a NMI Watchdog interrupt on an Octeon
@ 2010-06-21  5:55         ` Jan Rovins
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Rovins @ 2010-06-21  5:55 UTC (permalink / raw)
  To: 'Jan Rovins', 'David Daney'
  Cc: 'Kevin D. Kissell', linux-mips

Some additions & corrections to the previous:

> -----Original Message-----
> From: linux-mips-bounce@linux-mips.org [mailto:linux-mips-bounce@linux-
> mips.org] On Behalf Of Jan Rovins
> Sent: Saturday, June 19, 2010 3:14 PM
> To: David Daney
> Cc: Kevin D. Kissell; linux-mips@linux-mips.org
> Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon
> 
> David Daney wrote:
> > On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
> >> NMI is just an input pin, so you'd really need to know what it's
> >> connected to in the system you're working on.
> >
> > In this case, the NMI is likely being asserted by the watchdog.  So if
> > you are stuck in a loop with interrupts disabled, the register dump
> > might help you figure out where things are stuck.  But as you say
> > below, knowing the value of the ErrorEPC register is critical.
> Thank you David & Kevin for the detailed information.
> 
> Yes, in my case it's the watchdog, when I turn the watchdog off, the
> machine just hangs, with no NMI dump.
> 
> Ok, I added the code to Print out the ErrorEPC, and got:
> ErrorEpc        0xc0000000023c5004
> This address is not in vmlinux, but is the address of a loaded module.
> 
> So, I poked around in /sys/module/ until I found one that had that
> address range:
> cat /sys/module/linux_bcm_core/sections/.text :0xc000000001c4e000
> 
> And then did an objdump on this module. Since the module dump did not
> contain the actual addresses that it was running from, I doctored up the
> offsets by using the .text address from /sys/module/ of where the module
> actually loaded.
> objdump.cavium -d --adjust-vma 0xc000000001c4e000  linux-bcm-core.ko
> 
> Just want to check if all this sounds correct so far? is my objdump
> valid with the .text offset?
> 
> I got a hit on the ErrorEPC value in my dump:
> c0000000023c5004:       08000000        j       c000000000000000
> <sal_dma_alloc-0x1c4e000>


This line of code was inside a function called _default_assert, which on
assertion failure, did a printk() and went into an intentional infinite
loop, which explains the NMI dump. The only thing that puzzles me now, is
that the assert failure printk rarely displayed. Could that be because it
was called while interrupts were turned off? I suppose that would stop it
from showing up in /var/log/messages.

The assembly still does not make sense to me (first time with MIPS assembly)
but on examining the C code I think I understand what's going on here.

> Does this mean that the lockup happened at the jump, or after the jump?
> I am also a little confused about the jump location,  I am used to
> seeing <symbol+offset> but this has <symbol-offset>. is that valid?
> 
> I have a feeling that is referring to a symbol in a different module,
> since sal_dma_alloc is the first symbol of the module that I am looking
> at. and that module is tightly coupled to 2 other modules.
> 
> Is the c000000000000000  the actual address of the jump? can I  then
> just  look it up the same way that I found the ErrorEPC  address in a
> module, or do I have to work backwards from <sal_dma_alloc-0x1c4e000> to
> find the offset into the previous module that it is referring to?
> 
> Jan
> 
> > David Daney
> >
> >> Typically, it's tied to
> >> some kind of memory bus time-out, but it could be other things.
> >> Depending on what it's hooked up to, knowing what code was executing
> >> when it came in may be completely useless. *If* it's hooked up to a bus
> >> time-out, *and* the instruction that caused the time-out was a load,
> >> *and* the time-out and NMI occurred *after* the processor got to the
> >> instruction that consumes the load value (pretty likely if the first two
> >> conditions are met), *then* it is worth looking at disassembled kernel code
> >> (mips-linux-objdump --disassemble vmlinux) at the ErrorEPC address,
> >> *not* the address in EPC, which will have latched the address of the
> >> last recoverable exception (which NMI is not, strictly speaking). That
> >> instruction should be the consumer of the bad load, so one of its input
> >> registers should be the target of that load. If it's a two-input
> >> instruction, e.g. add r1,r2,r3, then it could be either r2 or r3, and
> >> you have to work your way backwards up the code flow to find out where
> >> the r2 and r3 values came from, respectively. *Usually* it's possible to
> >> identify the load, thus the register used as a base address, and see
> >> that the base address register was trashed, at which point you can start
> >> forming hypotheses as to how that could have happened.
> >>
> >> Of course, in the dump below, we don't see ErrorEPC. I've never been
> >> able to figure out why so many kernel register dumps skip that register,
> >> especially for NMI reporting. But unless you're able to reproduce this
> >> with a kernel that you build yourself, so that you can fix the
> >> instrumentation, it's going to be tough. So "Plan B" would be to make
> >> sure that any removable memory DIMMs have been properly seated, and
> >> double-check that the actual memory capacity corresponds to whatever
> >> boot parameters are being passed to the kernel. In other words, if you
> >> can't debug the kernel, pray that it's a hardware or operator error. ;o)
> >>
> >> Regards,
> >>
> >> Kevin K.
> >>
> >> Jan Rovins wrote:
> >>> Hi, I need some tips on how to go about deciphering the following NMI
> >>> dump.
> >>>
> >>> This is from a 2.6.21.7 kernel that came with the Cavium Networks
> >>> 1.8.1 toolchain.
> >>> Is there any way to get some kind of back trace from this, or just
> >>> find out which function it was in?
> >>>
> >>> I have been playing around with objdump -x vmlinux but I cant zero in
> >>> on anything this way.
> >>>
> >>> Thanks in advance,
> >>>
> >>> Jan
> >>> *** NMI Watchdog interrupt on Core 0x6 ***
> >>> $0 0x0000000000000000 at 0x000000001010cce0
> >>> v0 0x000000000000003d v1 0x000000000000024a
> >>> a0 0xffffffff807d7b70 a1 0x0000000000000000
> >>> a2 0x000000000000024a a3 0x0000000000000000
> >>> a4 0xffffffff807d7b60 a5 0x0000000000000080
> >>> a6 0x0000000000000001 a7 0xa800000411c62578
> >>> t0 0x0000000000000001 t1 0xa80000048ef3e880
> >>> t2 0xffffffff82d40000 t3 0xa80000041f48c000
> >>> s0 0xc0000000000d9640 s1 0xc000000000088028
> >>> s2 0x0000000000000000 s3 0x0000000000000180
> >>> s4 0x0000000000000000 s5 0x0000000000000000
> >>> s6 0xb7a89c196f513832 s7 0x0000000000000000
> >>> t8 0xffffffff807d0000 t9 0xffffffff807d0000
> >>> k0 0x0000000000000000 k1 0x00000000104dbcbf
> >>> gp 0xa80000041f48c000 sp 0xa80000041f48fcf0
> >>> s8 0x0000000000000000 ra 0xc0000000023c5004
> >>> epc 0xffffffff802b10b8
> >>> status 0x000000001058cce4 cause 0x0000000040008c08
> >>> sum0 0x0000002100000000 en0 0x0000009300008000
> >>> Code around epc
> >>> 0xffffffff802b10a8 000000002406ffff
> >>> 0xffffffff802b10ac 0000000064a5ffff
> >>> 0xffffffff802b10b0 0000000010a60005
> >>> 0xffffffff802b10b4 0000000000000000
> >>> 0xffffffff802b10b8 0000000080620000
> >>> 0xffffffff802b10bc 000000001440fffb
> >>> 0xffffffff802b10c0 0000000064630001
> >>> 0xffffffff802b10c4 000000006463ffff
> >>> 0xffffffff802b10c8 0000000003e00008
> >>>
> >>>
> >>
> >>
> >>
> >


* Re: Help with decoding a NMI Watchdog interrupt on an Octeon
  2010-06-21  5:55         ` Jan Rovins
  (?)
@ 2010-06-21 16:22         ` David Daney
  -1 siblings, 0 replies; 8+ messages in thread
From: David Daney @ 2010-06-21 16:22 UTC (permalink / raw)
  To: Jan Rovins; +Cc: 'Kevin D. Kissell', linux-mips

On 06/20/2010 10:55 PM, Jan Rovins wrote:
> Some additions & corrections to the previous:
>
>> -----Original Message-----
>> From: linux-mips-bounce@linux-mips.org [mailto:linux-mips-bounce@linux-
>> mips.org] On Behalf Of Jan Rovins
>> Sent: Saturday, June 19, 2010 3:14 PM
>> To: David Daney
>> Cc: Kevin D. Kissell; linux-mips@linux-mips.org
>> Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon
>>
>> David Daney wrote:
>>> On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
>>>> NMI is just an input pin, so you'd really need to know what it's
>>>> connected to in the system you're working on.
>>>
>>> In this case, the NMI is likely being asserted by the watchdog.  So if
>>> you are stuck in a loop with interrupts disabled, the register dump
>>> might help you figure out where things are stuck.  But as you say
>>> below, knowing the value of the ErrorEPC register is critical.
>> Thank you David & Kevin for the detailed information.
>>
>> Yes, in my case it's the watchdog, when I turn the watchdog off, the
>> machine just hangs, with no NMI dump.
>>
>> Ok, I added the code to Print out the ErrorEPC, and got:
>> ErrorEpc        0xc0000000023c5004
>> This address is not in vmlinux, but is the address of a loaded module.
>>
>> So, I poked around in /sys/module/ until I found one that had that
>> address range:
>> cat /sys/module/linux_bcm_core/sections/.text :0xc000000001c4e000
>>
>> And then did an objdump on this module. Since the module dump did not
>> contain the actual addresses that it was running from, I doctored up the
>> offsets by using the .text address from /sys/module/ of where the module
>> actually loaded.
>> objdump.cavium -d --adjust-vma 0xc000000001c4e000  linux-bcm-core.ko
>>

When looking at kernel modules, it can be helpful to show the 
relocations as well, so add '-r' to your objdump command line...

>> Just want to check if all this sounds correct so far? is my objdump
>> valid with the .text offset?
>>
>> I got a hit on the ErrorEPC value in my dump:
>> c0000000023c5004:       08000000        j       c000000000000000
>> <sal_dma_alloc-0x1c4e000>
>

... Once you turn on display of relocations, you can see where the jump
is really going.  The relocations are applied by the kernel when loading 
the module.
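
To see why the disassembly prints c000000000000000 as the target, it can help
to decode the instruction word by hand. This is a minimal sketch in Python,
assuming the standard MIPS J-type encoding: a 6-bit opcode, a 26-bit
instr_index, and a target formed from the upper bits of the delay-slot
address. The function name is hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def decode_mips_jump(word: int, pc: int):
    """Decode a MIPS J/JAL word located at `pc` (hypothetical helper).

    Returns (mnemonic, instr_index, target). The target keeps the upper
    bits of the delay-slot address (pc + 4) and replaces the low 28 bits
    with instr_index shifted left by two.
    """
    opcode = (word >> 26) & 0x3F
    if opcode not in (0x02, 0x03):
        raise ValueError("not a J or JAL instruction")
    instr_index = word & 0x03FFFFFF
    region = ((pc + 4) & MASK64) & ~0x0FFFFFFF
    mnemonic = "j" if opcode == 0x02 else "jal"
    return mnemonic, instr_index, region | (instr_index << 2)


if __name__ == "__main__":
    # Instruction word and address from the objdump line quoted above.
    name, index, target = decode_mips_jump(0x08000000, 0xC0000000023C5004)
    print(name, hex(index), hex(target))
```

A zero instr_index is consistent with a jump whose target has not been filled
in yet, so the printed address is only the start of the current 256 MB region,
and <sal_dma_alloc-0x1c4e000> is objdump expressing that address relative to a
symbol it knows about.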


>
> This line of code was inside a function called _default_assert, which on
> assertion failure did a printk() and went into an intentional infinite
> loop, which explains the NMI dump. The only thing that puzzles me now is
> that the assert failure printk was rarely displayed. Could that be because it
> was called while interrupts were turned off? I suppose that would stop it
> from showing up in /var/log/messages.
>
> The assembly still does not make sense to me (first time with MIPS assembly)
> but on examining the C code I think I understand what's going on here.
>

It seems like you may be onto the cause of the watchdog expiring; all
that's left is to figure out how you get into this spot in the first place.

David Daney


Thread overview: 8+ messages
2010-06-17 21:03 Help with decoding a NMI Watchdog interrupt on an Octeon Jan Rovins
2010-06-17 21:25 ` David Daney
2010-06-17 21:26 ` Kevin D. Kissell
2010-06-17 21:51   ` David Daney
2010-06-19 19:13     ` Jan Rovins
2010-06-21  5:55       ` Jan Rovins
2010-06-21  5:55         ` Jan Rovins
2010-06-21 16:22         ` David Daney
