* NMI watchdog
@ 2007-10-12  9:18 John Sigler
  2007-10-12 10:00 ` Björn Steinbrink
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: John Sigler @ 2007-10-12  9:18 UTC
  To: linux-kernel, linux-rt-users

Hello,

I'm experiencing a full system lockup. I'm using an out-of-tree driver 
which I suspect is responsible. I'm trying to enable the NMI watchdog.

# cat /proc/version
Linux version 2.6.22.1-rt9 (gcc version 3.4.6) #1 PREEMPT RT Tue Oct 9 12:25:47 CEST 2007

# cat /proc/cmdline
ro root=/dev/hdc1 console=ttyS0,57600n8 console=tty0 panic=3 apic=debug nmi_watchdog=2

However, after boot, the NMI count does not change.

# cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts
            CPU0
   0:         99   IO-APIC-edge      timer
   4:       3822   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      16443   IO-APIC-edge      ide1
  16:       2166   IO-APIC-fasteoi   eth0
  17:        840   IO-APIC-fasteoi   eth1
  18:        840   IO-APIC-fasteoi   eth2
  19:        840   IO-APIC-fasteoi   eth3
  20:          0   IO-APIC-fasteoi   Dta1xx
  21:          0   IO-APIC-fasteoi   Dta1xx
NMI:       2895
LOC:     168445
ERR:          0
MIS:          0

            CPU0
   0:         99   IO-APIC-edge      timer
   4:       3822   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      16467   IO-APIC-edge      ide1
  16:       2173   IO-APIC-fasteoi   eth0
  17:        845   IO-APIC-fasteoi   eth1
  18:        845   IO-APIC-fasteoi   eth2
  19:        845   IO-APIC-fasteoi   eth3
  20:          0   IO-APIC-fasteoi   Dta1xx
  21:          0   IO-APIC-fasteoi   Dta1xx
NMI:       2895
LOC:     169448
ERR:          0
MIS:          0

Does this mean the NMI watchdog is not working correctly on my system?
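
The before/after comparison above can be automated. The following is a
minimal sketch and not part of the original message: the helper names
(nmi_count, nmi_delta, sample_nmi_delta) are hypothetical, and it assumes
the NMI row in /proc/interrupts starts with the label "NMI:" followed by
one counter per CPU.

```python
import time


def nmi_count(interrupts_text):
    """Return the NMI count summed over all CPUs, or None if no NMI row exists."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            total = 0
            for field in fields[1:]:
                # Newer kernels append a text description after the counters.
                if not field.isdigit():
                    break
                total += int(field)
            return total
    return None


def nmi_delta(before_text, after_text):
    """Return how many NMIs arrived between two snapshots (None if unknown)."""
    before = nmi_count(before_text)
    after = nmi_count(after_text)
    if before is None or after is None:
        return None
    return after - before


def sample_nmi_delta(interval=10.0, path="/proc/interrupts"):
    """Read the file twice, `interval` seconds apart, and return the NMI delta."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return nmi_delta(before, after)
```

Applied to the two snapshots above, nmi_delta returns 0, because the NMI
row reads 2895 in both.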

# dmesg | grep NMI
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
Testing NMI watchdog ... OK.

Regards.

* NMI watchdog
@ 2015-03-30 12:15 Justin Keller
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Keller @ 2015-03-30 12:15 UTC
  To: linux-watchdog

Hello,
I am not running a vanilla kernel on this machine, so I first reported the
issue to the distribution's bug tracking system. It has been almost a week
with no response, so I am sending this email.

Several times, on returning to my computer after being away for a little
while, I have noticed:
Message from syslogd@redacted at Mar 23 XX:XX:XX ...
kernel:[1059322.470817] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:31]

Running "dmesg | grep NMI" produced:
[1151200.727734] sending NMI to all CPUs:
[1151200.727812] NMI backtrace for cpu 0
[1151200.764129] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 36.262 msecs
[1151200.764198] NMI backtrace for cpu 1
[1151216.700893] sending NMI to all CPUs:
[1151216.700984] NMI backtrace for cpu 1
[1151216.706524] NMI backtrace for cpu 0
[1723994.455161] <NMI> [<ffffffff81554a5e>] ? dump_stack+0x41/0x51

I didn't have time to grep for kswapd or to investigate further. Long story
short, the machine was shut down shortly afterwards.

Justin

* NMI watchdog
@ 2015-03-30 12:14 Justin Keller
  2015-03-30 17:09 ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Keller @ 2015-03-30 12:14 UTC
  To: linux-kernel

Hello,
I am not running a vanilla kernel on this machine, so I first reported
the issue to the distribution's bug tracking system. It has been
almost a week with no response, so I am sending this email.

Several times, on returning to my computer after being away for a
little while, I have noticed:
Message from syslogd@redacted at Mar 23 XX:XX:XX ...
kernel:[1059322.470817] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:31]

Running "dmesg | grep NMI" produced:
[1151200.727734] sending NMI to all CPUs:
[1151200.727812] NMI backtrace for cpu 0
[1151200.764129] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 36.262 msecs
[1151200.764198] NMI backtrace for cpu 1
[1151216.700893] sending NMI to all CPUs:
[1151216.700984] NMI backtrace for cpu 1
[1151216.706524] NMI backtrace for cpu 0
[1723994.455161] <NMI> [<ffffffff81554a5e>] ? dump_stack+0x41/0x51
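
For triage, the soft lockup report quoted above can be picked apart
mechanically. This is only a sketch under stated assumptions: the helper
name parse_soft_lockup is hypothetical, and the pattern is modelled on the
single message quoted in this email, so other kernel versions may word it
differently.

```python
import re

# Pattern modelled on the one message quoted above; the wording is an
# assumption and may differ between kernel versions.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(text):
    """Return a dict describing the first soft lockup report in text, or None."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = SOFT_LOCKUP_RE.search(flat)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "seconds": int(match.group("seconds")),
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
    }
```

Fed the message quoted above, this yields CPU 1, 22 seconds, task kswapd0,
PID 31, whether or not the line was wrapped in transit.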

I didn't have time to grep for kswapd or to investigate further. Long
story short, the machine was shut down shortly afterwards.

Justin

PS: this was also sent to linux-watchdog. I forgot to turn off HTML, so
I had to re-send it here.

* NMI watchdog...
@ 2009-01-29 23:54 David Miller
  0 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2009-01-29 23:54 UTC
  To: sparclinux


I just wanted to let folks know what I've been working on, sparc-wise.

I have this recurring issue where one of my workstations hangs
completely: no keyboard input, no console messages, nothing.

Since we have pseudo-NMI support in oprofile via performance counters
in the current tree I worked on rearchitecting this so that a nice NMI
watchdog layer could be added.

It is modelled after the x86 NMI watchdog, with the major difference
being that it is enabled by default.  The cost is one interrupt per
second, and the payback is enormous wrt. the ability to debug complete
system hangs.

Basically, it works like this: if we see no timer interrupts processed
for 5 seconds, we print a message, dump registers, and optionally panic
the system.
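
The check described above can be modelled in a few lines. This is a
user-space illustration only, not the sparc implementation: the class and
method names are hypothetical, and it assumes the watchdog NMI fires about
once per second and compares a running count of processed timer interrupts
against the value it saw on the previous tick.

```python
class WatchdogSketch:
    """Toy model of an NMI watchdog that looks for stalled timer interrupts."""

    def __init__(self, threshold_ticks=5, panic_on_timeout=False):
        self.threshold_ticks = threshold_ticks    # about 5 seconds at 1 NMI/sec
        self.panic_on_timeout = panic_on_timeout  # models the "optionally panic" knob
        self.last_timer_count = 0
        self.stalled_ticks = 0

    def nmi_tick(self, timer_count):
        """Handle one watchdog NMI; timer_count is the timer interrupts seen so far."""
        if timer_count != self.last_timer_count:
            # Timer interrupts are still being processed: reset the stall counter.
            self.last_timer_count = timer_count
            self.stalled_ticks = 0
            return "ok"
        self.stalled_ticks += 1
        if self.stalled_ticks < self.threshold_ticks:
            return "ok"
        # A real handler would print a message and dump registers at this point.
        return "panic" if self.panic_on_timeout else "report"
```

With the default threshold, five consecutive ticks with an unchanged timer
count produce "report" (or "panic" when panic_on_timeout is set); any
progress in the count resets the stall counter.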

This will be supported on any system that has profiling counter
overflow interrupt support.  That essentially means any cpu from
UltraSPARC-III onward (including Niagara chips).

Another nice side effect of this work is that it gives us some of the
framework necessary for whatever generic performance counter layer
gets merged into the tree in the future (Ingo Molnar's work, perfmon3,
whatever).

I noticed while doing these changes that we need some work in the
handling of OOPSes and other errors.  In particular we need to start
using the existing generic infrastructure the kernel provides, such as
oops_enter(), oops_exit(), bust_spinlocks(), etc.  I do intend to work
on this.

I'm currently busy doing testing to make sure that the NMI watchdog
and oprofile work as expected.

I'll post the patches when I check them in.  I intend to push this
into the current stable tree because there are entire classes of bugs
people run into which can't be analyzed at all without this kind of
facility.

* NMI Watchdog
@ 2003-11-14 10:12 Maciej Zenczykowski
  2003-11-14 10:29 ` Mikael Pettersson
  0 siblings, 1 reply; 15+ messages in thread
From: Maciej Zenczykowski @ 2003-11-14 10:12 UTC
  To: Linux Kernel Mailing List

Hi,

How do I go about getting the NMI Watchdog to work on a Celeron Mendocino
400 MHz with no local APIC? nmi_watchdog=1 and nmi_watchdog=2 don't work
there, while the same kernel works on a 1 GHz P3 with a local APIC
(/proc/interrupts shows NMIs occurring once per second).

Thanks,
MaZe.




Thread overview: 15+ messages
2007-10-12  9:18 NMI watchdog John Sigler
2007-10-12 10:00 ` Björn Steinbrink
2007-10-12 10:58   ` John Sigler
2007-10-12 10:26 ` Steven Rostedt
2007-10-12 13:26   ` John Sigler
2007-10-12 16:12     ` Steven Rostedt
2007-10-17 12:20       ` John Sigler
2007-10-12 14:48 ` Arjan van de Ven
2007-10-15 16:05   ` John Sigler
  -- strict thread matches above, loose matches on Subject: below --
2015-03-30 12:15 Justin Keller
2015-03-30 12:14 Justin Keller
2015-03-30 17:09 ` Michal Hocko
2009-01-29 23:54 David Miller
2003-11-14 10:12 NMI Watchdog Maciej Zenczykowski
2003-11-14 10:29 ` Mikael Pettersson
