* Re: [PATCH] check nmi watchdog is broken
From: Dave Jones @ 2005-05-11 5:55 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Jack F Vogel, Andi Kleen, notting
On Sun, May 01, 2005 at 10:04:28AM -0700, Linux Kernel wrote:
> tree 6adb8d33585f8eee20794827c79e40991aeeaee5
> parent fd51f666fa591294bd7462447512666e61c56ea0
> author Jack F Vogel <jfv@bluesong.net> Sun, 01 May 2005 22:58:48 -0700
> committer Linus Torvalds <torvalds@ppc970.osdl.org> Sun, 01 May 2005 22:58:48 -0700
>
> [PATCH] check nmi watchdog is broken
>
> A bug against an xSeries system showed up recently noting that the
> check_nmi_watchdog() test was failing.
>
> I have been investigating it and discovered that, on both i386 and x86_64, the
> recent change to the routine to use the cpu_callin_map has uncovered a
> problem. Prior to that change, on an SMP box, the test was trivially
> passing because all CPUs were found to not yet be online. Now that the
> callin_map discovers them, it goes on to test the counters, which
> have not yet begun to increment, so it announces a CPU is stuck and bails
> out.
>
> On all the systems I have access to test, the announcement of failure is
> also bogus: by the time you can log in and check /proc/interrupts, the
> NMI count is happily incrementing on all CPUs. It's just that the test is
> being done too early.
>
> I have tried moving the call to the test around a bit, and it was always
> too early. I finally hit on this proposed solution: it delays the routine
> via a late_initcall(), which seems like the right solution to me.
My EM64T Dell Precision 470 seems to have problems both before
and after this change, though the behaviour changes.
With -rc3..
testing NMI watchdog ... CPU#1: NMI appears to be stuck (0)!
With -rc4...
Testing NMI watchdog ... CPU#4: NMI appears to be stuck (0)!
The CPU #'s are consistent across reboots (i.e., they're always the same).
Though the rc4 behaviour seems odd, as my CPUs are numbered 0-3.
Bill (Cc'd) also reports that with -rc4, he sees around
a 10 second delay when it prints that 'Testing NMI watchdog'
message.
Dave
* Re: [PATCH] check nmi watchdog is broken
From: Mikael Pettersson @ 2005-05-11 11:33 UTC (permalink / raw)
To: Dave Jones; +Cc: Linux Kernel Mailing List, Jack F Vogel, Andi Kleen, notting
Dave Jones writes:
> My EM64T Dell Precision 470 seems to have problems both before
> and after this change, though the behaviour changes.
>
> With -rc3..
> testing NMI watchdog ... CPU#1: NMI appears to be stuck (0)!
>
> With -rc4...
> Testing NMI watchdog ... CPU#4: NMI appears to be stuck (0)!
>
> The CPU #'s are consistent across reboots (i.e., they're always the same).
> Though the rc4 behaviour seems odd, as my CPUs are numbered 0-3.
There has been known breakage in x86-64's NMI watchdog recently.
One of the problems is that the checking code inspects
all possible CPU numbers [0..NR_CPUS[ and not just the
ones actually present, which leads to false failure reports.
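To illustrate the point, here is a minimal userspace sketch of that failure
mode. It is not the kernel's code: struct nmi_state, make_box() and both
check functions are hypothetical names made up for this example, NR_CPUS is
set to an arbitrary 8, and "stuck" is simplified to "the count did not move".

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8 /* compile-time maximum; arbitrary value for this sketch */

/* Hypothetical stand-in for kernel state: which CPU slots are populated,
 * and each slot's NMI count sampled before and after a short wait. */
struct nmi_state {
    bool present[NR_CPUS];
    unsigned int before[NR_CPUS];
    unsigned int after[NR_CPUS];
};

/* Build a box with `ncpus` healthy CPUs in slots 0..ncpus-1.
 * The remaining slots are empty, so their counters stay at zero. */
struct nmi_state make_box(int ncpus)
{
    struct nmi_state s = {0};

    assert(ncpus >= 0 && ncpus <= NR_CPUS);
    for (int cpu = 0; cpu < ncpus; cpu++) {
        s.present[cpu] = true;
        s.before[cpu] = 10;
        s.after[cpu] = 25; /* the count advanced, so this CPU is fine */
    }
    return s;
}

/* Flawed variant: scans every slot in [0, NR_CPUS), populated or not.
 * Returns the first CPU number reported as stuck, or -1 if none. */
int check_all_possible(const struct nmi_state *s)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (s->after[cpu] == s->before[cpu])
            return cpu;
    return -1;
}

/* Corrected variant: skips slots with no CPU behind them. */
int check_present_only(const struct nmi_state *s)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!s->present[cpu])
            continue;
        if (s->after[cpu] == s->before[cpu])
            return cpu;
    }
    return -1;
}
```

Under these assumptions, a box built with make_box(4) has CPUs 0-3 all
counting, yet check_all_possible() reports CPU 4, the first empty slot,
while check_present_only() reports nothing. That would be consistent with
a "CPU#4 ... stuck" message on a machine whose CPUs are numbered 0-3.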
> Bill (Cc'd) also reports that with -rc4, he sees around
> a 10 second delay when it prints that 'Testing NMI watchdog'
> message.
And that's another of the known x86-64 breakages.
/Mikael