public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: minyard@acm.org
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@osdl.org>, Ingo Molnar <mingo@elte.hu>
Subject: Re: [PATCH] fix i386 NMI watchdog checking
Date: Mon, 16 Oct 2006 21:48:13 -0400	[thread overview]
Message-ID: <1161049693.24364.15.camel@localhost.localdomain> (raw)
In-Reply-To: <20061013032221.GA10816@localdomain>

On Thu, 2006-10-12 at 22:22 -0500, Corey Minyard wrote:
> I was having a problem with the NMI testing hanging on an SMP system
> in check_nmi_watchdog() when using nmi_watchdog=2.  It doesn't seem to
> happen on a stock kernel, but I was working on something else and it
> triggered this problem.
> 
> This patch solves the problem.  I'm not sure this is quite the right
> solution, but I know that the local_irq_enable() is kind of pointless
> here and it seems that having scheduling on while the other CPUs are
> locked up with interrupts off is a bad idea.  And you can't call
> smp_call_function() with interrupts off.  But adding the preempt
> disable around this operation seems to solve the problem.

The patch makes sense. And you're right about the local_irq_enable.
Either it should be matched with a local_irq_disable (knowing that this
function is called with interrupts disabled), or it isn't needed.  The
later seems to be correct.

But this also shows something that bothers me.

> 
> Signed-off-by: Corey Minyard <minyard@acm.org>
> 
> Index: linux-2.6.18/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux-2.6.18.orig/arch/i386/kernel/nmi.c
> +++ linux-2.6.18/arch/i386/kernel/nmi.c
> @@ -134,12 +134,18 @@ static int __init check_nmi_watchdog(voi
>  
>  	printk(KERN_INFO "Testing NMI watchdog ... ");
>  
> +	/*
> +	 * We must have preempt off while testing the local APIC
> +	 * watchdog.  If we have an interrupt on this CPU while the
> +	 * other CPUs are wedged, and that interrupt tries to schedule
> +	 * (and possibly do an IPC), we would be hung.
> +	 */
> +	preempt_disable();
>  	if (nmi_watchdog == NMI_LOCAL_APIC)
>  		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
>  
>  	for_each_possible_cpu(cpu)
>  		prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
> -	local_irq_enable();
>  	mdelay((10*1000)/nmi_hz); // wait 10 ticks
>  
>  	for_each_possible_cpu(cpu) {
> @@ -151,6 +157,7 @@ static int __init check_nmi_watchdog(voi
>  #endif
>  		if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
>  			endflag = 1;
> +			preempt_enable();
>  			printk("CPU#%d: NMI appears to be stuck (%d->%d)!\n",
>  				cpu,
>  				prev_nmi_count[cpu],
> @@ -162,6 +169,7 @@ static int __init check_nmi_watchdog(voi
>  		}
>  	}
>  	endflag = 1;

endflag is a local variable!!!

Yes there's a printk and a kfree after it, but say we have printk turned
off (which is a possibility) and kfree happens to be quick. At that same
time, we have the other CPU (for a two CPU system) running.  If the
other CPU takes an interrupt (which is possible since interrupts are
turned on), it can miss the endflag set to one.  That is, this function
could return and call another function that resets the location on the
stack where endflag was, to zero.

This race is very unlikely, since the system will continue with one CPU,
and eventually that location will probably be set.  I don't know if this
can cause a dead lock or not, but it just seems like bad practice to use
a local variable to share across CPUS.  Why not just convert this to a
global static?  This function is only called once.

Ah what the heck, patch below (on top of this patch).  Only compiled
tested (trivial enough).

-- Steve

> +	preempt_enable();
>  	printk("OK.\n");
>  
>  	/* now that we know it works we can reduce NMI frequency to


Description:

Don't use a local stack variable to share across CPUS, even if it most
likely wont break anything.  It's just wrong.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.18/arch/i386/kernel/nmi.c
===================================================================
--- linux-2.6.18.orig/arch/i386/kernel/nmi.c	2006-10-16 21:34:08.000000000 -0400
+++ linux-2.6.18/arch/i386/kernel/nmi.c	2006-10-16 21:45:13.000000000 -0400
@@ -119,9 +119,10 @@ static __init void nmi_cpu_busy(void *da
 }
 #endif
 
+static volatile int end_flag __initdata = 0;
+
 static int __init check_nmi_watchdog(void)
 {
-	volatile int endflag = 0;
 	unsigned int *prev_nmi_count;
 	int cpu;
 
@@ -142,7 +143,7 @@ static int __init check_nmi_watchdog(voi
 	 */
 	preempt_disable();
 	if (nmi_watchdog == NMI_LOCAL_APIC)
-		smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);
+		smp_call_function(nmi_cpu_busy, (void *)&end_flag, 0, 0);
 
 	for_each_possible_cpu(cpu)
 		prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count;
@@ -156,7 +157,7 @@ static int __init check_nmi_watchdog(voi
 			continue;
 #endif
 		if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
-			endflag = 1;
+			end_flag = 1;
 			preempt_enable();
 			printk("CPU#%d: NMI appears to be stuck (%d->%d)!\n",
 				cpu,
@@ -168,7 +169,7 @@ static int __init check_nmi_watchdog(voi
 			return -1;
 		}
 	}
-	endflag = 1;
+	end_flag = 1;
 	preempt_enable();
 	printk("OK.\n");
 



      reply	other threads:[~2006-10-17  1:48 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-10-13  3:22 [PATCH] fix i386 NMI watchdog checking Corey Minyard
2006-10-17  1:48 ` Steven Rostedt [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1161049693.24364.15.camel@localhost.localdomain \
    --to=rostedt@goodmis.org \
    --cc=akpm@osdl.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=minyard@acm.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox