From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Cooper Subject: Re: [PATCH 2/2] x86/crash: Disable the watchdog NMIs on the crashing cpu. Date: Mon, 18 Nov 2013 11:09:20 +0000 Message-ID: <5289F560.9020206@citrix.com> References: <1384547567-17059-1-git-send-email-andrew.cooper3@citrix.com> <1384547567-17059-3-git-send-email-andrew.cooper3@citrix.com> <5289EB6F0200007800103EBB@nat28.tlf.novell.com> <5289ECEC.4030207@citrix.com> <528A02690200007800103F53@nat28.tlf.novell.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from mail6.bemta5.messagelabs.com ([195.245.231.135]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1ViMho-0004SY-R3 for xen-devel@lists.xenproject.org; Mon, 18 Nov 2013 11:09:25 +0000 In-Reply-To: <528A02690200007800103F53@nat28.tlf.novell.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Jan Beulich Cc: xen-devel , Keir Fraser , David Vrabel , Tim Deegan List-Id: xen-devel@lists.xenproject.org On 18/11/13 11:04, Jan Beulich wrote: >>>> On 18.11.13 at 11:33, Andrew Cooper wrote: >> On 18/11/13 09:26, Jan Beulich wrote: >>>>>> On 15.11.13 at 21:32, Andrew Cooper wrote: >>>> --- a/xen/arch/x86/crash.c >>>> +++ b/xen/arch/x86/crash.c >>>> @@ -118,6 +118,7 @@ static void nmi_shootdown_cpus(void) >>>> unsigned long msecs; >>>> int i, cpu = smp_processor_id(); >>>> >>>> + disable_lapic_nmi_watchdog(); >>>> local_irq_disable(); >>>> >>>> crashing_cpu = cpu; >>> _If_ you do this here, I wonder why it's being done before >>> disabling interrupts. >>> >>> But then again I wonder whether it wouldn't be better to do >>> this even earlier (i.e. by passing a flag to watchdog_disable()), >>> as the NMI watchdog becomes useless with that call being done >>> from kexec_common_shutdown(). >> Disabling interrupts here is more defensive coding than anything else. >> It is not expected to be able to get here with interrupts enabled, but >> in a crash >> >> Putting this in watchdog_disable() would result in a race condition. >> disable_lapic_nmi_watchdog() mutates global state, meaning that it can >> only possibly run correctly on a single cpu. In an ideal world with >> plenty of time, the lapic watchdog code could be improved. However, >> restricting its use until after one_cpu_only() is the easiest fix. > Just looked at disable_lapic_nmi_watchdog() another time, and I > don't see the race. What I do see though is that you'd need to > run this on each CPU for it to have the full intended effect... > > Jan > disable_lapic_nmi_watchdog() sets nmi_active to -1, which casues subsequent calls to exit early. There are possible crash paths where a first cpu will execute watchdog_disable() but a subsequent cpu will complete the kexec path. Hooking disable_lapic_nmi_watchdog() off watchdog_disable() will result in certain paths where the wrong cpu ends up with its MSRs properly disabled. ~Andrew disable_lapic_nmi_watchdog()