* [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog()
2014-05-14 12:58 [PATCH 0/4] x86/nmi: improve NMI watchdog test David Vrabel
@ 2014-05-14 12:58 ` David Vrabel
2014-05-14 13:25 ` Jan Beulich
2014-05-15 11:40 ` Tim Deegan
2014-05-14 12:58 ` [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis() David Vrabel
` (3 subsequent siblings)
4 siblings, 2 replies; 20+ messages in thread
From: David Vrabel @ 2014-05-14 12:58 UTC (permalink / raw)
To: xen-devel; +Cc: Keir Fraser, David Vrabel, Jan Beulich
All callers of check_nmi_watchdog() already have local irqs enabled so
remove the unpaired local_irq_enable().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
xen/arch/x86/nmi.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 526020b..18d3820 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -126,7 +126,6 @@ int __init check_nmi_watchdog (void)
for_each_online_cpu ( cpu )
prev_nmi_count[cpu] = nmi_count(cpu);
- local_irq_enable();
/* Wait for 10 ticks. Busy-wait on all CPUs: the LAPIC counter that
* the NMI watchdog uses only runs while the core's not halted */
--
1.7.10.4
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog()
2014-05-14 12:58 ` [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog() David Vrabel
@ 2014-05-14 13:25 ` Jan Beulich
2014-05-14 13:31 ` David Vrabel
2014-05-15 11:40 ` Tim Deegan
1 sibling, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2014-05-14 13:25 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser
>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
> All callers of check_nmi_watchdog() already have local irqs enabled so
> remove the unpaired local_irq_enable().
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
> xen/arch/x86/nmi.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
> index 526020b..18d3820 100644
> --- a/xen/arch/x86/nmi.c
> +++ b/xen/arch/x86/nmi.c
> @@ -126,7 +126,6 @@ int __init check_nmi_watchdog (void)
>
> for_each_online_cpu ( cpu )
> prev_nmi_count[cpu] = nmi_count(cpu);
> - local_irq_enable();
I guess this will want to be replaced by a suitable ASSERT(), to make
sure eventual new callers adhere to the described rule.
Jan
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog()
2014-05-14 13:25 ` Jan Beulich
@ 2014-05-14 13:31 ` David Vrabel
2014-05-14 13:53 ` Jan Beulich
0 siblings, 1 reply; 20+ messages in thread
From: David Vrabel @ 2014-05-14 13:31 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, Keir Fraser
On 14/05/14 14:25, Jan Beulich wrote:
>>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
>> All callers of check_nmi_watchdog() already have local irqs enabled so
>> remove the unpaired local_irq_enable().
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>> xen/arch/x86/nmi.c | 1 -
>> 1 file changed, 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
>> index 526020b..18d3820 100644
>> --- a/xen/arch/x86/nmi.c
>> +++ b/xen/arch/x86/nmi.c
>> @@ -126,7 +126,6 @@ int __init check_nmi_watchdog (void)
>>
>> for_each_online_cpu ( cpu )
>> prev_nmi_count[cpu] = nmi_count(cpu);
>> - local_irq_enable();
>
> I guess this will want to be replaced by a suitable ASSERT(), to make
> sure eventual new callers adhere to the described rule.
smp_call_function() and on_selected_cpus() which is where this matters
already has such an ASSERT() so I don't think another one needs to be
added here.
David
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog()
2014-05-14 13:31 ` David Vrabel
@ 2014-05-14 13:53 ` Jan Beulich
0 siblings, 0 replies; 20+ messages in thread
From: Jan Beulich @ 2014-05-14 13:53 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser
>>> On 14.05.14 at 15:31, <david.vrabel@citrix.com> wrote:
> On 14/05/14 14:25, Jan Beulich wrote:
>>>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
>>> All callers of check_nmi_watchdog() already have local irqs enabled so
>>> remove the unpaired local_irq_enable().
>>>
>>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>>> ---
>>> xen/arch/x86/nmi.c | 1 -
>>> 1 file changed, 1 deletion(-)
>>>
>>> diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
>>> index 526020b..18d3820 100644
>>> --- a/xen/arch/x86/nmi.c
>>> +++ b/xen/arch/x86/nmi.c
>>> @@ -126,7 +126,6 @@ int __init check_nmi_watchdog (void)
>>>
>>> for_each_online_cpu ( cpu )
>>> prev_nmi_count[cpu] = nmi_count(cpu);
>>> - local_irq_enable();
>>
>> I guess this will want to be replaced by a suitable ASSERT(), to make
>> sure eventual new callers adhere to the described rule.
>
> smp_call_function() and on_selected_cpus() which is where this matters
> already has such an ASSERT() so I don't think another one needs to be
> added here.
Okay, that's true.
Jan
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog()
2014-05-14 12:58 ` [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog() David Vrabel
2014-05-14 13:25 ` Jan Beulich
@ 2014-05-15 11:40 ` Tim Deegan
1 sibling, 0 replies; 20+ messages in thread
From: Tim Deegan @ 2014-05-15 11:40 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser, Jan Beulich
At 13:58 +0100 on 14 May (1400072296), David Vrabel wrote:
> All callers of check_nmi_watchdog() already have local irqs enabled so
> remove the unpaired local_irq_enable().
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis()
2014-05-14 12:58 [PATCH 0/4] x86/nmi: improve NMI watchdog test David Vrabel
2014-05-14 12:58 ` [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog() David Vrabel
@ 2014-05-14 12:58 ` David Vrabel
2014-05-14 13:57 ` Jan Beulich
2014-05-14 12:58 ` [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog() David Vrabel
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: David Vrabel @ 2014-05-14 12:58 UTC (permalink / raw)
To: xen-devel; +Cc: Keir Fraser, David Vrabel, Jan Beulich
wait_for_nmis() runs for 10 ticks with local interrupts disabled,
which is quite a long time.
We occasionally see serial output (particularly with virtual serial
ports such as HP's iLO) stopping on the "Testing for NMIs" line. We
think the long delay in handling the serial interrupt is the cause.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
xen/arch/x86/nmi.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 18d3820..07488da 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -111,7 +111,14 @@ int nmi_active;
static void __init wait_for_nmis(void *p)
{
+ unsigned long flags;
+
+ local_save_flags(flags);
+ local_irq_enable();
+
mdelay((10*1000)/nmi_hz); /* wait 10 ticks */
+
+ local_irq_restore(flags);
}
int __init check_nmi_watchdog (void)
--
1.7.10.4
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis()
2014-05-14 12:58 ` [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis() David Vrabel
@ 2014-05-14 13:57 ` Jan Beulich
2014-05-14 14:38 ` David Vrabel
2014-05-15 11:55 ` Tim Deegan
0 siblings, 2 replies; 20+ messages in thread
From: Jan Beulich @ 2014-05-14 13:57 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser
>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
> --- a/xen/arch/x86/nmi.c
> +++ b/xen/arch/x86/nmi.c
> @@ -111,7 +111,14 @@ int nmi_active;
>
> static void __init wait_for_nmis(void *p)
> {
> + unsigned long flags;
> +
> + local_save_flags(flags);
> + local_irq_enable();
> +
> mdelay((10*1000)/nmi_hz); /* wait 10 ticks */
> +
> + local_irq_restore(flags);
> }
This being the callback for on_selected_cpus(), i.e. called out of
interrupt context, I don't think it is uniformly safe to enable
interrupts here. The current behavior anyway is for the function
to run with interrupts enabled on the boot CPU, and with interrupts
disabled on the APs. So for this change to make a difference, the
IRQ would need to be bound to one of the APs, and consequently
the fix would be to force it onto the BP until boot progressed far
enough.
Jan
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis()
2014-05-14 13:57 ` Jan Beulich
@ 2014-05-14 14:38 ` David Vrabel
2014-05-14 14:45 ` Jan Beulich
2014-05-15 11:55 ` Tim Deegan
1 sibling, 1 reply; 20+ messages in thread
From: David Vrabel @ 2014-05-14 14:38 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, Keir Fraser
On 14/05/14 14:57, Jan Beulich wrote:
>>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
>> --- a/xen/arch/x86/nmi.c
>> +++ b/xen/arch/x86/nmi.c
>> @@ -111,7 +111,14 @@ int nmi_active;
>>
>> static void __init wait_for_nmis(void *p)
>> {
>> + unsigned long flags;
>> +
>> + local_save_flags(flags);
>> + local_irq_enable();
>> +
>> mdelay((10*1000)/nmi_hz); /* wait 10 ticks */
>> +
>> + local_irq_restore(flags);
>> }
>
> This being the callback for on_selected_cpus(), i.e. called out of
> interrupt context, I don't think it is uniformly safe to enable
> interrupts here. The current behavior anyway is for the function
> to run with interrupts enabled on the boot CPU, and with interrupts
> disabled on the APs. So for this change to make a difference, the
> IRQ would need to be bound to one of the APs, and consequently
> the fix would be to force it onto the BP until boot progressed far
> enough.
The call function IPIs are handled with a direct apic vector so skip the
normal interrupt enabling path for irqs. I can't see anywhere where
irqs are enabled in call_function_interrupt() (x86) and
smp_call_function_interrupt() (common).
David
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis()
2014-05-14 14:38 ` David Vrabel
@ 2014-05-14 14:45 ` Jan Beulich
0 siblings, 0 replies; 20+ messages in thread
From: Jan Beulich @ 2014-05-14 14:45 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser
>>> On 14.05.14 at 16:38, <david.vrabel@citrix.com> wrote:
> On 14/05/14 14:57, Jan Beulich wrote:
>>>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
>>> --- a/xen/arch/x86/nmi.c
>>> +++ b/xen/arch/x86/nmi.c
>>> @@ -111,7 +111,14 @@ int nmi_active;
>>>
>>> static void __init wait_for_nmis(void *p)
>>> {
>>> + unsigned long flags;
>>> +
>>> + local_save_flags(flags);
>>> + local_irq_enable();
>>> +
>>> mdelay((10*1000)/nmi_hz); /* wait 10 ticks */
>>> +
>>> + local_irq_restore(flags);
>>> }
>>
>> This being the callback for on_selected_cpus(), i.e. called out of
>> interrupt context, I don't think it is uniformly safe to enable
>> interrupts here. The current behavior anyway is for the function
>> to run with interrupts enabled on the boot CPU, and with interrupts
>> disabled on the APs. So for this change to make a difference, the
>> IRQ would need to be bound to one of the APs, and consequently
>> the fix would be to force it onto the BP until boot progressed far
>> enough.
>
> The call function IPIs are handled with a direct apic vector so skip the
> normal interrupt enabling path for irqs. I can't see anywhere where
> irqs are enabled in call_function_interrupt() (x86) and
> smp_call_function_interrupt() (common).
Exactly, they aren't, because it's not generally safe to do so. And if
the generic code doesn't do so, you'll need a rather good reason to do
this manually.
Jan
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis()
2014-05-14 13:57 ` Jan Beulich
2014-05-14 14:38 ` David Vrabel
@ 2014-05-15 11:55 ` Tim Deegan
1 sibling, 0 replies; 20+ messages in thread
From: Tim Deegan @ 2014-05-15 11:55 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser, Jan Beulich
At 14:57 +0100 on 14 May (1400075856), Jan Beulich wrote:
> >>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
> > --- a/xen/arch/x86/nmi.c
> > +++ b/xen/arch/x86/nmi.c
> > @@ -111,7 +111,14 @@ int nmi_active;
> >
> > static void __init wait_for_nmis(void *p)
> > {
> > + unsigned long flags;
> > +
> > + local_save_flags(flags);
> > + local_irq_enable();
> > +
> > mdelay((10*1000)/nmi_hz); /* wait 10 ticks */
> > +
> > + local_irq_restore(flags);
> > }
>
> This being the callback for on_selected_cpus(), i.e. called out of
> interrupt context, I don't think it is uniformly safe to enable
> interrupts here.
Yes, I think if you really need this (and I don't see what the problem
is with serial interrupts being delayed as long as they're not lost),
you'll have to use some sort of softirq context to do this spinning.
Tim.
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog()
2014-05-14 12:58 [PATCH 0/4] x86/nmi: improve NMI watchdog test David Vrabel
2014-05-14 12:58 ` [PATCH 1/4] x86/nmi: remove spurious local_irq_enable from check_nmi_watchdog() David Vrabel
2014-05-14 12:58 ` [PATCH 2/4] x86/nmi: enable local irqs in wait_for_nmis() David Vrabel
@ 2014-05-14 12:58 ` David Vrabel
2014-05-14 13:59 ` Jan Beulich
2014-05-15 11:47 ` Tim Deegan
2014-05-14 12:58 ` [PATCH 4/4] x86/nmi: be less verbose when testing the NMI watchdog David Vrabel
2014-05-14 13:03 ` [PATCH 0/4] x86/nmi: improve NMI watchdog test Andrew Cooper
4 siblings, 2 replies; 20+ messages in thread
From: David Vrabel @ 2014-05-14 12:58 UTC (permalink / raw)
To: xen-devel; +Cc: Keir Fraser, David Vrabel, Jan Beulich
The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
if all CPUs have been spinning for 5 or more ticks. There may be
delays in waking other CPUs from deep power states that can mean that
when the counts are checked CPUs haven't run for long enough.
Fix this by waiting for all CPUs to have delayed in wait_for_nmis().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
xen/arch/x86/nmi.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 07488da..4f330d8 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -136,9 +136,7 @@ int __init check_nmi_watchdog (void)
/* Wait for 10 ticks. Busy-wait on all CPUs: the LAPIC counter that
* the NMI watchdog uses only runs while the core's not halted */
- if ( nmi_watchdog == NMI_LOCAL_APIC )
- smp_call_function(wait_for_nmis, NULL, 0);
- wait_for_nmis(NULL);
+ on_selected_cpus(&cpu_online_map, wait_for_nmis, NULL, 1);
for_each_online_cpu ( cpu )
{
--
1.7.10.4
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog()
2014-05-14 12:58 ` [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog() David Vrabel
@ 2014-05-14 13:59 ` Jan Beulich
2014-05-14 14:09 ` David Vrabel
2014-05-14 14:09 ` Andrew Cooper
2014-05-15 11:47 ` Tim Deegan
1 sibling, 2 replies; 20+ messages in thread
From: Jan Beulich @ 2014-05-14 13:59 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser
>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
> The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
> if all CPUs have been spinning for 5 or more ticks. There may be
> delays in waking other CPUs from deep power states that can mean that
> when the counts are checked CPUs haven't run for long enough.
5 ticks ought to be a couple of orders of a magnitude longer than
the worst possible wakeup time. I.e. I don't buy this argument
without actual numbers to support it.
Jan
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog()
2014-05-14 13:59 ` Jan Beulich
@ 2014-05-14 14:09 ` David Vrabel
2014-05-14 14:33 ` Jan Beulich
2014-05-14 14:09 ` Andrew Cooper
1 sibling, 1 reply; 20+ messages in thread
From: David Vrabel @ 2014-05-14 14:09 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Andrew Cooper
On 14/05/14 14:59, Jan Beulich wrote:
>>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
>> The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
>> if all CPUs have been spinning for 5 or more ticks. There may be
>> delays in waking other CPUs from deep power states that can mean that
>> when the counts are checked CPUs haven't run for long enough.
>
> 5 ticks ought to be a couple of orders of a magnitude longer than
> the worst possible wakeup time. I.e. I don't buy this argument
> without actual numbers to support it.
This was a change Andrew asked for and the reasoning he used, but I
agree that it seems highly implausible.
I think this is a worthwhile cleanup anyway so you can apply with the
below commit message or drop it (your preference).
David
8<-------
x86/nmi: wait for all CPUs in check_nmi_watchdog()
The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
if all CPUs have been spinning for 5 or more ticks. Whilst its highly
implausible that this won't happen in practise, explicitly wait so it's
clear that this is required.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog()
2014-05-14 14:09 ` David Vrabel
@ 2014-05-14 14:33 ` Jan Beulich
0 siblings, 0 replies; 20+ messages in thread
From: Jan Beulich @ 2014-05-14 14:33 UTC (permalink / raw)
To: David Vrabel; +Cc: AndrewCooper, Keir Fraser, xen-devel
>>> On 14.05.14 at 16:09, <david.vrabel@citrix.com> wrote:
> x86/nmi: wait for all CPUs in check_nmi_watchdog()
>
> The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
> if all CPUs have been spinning for 5 or more ticks. Whilst its highly
> implausible that this won't happen in practise, explicitly wait so it's
> clear that this is required.
The thing is that I actually consider the early continuation to be
desirable - there's no need for the BP to actually wait for the APs
to finish their wait loop, as long as they progressed far enough
into it. If you added a flag telling the others to bail as soon as the
master exited its waiting, that would look more acceptable to me.
Agreed we're not talking about big time differences here, but
we shouldn't be making booting slower than it needs to be. Btw.,
did you check how modern Linux deals with the same issue (last I
looked was many releases ago)?
Jan
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog()
2014-05-14 13:59 ` Jan Beulich
2014-05-14 14:09 ` David Vrabel
@ 2014-05-14 14:09 ` Andrew Cooper
1 sibling, 0 replies; 20+ messages in thread
From: Andrew Cooper @ 2014-05-14 14:09 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, Keir Fraser, David Vrabel
On 14/05/14 14:59, Jan Beulich wrote:
>>>> On 14.05.14 at 14:58, <david.vrabel@citrix.com> wrote:
>> The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
>> if all CPUs have been spinning for 5 or more ticks. There may be
>> delays in waking other CPUs from deep power states that can mean that
>> when the counts are checked CPUs haven't run for long enough.
> 5 ticks ought to be a couple of orders of a magnitude longer than
> the worst possible wakeup time. I.e. I don't buy this argument
> without actual numbers to support it.
>
> Jan
Its not necesserily the wakeup time, although on some systems that does
appear to be a consideration. Simple A
We have a prototype 8 socket system where the cpus on the further
sockets had progressively less nmi delta, and are mostly declared stuck
as the BSP reads the delta before the APs have completed their busy loop.
~Andrew
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog()
2014-05-14 12:58 ` [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog() David Vrabel
2014-05-14 13:59 ` Jan Beulich
@ 2014-05-15 11:47 ` Tim Deegan
1 sibling, 0 replies; 20+ messages in thread
From: Tim Deegan @ 2014-05-15 11:47 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser, Jan Beulich
At 13:58 +0100 on 14 May (1400072298), David Vrabel wrote:
> The counting of a CPUs NMIs in check_nmi_watchdog() is only reliable
> if all CPUs have been spinning for 5 or more ticks. There may be
> delays in waking other CPUs from deep power states that can mean that
> when the counts are checked CPUs haven't run for long enough.
>
> Fix this by waiting for all CPUs to have delayed in wait_for_nmis().
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
I think we could separately address the boot time issue by having
wait_for_nmis() exit after it's seen 5 NMIs _or_ counted 10 ticks.
Tim.
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 4/4] x86/nmi: be less verbose when testing the NMI watchdog
2014-05-14 12:58 [PATCH 0/4] x86/nmi: improve NMI watchdog test David Vrabel
` (2 preceding siblings ...)
2014-05-14 12:58 ` [PATCH 3/4] x86/nmi: wait for all CPUs in check_nmi_watchdog() David Vrabel
@ 2014-05-14 12:58 ` David Vrabel
2014-05-15 11:48 ` Tim Deegan
2014-05-14 13:03 ` [PATCH 0/4] x86/nmi: improve NMI watchdog test Andrew Cooper
4 siblings, 1 reply; 20+ messages in thread
From: David Vrabel @ 2014-05-14 12:58 UTC (permalink / raw)
To: xen-devel; +Cc: Keir Fraser, David Vrabel, Jan Beulich
There's no need to print all the CPUs that are ok, only the ones that
got stuck.
The resulting output is either:
Testing NMI watchdog on all CPUs: 1 4 6 stuck
or
Testing NMI watchdog on all CPUs: ok
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
xen/arch/x86/nmi.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 4f330d8..84c1a5e 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -125,11 +125,12 @@ int __init check_nmi_watchdog (void)
{
static unsigned int __initdata prev_nmi_count[NR_CPUS];
int cpu;
-
+ bool_t ok = 1;
+
if ( !nmi_watchdog )
return 0;
- printk("Testing NMI watchdog --- ");
+ printk("Testing NMI watchdog on all CPUs:");
for_each_online_cpu ( cpu )
prev_nmi_count[cpu] = nmi_count(cpu);
@@ -141,12 +142,13 @@ int __init check_nmi_watchdog (void)
for_each_online_cpu ( cpu )
{
if ( nmi_count(cpu) - prev_nmi_count[cpu] <= 5 )
- printk("CPU#%d stuck. ", cpu);
- else
- printk("CPU#%d okay. ", cpu);
+ {
+ printk(" %d", cpu);
+ ok = 0;
+ }
}
- printk("\n");
+ printk(" %s\n", ok ? "ok" : "stuck");
/*
* Now that we know it works we can reduce NMI frequency to
--
1.7.10.4
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 0/4] x86/nmi: improve NMI watchdog test
2014-05-14 12:58 [PATCH 0/4] x86/nmi: improve NMI watchdog test David Vrabel
` (3 preceding siblings ...)
2014-05-14 12:58 ` [PATCH 4/4] x86/nmi: be less verbose when testing the NMI watchdog David Vrabel
@ 2014-05-14 13:03 ` Andrew Cooper
4 siblings, 0 replies; 20+ messages in thread
From: Andrew Cooper @ 2014-05-14 13:03 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Keir Fraser, Jan Beulich
On 14/05/14 13:58, David Vrabel wrote:
> Serial output (usually virtual serial like HP's iLO) occasional stops
> on the Testing NMI watchdog line. This series should fix this (see
> patch #2) and cleans-up a few other bits and pieces as well.
>
> David
All patches Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 20+ messages in thread