[PATCH] x86/watchdog: Use real timestamps for watchdog timeout

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
@ 2013-05-23 20:32 Andrew Cooper
  2013-05-24  7:09 ` Jan Beulich
  2013-05-24  9:37 ` Tim Deegan
  0 siblings, 2 replies; 19+ messages in thread
From: Andrew Cooper @ 2013-05-23 20:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan, Keir Fraser, Jan Beulich

Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
test system being debugged, I observed a PCI SERR being continuously asserted
without the SERR bit being set.  The result was Xen "exceeding" a 300 second
timeout within 1 second.

Change the nmi_watchdog_tick() timecounting to use timestamps rather than a
rate calculated from nmi_hz (which itself has been seen to deviate on some
systems due to Turbo/Pstates).

Also, move the comment to a more appropriate place (as we would expect to
enter the old if() block once a second anyway), and fix up two trailing
whitespace issues.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r c6fb586f83a0 -r ebb0070be9fd xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -396,7 +396,7 @@ static struct notifier_block cpu_nmi_nfb
 };
 
 static DEFINE_PER_CPU(unsigned int, last_irq_sums);
-static DEFINE_PER_CPU(unsigned int, alert_counter);
+static DEFINE_PER_CPU(s_time_t, last_irq_change);
 
 static atomic_t watchdog_disable_count = ATOMIC_INIT(1);
 
@@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r
     if ( (this_cpu(last_irq_sums) == sum) &&
          !atomic_read(&watchdog_disable_count) )
     {
-        /*
-         * Ayiee, looks like this CPU is stuck ... wait for the timeout
-         * before doing the oops ...
-         */
-        this_cpu(alert_counter)++;
-        if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
+        s_time_t last_change = this_cpu(last_irq_change);
+
+        if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) )
         {
+            /* Ayiee, looks like this CPU is stuck. */
+
             console_force_unlock();
             printk("Watchdog timer detects that CPU%d is stuck!\n",
                    smp_processor_id());
             fatal_trap(TRAP_nmi, regs);
         }
-    } 
-    else 
+    }
+    else
     {
         this_cpu(last_irq_sums) = sum;
-        this_cpu(alert_counter) = 0;
+        this_cpu(last_irq_change) = NOW();
     }
 
     if ( nmi_perfctr_msr )

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-23 20:32 [PATCH] x86/watchdog: Use real timestamps for watchdog timeout Andrew Cooper
@ 2013-05-24  7:09 ` Jan Beulich
  2013-05-24  9:57   ` Andrew Cooper
  2013-05-24  9:37 ` Tim Deegan
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2013-05-24  7:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Tim Deegan, Keir Fraser, xen-devel

>>> On 23.05.13 at 22:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
> test system being debugged, I observed a PCI SERR being continuously 
> asserted
> without the SERR bit being set.  The result was Xen "exceeding" a 300 second
> timeout within 1 second.

This is a questionable rationale for the patch, no matter that
conceptually I don't mind a change like this. On broken systems
like this it may be more reasonable to require the watchdog (which
is disabled by default anyway) to not get enabled.

> @@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r
>      if ( (this_cpu(last_irq_sums) == sum) &&
>           !atomic_read(&watchdog_disable_count) )
>      {
> -        /*
> -         * Ayiee, looks like this CPU is stuck ... wait for the timeout
> -         * before doing the oops ...
> -         */
> -        this_cpu(alert_counter)++;
> -        if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
> +        s_time_t last_change = this_cpu(last_irq_change);
> +
> +        if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) )

You can't use NOW() here - while the time updating code is safe
against normal interrupts, it's not atomic wrt NMIs.

Since you don't really need this calculation to be very precise,
doing it using plain TSC values might be acceptable, with the one
caveat that you'd need to check that doing this in the middle of a
TSC update (by time_calibration_tsc_rendezvous()) is still safe
and sufficiently precise.

The only other (non-generic) alternative I see is to use the HPET
main counter if 64-bit capable _and_ we ran it as 64-bit counter.
But that would neither cover all machines nor be backportable
(because of 32-bit's constraints).

Jan

>          {
> +            /* Ayiee, looks like this CPU is stuck. */
> +
>              console_force_unlock();
>              printk("Watchdog timer detects that CPU%d is stuck!\n",
>                     smp_processor_id());
>              fatal_trap(TRAP_nmi, regs);
>          }
> -    } 
> -    else 
> +    }
> +    else
>      {
>          this_cpu(last_irq_sums) = sum;
> -        this_cpu(alert_counter) = 0;
> +        this_cpu(last_irq_change) = NOW();
>      }
>  
>      if ( nmi_perfctr_msr )

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-23 20:32 [PATCH] x86/watchdog: Use real timestamps for watchdog timeout Andrew Cooper
  2013-05-24  7:09 ` Jan Beulich
@ 2013-05-24  9:37 ` Tim Deegan
  2013-05-24 10:03   ` Andrew Cooper
  1 sibling, 1 reply; 19+ messages in thread
From: Tim Deegan @ 2013-05-24  9:37 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir Fraser, Jan Beulich, xen-devel

At 21:32 +0100 on 23 May (1369344726), Andrew Cooper wrote:
> Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
> test system being debugged, I observed a PCI SERR being continuously asserted
> without the SERR bit being set.  The result was Xen "exceeding" a 300 second
> timeout within 1 second.

Sounds like the CPU is indeed stuck, and the watchdog has just optimized
away the 5 minutes of back-to-back NMIs. :)

Handling this case it nice, but I wonder whether this patch ought to
detect and report ludicrous NMI rates rather than silently ignoring
them.  I guess that's hard to do in an NMI handler, other than by
adjusting the printk when we crash.

Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24  7:09 ` Jan Beulich
@ 2013-05-24  9:57   ` Andrew Cooper
  2013-05-24 10:13     ` Tim Deegan
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24  9:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Tim (Xen.org), Keir (Xen.org), xen-devel@lists.xen.org

On 24/05/13 08:09, Jan Beulich wrote:
>>>> On 23.05.13 at 22:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
>> test system being debugged, I observed a PCI SERR being continuously 
>> asserted
>> without the SERR bit being set.  The result was Xen "exceeding" a 300 second
>> timeout within 1 second.
> This is a questionable rationale for the patch, no matter that
> conceptually I don't mind a change like this. On broken systems
> like this it may be more reasonable to require the watchdog (which
> is disabled by default anyway) to not get enabled.

In this particular system there is already a need for a BIOS fix (for
another issue), so adding a fix of the SERR bit to list is also doable.

I suppose that I should say that the reason for the patch is not because
*this* system entered that bad state, but because it is possible with a
sufficiently high rate of NMIs in general.

>
>> @@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r
>>      if ( (this_cpu(last_irq_sums) == sum) &&
>>           !atomic_read(&watchdog_disable_count) )
>>      {
>> -        /*
>> -         * Ayiee, looks like this CPU is stuck ... wait for the timeout
>> -         * before doing the oops ...
>> -         */
>> -        this_cpu(alert_counter)++;
>> -        if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
>> +        s_time_t last_change = this_cpu(last_irq_change);
>> +
>> +        if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) )
> You can't use NOW() here - while the time updating code is safe
> against normal interrupts, it's not atomic wrt NMIs.

But NMIs are latched at the hardware level.  If we get a nested NMI the
Xen will be toast on the exit path anyway.

We dont currently detect that yet because I have not had sufficient time
to complete my reentrant issues patch series yet.

>
> Since you don't really need this calculation to be very precise,
> doing it using plain TSC values might be acceptable, with the one
> caveat that you'd need to check that doing this in the middle of a
> TSC update (by time_calibration_tsc_rendezvous()) is still safe
> and sufficiently precise.

I dont see how using raw TSC values differs from using NOW() when it
comes to racing with time_calibration_tsc_rendezvous().  NOW() has an
rdtsc in it.

>From a quick eyeball of the code, I cant spot any errors which would
occur.  time_calibration_tsc_rendezvous() already has the possibility of
an NMI interrupting it which might skew the calculation.  On the other
hand, NOW() is read-only with respect to any system state

The only issue might be the write_tsc() causing the tsc to step
backwards, but that is already a potential problem using NOW() on either
side of a tsc_rendezvous() anyway.

~Andrew

>
> The only other (non-generic) alternative I see is to use the HPET
> main counter if 64-bit capable _and_ we ran it as 64-bit counter.
> But that would neither cover all machines nor be backportable
> (because of 32-bit's constraints).
>
> Jan
>
>>          {
>> +            /* Ayiee, looks like this CPU is stuck. */
>> +
>>              console_force_unlock();
>>              printk("Watchdog timer detects that CPU%d is stuck!\n",
>>                     smp_processor_id());
>>              fatal_trap(TRAP_nmi, regs);
>>          }
>> -    } 
>> -    else 
>> +    }
>> +    else
>>      {
>>          this_cpu(last_irq_sums) = sum;
>> -        this_cpu(alert_counter) = 0;
>> +        this_cpu(last_irq_change) = NOW();
>>      }
>>  
>>      if ( nmi_perfctr_msr )
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24  9:37 ` Tim Deegan
@ 2013-05-24 10:03   ` Andrew Cooper
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24 10:03 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

On 24/05/13 10:37, Tim Deegan wrote:
> At 21:32 +0100 on 23 May (1369344726), Andrew Cooper wrote:
>> Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
>> test system being debugged, I observed a PCI SERR being continuously asserted
>> without the SERR bit being set.  The result was Xen "exceeding" a 300 second
>> timeout within 1 second.
> Sounds like the CPU is indeed stuck, and the watchdog has just optimized
> away the 5 minutes of back-to-back NMIs. :)
>
> Handling this case it nice, but I wonder whether this patch ought to
> detect and report ludicrous NMI rates rather than silently ignoring
> them.  I guess that's hard to do in an NMI handler, other than by
> adjusting the printk when we crash.
>
> Tim.

Actually I suspect the system was livelocked with PCI SERRs being issued
from a PCIe switch.  I only have second granularity on the serial
console, but can confirm that cpu0 was perfectly alive and well within
the same second as the watchdog supposedly expiring.

I was considering trying to work around a ludicrous rate of interrupts,
but decided to go for the easier patch first

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24  9:57   ` Andrew Cooper
@ 2013-05-24 10:13     ` Tim Deegan
  2013-05-24 10:33       ` Andrew Cooper
  2013-05-24 11:36       ` Jan Beulich
  0 siblings, 2 replies; 19+ messages in thread
From: Tim Deegan @ 2013-05-24 10:13 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
> On 24/05/13 08:09, Jan Beulich wrote:
> > You can't use NOW() here - while the time updating code is safe
> > against normal interrupts, it's not atomic wrt NMIs.
> 
> But NMIs are latched at the hardware level.  If we get a nested NMI the
> Xen will be toast on the exit path anyway.

The problem is that an NMI can arrive while local_time_calibration() is
writing its results, so calling NOW() in the NMI handler might return
garbage. 

> > Handling this case it nice, but I wonder whether this patch ought to
> > detect and report ludicrous NMI rates rather than silently ignoring
> > them.  I guess that's hard to do in an NMI handler, other than by
> > adjusting the printk when we crash.

Actually on second thoughts it's easier: as well as having this patch
(or near equivalent) to avoid premature watchdog expiry, we cna detect
the NMI rate in, say, the timer softirq and report if it's gone mad.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 10:13     ` Tim Deegan
@ 2013-05-24 10:33       ` Andrew Cooper
  2013-05-24 11:42         ` Jan Beulich
  2013-05-24 11:36       ` Jan Beulich
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24 10:33 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

On 24/05/13 11:13, Tim Deegan wrote:
> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
>> On 24/05/13 08:09, Jan Beulich wrote:
>>> You can't use NOW() here - while the time updating code is safe
>>> against normal interrupts, it's not atomic wrt NMIs.
>> But NMIs are latched at the hardware level.  If we get a nested NMI the
>> Xen will be toast on the exit path anyway.
> The problem is that an NMI can arrive while local_time_calibration() is
> writing its results, so calling NOW() in the NMI handler might return
> garbage. 

Aah - I see.  Sorry - I misunderstood the original point.

Yes - that is an issue.

Two solutions come to mind.

1) Along with the local_irq_disable()/enable() pairs in
local_time_calibration, having an atomic_t indicating "time data update
in progress", allowing the NMI handler to decide to bail early.

2) Modify local_time_calibration() to fill in a shadow cpu_time set, and
a different atomic_t to indicate which one is consistent.  This would
allow the NMI handler to always use one consistent set of timing
information.

>
>>> Handling this case it nice, but I wonder whether this patch ought to
>>> detect and report ludicrous NMI rates rather than silently ignoring
>>> them.  I guess that's hard to do in an NMI handler, other than by
>>> adjusting the printk when we crash.
> Actually on second thoughts it's easier: as well as having this patch
> (or near equivalent) to avoid premature watchdog expiry, we cna detect
> the NMI rate in, say, the timer softirq and report if it's gone mad.
>
> Cheers,
>
> Tim.

I was thinking along that line, but had not yet worked out where to put
it.  That looks like the best place.

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 10:13     ` Tim Deegan
  2013-05-24 10:33       ` Andrew Cooper
@ 2013-05-24 11:36       ` Jan Beulich
  2013-05-24 12:41         ` Tim Deegan
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2013-05-24 11:36 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan; +Cc: Keir (Xen.org), xen-devel@lists.xen.org

>>> On 24.05.13 at 12:13, Tim Deegan <tim@xen.org> wrote:
>> > Handling this case it nice, but I wonder whether this patch ought to
>> > detect and report ludicrous NMI rates rather than silently ignoring
>> > them.  I guess that's hard to do in an NMI handler, other than by
>> > adjusting the printk when we crash.
> 
> Actually on second thoughts it's easier: as well as having this patch
> (or near equivalent) to avoid premature watchdog expiry, we cna detect
> the NMI rate in, say, the timer softirq and report if it's gone mad.

I may not get how you imagine this to work, but why would you
assume the timer softirq gets any chance to run anymore with an
absurd NMI rate (or at least get run enough times to prevent the
watchdog to fire)?

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 10:33       ` Andrew Cooper
@ 2013-05-24 11:42         ` Jan Beulich
  2013-05-24 12:00           ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2013-05-24 11:42 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan; +Cc: Keir (Xen.org), xen-devel@lists.xen.org

>>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 24/05/13 11:13, Tim Deegan wrote:
>> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
>>> On 24/05/13 08:09, Jan Beulich wrote:
>>>> You can't use NOW() here - while the time updating code is safe
>>>> against normal interrupts, it's not atomic wrt NMIs.
>>> But NMIs are latched at the hardware level.  If we get a nested NMI the
>>> Xen will be toast on the exit path anyway.
>> The problem is that an NMI can arrive while local_time_calibration() is
>> writing its results, so calling NOW() in the NMI handler might return
>> garbage. 
> 
> Aah - I see.  Sorry - I misunderstood the original point.
> 
> Yes - that is an issue.
> 
> Two solutions come to mind.
> 
> 1) Along with the local_irq_disable()/enable() pairs in
> local_time_calibration, having an atomic_t indicating "time data update
> in progress", allowing the NMI handler to decide to bail early.
> 
> 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and
> a different atomic_t to indicate which one is consistent.  This would
> allow the NMI handler to always use one consistent set of timing
> information.

What's the advantage of either over using the plain TSC values?
Accuracy of the expiry, but the accuracy here doesn't matter
much (as long as its not off by an order of a magnitude), and
would be achieved by some presumably not very neat code of no
other (general) use.

And if picking up one of these, then 2 seems the preferable one to
me.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 11:42         ` Jan Beulich
@ 2013-05-24 12:00           ` Andrew Cooper
  2013-05-24 13:11             ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24 12:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Keir (Xen.org), Tim (Xen.org), xen-devel@lists.xen.org

On 24/05/13 12:42, Jan Beulich wrote:
>>>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 24/05/13 11:13, Tim Deegan wrote:
>>> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
>>>> On 24/05/13 08:09, Jan Beulich wrote:
>>>>> You can't use NOW() here - while the time updating code is safe
>>>>> against normal interrupts, it's not atomic wrt NMIs.
>>>> But NMIs are latched at the hardware level.  If we get a nested NMI the
>>>> Xen will be toast on the exit path anyway.
>>> The problem is that an NMI can arrive while local_time_calibration() is
>>> writing its results, so calling NOW() in the NMI handler might return
>>> garbage. 
>> Aah - I see.  Sorry - I misunderstood the original point.
>>
>> Yes - that is an issue.
>>
>> Two solutions come to mind.
>>
>> 1) Along with the local_irq_disable()/enable() pairs in
>> local_time_calibration, having an atomic_t indicating "time data update
>> in progress", allowing the NMI handler to decide to bail early.
>>
>> 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and
>> a different atomic_t to indicate which one is consistent.  This would
>> allow the NMI handler to always use one consistent set of timing
>> information.
> What's the advantage of either over using the plain TSC values?
> Accuracy of the expiry, but the accuracy here doesn't matter
> much (as long as its not off by an order of a magnitude), and
> would be achieved by some presumably not very neat code of no
> other (general) use.
>
> And if picking up one of these, then 2 seems the preferable one to
> me.
>
> Jan
>

The TSC scale is recalculated as part of local_time_calibration().  I
couldn't spot any other sensible way of converting a TSC delta to
something approximating seconds.  Have I missed something?

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 11:36       ` Jan Beulich
@ 2013-05-24 12:41         ` Tim Deegan
  2013-05-24 12:48           ` Andrew Cooper
  2013-05-24 13:17           ` Jan Beulich
  0 siblings, 2 replies; 19+ messages in thread
From: Tim Deegan @ 2013-05-24 12:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Keir (Xen.org), xen-devel@lists.xen.org

At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote:
> >>> On 24.05.13 at 12:13, Tim Deegan <tim@xen.org> wrote:
> >> > Handling this case it nice, but I wonder whether this patch ought to
> >> > detect and report ludicrous NMI rates rather than silently ignoring
> >> > them.  I guess that's hard to do in an NMI handler, other than by
> >> > adjusting the printk when we crash.
> > 
> > Actually on second thoughts it's easier: as well as having this patch
> > (or near equivalent) to avoid premature watchdog expiry, we cna detect
> > the NMI rate in, say, the timer softirq and report if it's gone mad.
> 
> I may not get how you imagine this to work, but why would you
> assume the timer softirq gets any chance to run anymore with an
> absurd NMI rate (or at least get run enough times to prevent the
> watchdog to fire)?

In that case, the watchdog will fire, if we can add something to the
watchdog printk if the NMI count is mugh higher than (timeout * nmi_hz).
I was worrying about the case where there are lots of NMIs but not so
many that the system livelocks entirely.  It would be nice to have some
non-fatal warning that something is wrong.  A one-time (or at least very
limited) printk in the softirq would do that for not much extra cost.

At 12:42 +0100 on 24 May (1369399344), Jan Beulich wrote:
> >>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > On 24/05/13 11:13, Tim Deegan wrote:
> >> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
> >>> On 24/05/13 08:09, Jan Beulich wrote:
> >>>> You can't use NOW() here - while the time updating code is safe
> >>>> against normal interrupts, it's not atomic wrt NMIs.
> >>> But NMIs are latched at the hardware level.  If we get a nested NMI the
> >>> Xen will be toast on the exit path anyway.
> >> The problem is that an NMI can arrive while local_time_calibration() is
> >> writing its results, so calling NOW() in the NMI handler might return
> >> garbage. 
> > 
> > Aah - I see.  Sorry - I misunderstood the original point.
> > 
> > Yes - that is an issue.
> > 
> > Two solutions come to mind.
> > 
> > 1) Along with the local_irq_disable()/enable() pairs in
> > local_time_calibration, having an atomic_t indicating "time data update
> > in progress", allowing the NMI handler to decide to bail early.
> > 
> > 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and
> > a different atomic_t to indicate which one is consistent.  This would
> > allow the NMI handler to always use one consistent set of timing
> > information.

Of those two, I prefer (1), just because it doesn't add any cost to the
normal users of NOW().

Using TSC to gate the actual watchdog crash might get a bit messy,
especially if it ends up adding code to the users of write_tsc().  But
all we need is to double-check the existing count -- I think it would be
enough to say "the watchdog fired but it looks like it was caused by an
NMI storm" and crash anyway.  (At least assuming timeout * nmi_hz is a a
largish number.)  And nmi_watchdog_tick() can just check regs->eip as a
hint not to trust the scale factors. :)

Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 12:41         ` Tim Deegan
@ 2013-05-24 12:48           ` Andrew Cooper
  2013-05-24 13:55             ` Tim Deegan
  2013-05-24 13:17           ` Jan Beulich
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24 12:48 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

On 24/05/13 13:41, Tim Deegan wrote:
> At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote:
>>>>> On 24.05.13 at 12:13, Tim Deegan <tim@xen.org> wrote:
>>>>> Handling this case it nice, but I wonder whether this patch ought to
>>>>> detect and report ludicrous NMI rates rather than silently ignoring
>>>>> them.  I guess that's hard to do in an NMI handler, other than by
>>>>> adjusting the printk when we crash.
>>> Actually on second thoughts it's easier: as well as having this patch
>>> (or near equivalent) to avoid premature watchdog expiry, we cna detect
>>> the NMI rate in, say, the timer softirq and report if it's gone mad.
>> I may not get how you imagine this to work, but why would you
>> assume the timer softirq gets any chance to run anymore with an
>> absurd NMI rate (or at least get run enough times to prevent the
>> watchdog to fire)?
> In that case, the watchdog will fire, if we can add something to the
> watchdog printk if the NMI count is mugh higher than (timeout * nmi_hz).
> I was worrying about the case where there are lots of NMIs but not so
> many that the system livelocks entirely.  It would be nice to have some
> non-fatal warning that something is wrong.  A one-time (or at least very
> limited) printk in the softirq would do that for not much extra cost.
>
> At 12:42 +0100 on 24 May (1369399344), Jan Beulich wrote:
>>>>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> On 24/05/13 11:13, Tim Deegan wrote:
>>>> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
>>>>> On 24/05/13 08:09, Jan Beulich wrote:
>>>>>> You can't use NOW() here - while the time updating code is safe
>>>>>> against normal interrupts, it's not atomic wrt NMIs.
>>>>> But NMIs are latched at the hardware level.  If we get a nested NMI the
>>>>> Xen will be toast on the exit path anyway.
>>>> The problem is that an NMI can arrive while local_time_calibration() is
>>>> writing its results, so calling NOW() in the NMI handler might return
>>>> garbage. 
>>> Aah - I see.  Sorry - I misunderstood the original point.
>>>
>>> Yes - that is an issue.
>>>
>>> Two solutions come to mind.
>>>
>>> 1) Along with the local_irq_disable()/enable() pairs in
>>> local_time_calibration, having an atomic_t indicating "time data update
>>> in progress", allowing the NMI handler to decide to bail early.
>>>
>>> 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and
>>> a different atomic_t to indicate which one is consistent.  This would
>>> allow the NMI handler to always use one consistent set of timing
>>> information.
> Of those two, I prefer (1), just because it doesn't add any cost to the
> normal users of NOW().
>
> Using TSC to gate the actual watchdog crash might get a bit messy,
> especially if it ends up adding code to the users of write_tsc().  But
> all we need is to double-check the existing count -- I think it would be
> enough to say "the watchdog fired but it looks like it was caused by an
> NMI storm" and crash anyway.  (At least assuming timeout * nmi_hz is a a
> largish number.)  And nmi_watchdog_tick() can just check regs->eip as a
> hint not to trust the scale factors. :)
>
> Tim.

I was not planning to make any requirement to change users of NOW(). 
The only consumers which need to care about the shadow set or "update in
progress" flag would be the NMI and MCE handlers, being the only ones
which can interrupt the current structure update.  The shadow set can be
neatly hidden away from anyone who doesn't wish to know.

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 12:00           ` Andrew Cooper
@ 2013-05-24 13:11             ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2013-05-24 13:11 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Tim (Xen.org), Keir (Xen.org), xen-devel@lists.xen.org

>>> On 24.05.13 at 14:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> The TSC scale is recalculated as part of local_time_calibration().  I
> couldn't spot any other sensible way of converting a TSC delta to
> something approximating seconds.  Have I missed something?

Since we're not after a precise amount of time, using cpu_khz as
basis and accepting that the period until the watchdog may fire
could be, say, twice the nominal amount doesn't seem that bad
to me.

Or, if we're going the double checking route, retaining the current
counting and simply using way too low a TSC delta to alter the
printed message, largely like what Tim suggested, could be an
option. In that case, even a factor 10 mis-scaling of the TSC count
wouldn't hurt. The only real problem would be when the TSC stops
counting while in deep C states, but that again is a non-issue here
because you can't at the same time get NMI storms and be in deep
C states permanently.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 12:41         ` Tim Deegan
  2013-05-24 12:48           ` Andrew Cooper
@ 2013-05-24 13:17           ` Jan Beulich
  2013-05-24 14:01             ` Tim Deegan
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2013-05-24 13:17 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Andrew Cooper, Keir (Xen.org), xen-devel@lists.xen.org

>>> On 24.05.13 at 14:41, Tim Deegan <tim@xen.org> wrote:
> At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote:
>> > 1) Along with the local_irq_disable()/enable() pairs in
>> > local_time_calibration, having an atomic_t indicating "time data update
>> > in progress", allowing the NMI handler to decide to bail early.
>> > 
>> > 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and
>> > a different atomic_t to indicate which one is consistent.  This would
>> > allow the NMI handler to always use one consistent set of timing
>> > information.
> 
> Of those two, I prefer (1), just because it doesn't add any cost to the
> normal users of NOW().

The reason I dislike 1 is because you then have however small a
probability of many/all NMI instance just happening while the time
gets updated, resulting in all of them bailing early, and the
watchdog never firing.

> Using TSC to gate the actual watchdog crash might get a bit messy,
> especially if it ends up adding code to the users of write_tsc().

The only problematic write_tsc() user is that to recover from a
stopped counter in deep C states. This is not a meaningful problem
because - as just said in another reply - NMI storms and long times
spent in deep C states exclude each other.

> And nmi_watchdog_tick() can just check regs->eip as a
> hint not to trust the scale factors. :)

By doing a range check looking for it to point into some function?
How would that cope with LTO?

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 12:48           ` Andrew Cooper
@ 2013-05-24 13:55             ` Tim Deegan
  2013-05-24 14:29               ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Tim Deegan @ 2013-05-24 13:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote:
> On 24/05/13 13:41, Tim Deegan wrote:
> > Of those two, I prefer (1), just because it doesn't add any cost to the
> > normal users of NOW().
> 
> I was not planning to make any requirement to change users of NOW(). 

Well, you were planning to make NOW() slightly more expensive by needing
to look up which of the banekd alternatives is valid.  In any case, I
think some sort of approximate version based on tsc will do.

Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 13:17           ` Jan Beulich
@ 2013-05-24 14:01             ` Tim Deegan
  0 siblings, 0 replies; 19+ messages in thread
From: Tim Deegan @ 2013-05-24 14:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Keir (Xen.org), xen-devel@lists.xen.org

At 14:17 +0100 on 24 May (1369405052), Jan Beulich wrote:
> > Using TSC to gate the actual watchdog crash might get a bit messy,
> > especially if it ends up adding code to the users of write_tsc().
> 
> The only problematic write_tsc() user is that to recover from a
> stopped counter in deep C states. This is not a meaningful problem
> because - as just said in another reply - NMI storms and long times
> spent in deep C states exclude each other.

Yep.  But we could go into a C-state and be woken by an NMI storm,
meaning that our 'before' TSC value would be immediately wrong.  The
more I think about this the less I think we should be trying to measure
elapsed time except to change the printout.

> > And nmi_watchdog_tick() can just check regs->eip as a
> > hint not to trust the scale factors. :)
> 
> By doing a range check looking for it to point into some function?
> How would that cope with LTO?

Unreliably. :)  I imagine in most cases it would be fine, since that
function has its address taken, but unless we called out to an explicit
.S routine to do the update it ouwld be at best a hint. 

Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 13:55             ` Tim Deegan
@ 2013-05-24 14:29               ` Andrew Cooper
  2013-05-24 17:10                 ` Tim Deegan
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24 14:29 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

On 24/05/13 14:55, Tim Deegan wrote:
> At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote:
>> On 24/05/13 13:41, Tim Deegan wrote:
>>> Of those two, I prefer (1), just because it doesn't add any cost to the
>>> normal users of NOW().
>> I was not planning to make any requirement to change users of NOW(). 
> Well, you were planning to make NOW() slightly more expensive by needing
> to look up which of the banekd alternatives is valid.  In any case, I
> think some sort of approximate version based on tsc will do.
>
> Tim.

I was planning to memcpy the shadow set over the main set as part of
calibration, leaving no alteration whatsoever to NOW().

An approximation from the TSC alone would be better so long as it is a
reasonable approximation.  I am concerned about how accuate a dumb
approximation would be for non-stable TSCs etc.

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 14:29               ` Andrew Cooper
@ 2013-05-24 17:10                 ` Tim Deegan
  2013-05-24 17:27                   ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Tim Deegan @ 2013-05-24 17:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

At 15:29 +0100 on 24 May (1369409374), Andrew Cooper wrote:
> On 24/05/13 14:55, Tim Deegan wrote:
> > At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote:
> >> On 24/05/13 13:41, Tim Deegan wrote:
> >>> Of those two, I prefer (1), just because it doesn't add any cost to the
> >>> normal users of NOW().
> >> I was not planning to make any requirement to change users of NOW(). 
> > Well, you were planning to make NOW() slightly more expensive by needing
> > to look up which of the banekd alternatives is valid.  In any case, I
> > think some sort of approximate version based on tsc will do.
> >
> > Tim.
> 
> I was planning to memcpy the shadow set over the main set as part of
> calibration, leaving no alteration whatsoever to NOW().

Sorry, yes, I see how that works now.  And so I too prefer (2). :)

> An approximation from the TSC alone would be better so long as it is a
> reasonable approximation.  I am concerned about how accuate a dumb
> approximation would be for non-stable TSCs etc.

Yep.  I'm more and more convinced that we should gate on the number of
NMIs we've taken without seeing a timer tick.  I'm more afratid of funny
TSC edge cases (and remember we might take an NMI anywhere in the s3
wakeup) than I am of machines with really bad NMI storms.  So even if
the approximate time is wildly off we just print the wrong thing. 

In the case where you saw this (and cpu0 was alive for a while before
it managed a burst of enough NMIs), would detecting and warning
about high NMI rates be enough to point out what's gone wrong?

Tim.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
  2013-05-24 17:10                 ` Tim Deegan
@ 2013-05-24 17:27                   ` Andrew Cooper
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Cooper @ 2013-05-24 17:27 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Keir (Xen.org), Jan Beulich, xen-devel@lists.xen.org

On 24/05/13 18:10, Tim Deegan wrote:
> At 15:29 +0100 on 24 May (1369409374), Andrew Cooper wrote:
>> On 24/05/13 14:55, Tim Deegan wrote:
>>> At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote:
>>>> On 24/05/13 13:41, Tim Deegan wrote:
>>>>> Of those two, I prefer (1), just because it doesn't add any cost to the
>>>>> normal users of NOW().
>>>> I was not planning to make any requirement to change users of NOW(). 
>>> Well, you were planning to make NOW() slightly more expensive by needing
>>> to look up which of the banekd alternatives is valid.  In any case, I
>>> think some sort of approximate version based on tsc will do.
>>>
>>> Tim.
>> I was planning to memcpy the shadow set over the main set as part of
>> calibration, leaving no alteration whatsoever to NOW().
> Sorry, yes, I see how that works now.  And so I too prefer (2). :)
>
>> An approximation from the TSC alone would be better so long as it is a
>> reasonable approximation.  I am concerned about how accuate a dumb
>> approximation would be for non-stable TSCs etc.
> Yep.  I'm more and more convinced that we should gate on the number of
> NMIs we've taken without seeing a timer tick.  I'm more afratid of funny
> TSC edge cases (and remember we might take an NMI anywhere in the s3
> wakeup) than I am of machines with really bad NMI storms.  So even if
> the approximate time is wildly off we just print the wrong thing. 
>
> In the case where you saw this (and cpu0 was alive for a while before
> it managed a burst of enough NMIs), would detecting and warning
> about high NMI rates be enough to point out what's gone wrong?
>
> Tim.

Not directly, butb or my debugging case knowing when the NMI storm has
started is very useful so I can dump lspci -vvvxxxx to get the debug
state from whichever PCI device is the original cause of the SERR storm.

I certainly don't think there is anything useful Xen could automatically
do when discovering an NMI storm, but leaving a message on the serial or
in a crash state certainly helps someone trying to investigate why the
server reset.  It is certainly better than finding an NMI watchdog
timeout with serial timestamps proving that the watchdog didn't actually
time out :), and starting debugging wondering WTF was going on.

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-05-24 17:27 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-23 20:32 [PATCH] x86/watchdog: Use real timestamps for watchdog timeout Andrew Cooper
2013-05-24  7:09 ` Jan Beulich
2013-05-24  9:57   ` Andrew Cooper
2013-05-24 10:13     ` Tim Deegan
2013-05-24 10:33       ` Andrew Cooper
2013-05-24 11:42         ` Jan Beulich
2013-05-24 12:00           ` Andrew Cooper
2013-05-24 13:11             ` Jan Beulich
2013-05-24 11:36       ` Jan Beulich
2013-05-24 12:41         ` Tim Deegan
2013-05-24 12:48           ` Andrew Cooper
2013-05-24 13:55             ` Tim Deegan
2013-05-24 14:29               ` Andrew Cooper
2013-05-24 17:10                 ` Tim Deegan
2013-05-24 17:27                   ` Andrew Cooper
2013-05-24 13:17           ` Jan Beulich
2013-05-24 14:01             ` Tim Deegan
2013-05-24  9:37 ` Tim Deegan
2013-05-24 10:03   ` Andrew Cooper

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).