[PATCH] strictly increasing hvm guest time

* [PATCH] strictly increasing  hvm guest time
@ 2008-07-02 16:03 Dan Magenheimer
  2008-07-02 16:07 ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-02 16:03 UTC (permalink / raw)
  To: Xen-Devel (E-mail), Keir Fraser; +Cc: Dave Winchell

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]

This simple one-line patch changes hvm guest time from
monotonically non-decreasing to monotonically strictly-
increasing.  As a result, two consecutive reads of the
(virtual) hpet will never return the same value, thus
avoiding the appearance that time has stopped (which may
occur if there is skew between physical processor TSCs).

The only problem scenario I can see is if:

1) N = number of physical CPUs on system
2) T = time in nsec of fastest call P that an hvm guest can
   make that indirectly invokes hvm_get_guest_time()
3) N>T (highly unlikely)
4) guests on all N physical CPUs are continuously
   calling P (also highly unlikely)

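then guest time could advance faster than Xen system
time, since each of those calls may bump guest time by
1 nsec, i.e. by up to N nsec for every T nsec of real time.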
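
A minimal sketch of the clamping logic, pulled out of Xen so
it can be compiled on its own.  The struct and function names
(guest_clock, guest_time_read) are made up for illustration,
'now' stands in for get_s_time() plus the offset, and the
locking is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the last_guest_time field of struct pl_time. */
struct guest_clock {
    uint64_t last_guest_time;   /* last value handed out, in nsec */
};

/*
 * Mirrors the patched branch: a reading that is behind the last value
 * handed out is replaced by last_guest_time + 1 instead of repeating it.
 * A reading exactly equal to last_guest_time still takes the first
 * branch and is returned unchanged.
 */
static uint64_t guest_time_read(struct guest_clock *clk, uint64_t now)
{
    assert(clk != NULL);
    if ( (int64_t)(now - clk->last_guest_time) >= 0 )
        clk->last_guest_time = now;
    else
        now = ++clk->last_guest_time;
    return now;
}
```

With readings of 100, 90 and 95 nsec arriving in that order,
this hands out 100, 101 and 102: each stale reading pushes
guest time forward by 1 nsec, which is where the acceleration
in the scenario above would come from.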

Dan

===================================
Thanks... for the memory
I really could use more / My throughput's on the floor
The balloon is flat / My swap disk's fat / I've OOM's in store
Overcommitted so much
(with apologies to the late great Bob Hope)

[-- Attachment #2: hvmmono.patch --]
[-- Type: application/octet-stream, Size: 483 bytes --]

diff -r 08f77df14cba xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/hvm/vpt.c	Wed Jul 02 09:46:40 2008 -0600
@@ -47,7 +47,7 @@ u64 hvm_get_guest_time(struct vcpu *v)
     if ( (int64_t)(now - pl->last_guest_time) >= 0 )
         pl->last_guest_time = now;
     else
-        now = pl->last_guest_time;
+        now = ++pl->last_guest_time;
     spin_unlock(&pl->pl_time_lock);

     return now + v->arch.hvm_vcpu.stime_offset;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] strictly increasing  hvm guest time
  2008-07-02 16:03 [PATCH] strictly increasing hvm guest time Dan Magenheimer
@ 2008-07-02 16:07 ` Keir Fraser
  2008-07-02 21:50   ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-02 16:07 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 2/7/08 17:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> This simple one-line patch changes hvm guest time from
> monotonically non-decreasing to monotonically strictly-
> increasing.  As a result, two consecutive reads of the
> (virtual) hpet will never return the same value, thus
> avoiding the appearance that time has stopped (which may
> occur if there is skew between physical processor TSCs).

It does seem a little hack-ish, if we don't know of any issues arising from
the current code, and we expect cross-cpu deltas to be pretty small. Also
guests will often convert HPET reads to well-known units (e.g.,
microseconds, milliseconds) before using them, in which case even a delta of
one may not result in differing converted time values.
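
As a rough illustration of that last point, assuming the guest
treats the raw reading as nanoseconds and truncates to
microseconds (the helper names here are hypothetical, not taken
from any guest kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Truncating conversion, as a guest might do before using a timestamp. */
static uint64_t ns_to_us(uint64_t ns)
{
    return ns / 1000;
}

/*
 * Returns 1 if two raw readings (b_ns taken after a_ns) are still
 * distinguishable once both have been converted to microseconds.
 */
static int differs_after_conversion(uint64_t a_ns, uint64_t b_ns)
{
    assert(b_ns >= a_ns);
    return ns_to_us(a_ns) != ns_to_us(b_ns);
}
```

Readings of 1000 and 1001 both convert to 1 microsecond, so the
extra nanosecond is invisible; only a pair that straddles a
microsecond boundary, such as 1999 and 2000, still differs.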

 -- Keir

* RE: [PATCH] strictly increasing  hvm guest time
  2008-07-02 16:07 ` Keir Fraser
@ 2008-07-02 21:50   ` Dan Magenheimer
  2008-07-02 22:41     ` [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time) Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-02 21:50 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> > This simple one-line patch changes hvm guest time from
> > monotonically non-decreasing to monotonically strictly-
> > increasing.  As a result, two consecutive reads of the
> > (virtual) hpet will never return the same value, thus
> > avoiding the appearance that time has stopped (which may
> > occur if there is skew between physical processor TSCs).
> 
> It does seem a little hack-ish, if we don't know of any 
> issues arising from
> the current code, and we expect cross-cpu deltas to be pretty 
> small.

Using "xm debug-key t; xm dmesg | tail -1" you can get an idea
of the deltas.  Even on my single-socket dual-core recent-vintage
Intel box, I'm frequently seeing Diffs > 300ns.  While this
is still relatively small (and part of it may be SMP cache
synchronization time), this is supposed to be a "good TSC"
box.

I'm spinning a small patch that captures the maximum so that
it can be output via debug-key t also.

> Also
> guests will often convert HPET reads to well-known units (e.g.,
> microseconds, milliseconds) before using them, in which case 
> even a delta of
> one may not result in differing converted time values.

Yes, but most newer Linux systems have a high-res timer API
that returns nanoseconds, though admittedly it is not widely
used yet.
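
For reference, a sketch of what such a nanosecond read could
look like from guest userspace, assuming the API in question is
something like POSIX clock_gettime() (an assumption; whether
the call ends up reading the virtual hpet depends on the
guest's clocksource, and read_monotonic_ns is a made-up name):

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC and return it as a single nanosecond count. */
static uint64_t read_monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;   /* keep the compiler quiet if asserts are compiled out */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
```

A caller that reads this twice in a row and compares the two
values at full resolution is the kind of consumer that would
notice time standing still.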

Thanks,
Dan

* [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-02 21:50   ` Dan Magenheimer
@ 2008-07-02 22:41     ` Dan Magenheimer
  2008-07-03  8:03       ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-02 22:41 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

[-- Attachment #1: Type: text/plain, Size: 612 bytes --]

> Subject: [Xen-devel] RE: [PATCH] strictly increasing hvm guest time
> 
> I'm spinning a small patch capturing the maximum so that can
> be output via debug-key t also.

Attached is the patch.  Interestingly, on my single-socket
two-core recent-vintage Intel processor, this patch reports
a max skew of >13 usec, much higher than the values I'm
seeing from "xm debug-key t".  I wonder if this is due to
a mistake in my patch (though I don't see it) or if the
various stime error corrections are not converging as
expected, resulting in a broader stime skew between
processors than expected?

Dan

[-- Attachment #2: maxskew.patch --]
[-- Type: application/octet-stream, Size: 2649 bytes --]

diff -r 08f77df14cba xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/hvm/vpt.c	Wed Jul 02 16:34:33 2008 -0600
@@ -47,7 +47,7 @@ u64 hvm_get_guest_time(struct vcpu *v)
     if ( (int64_t)(now - pl->last_guest_time) >= 0 )
         pl->last_guest_time = now;
     else
-        now = pl->last_guest_time;
+        now = ++pl->last_guest_time;
     spin_unlock(&pl->pl_time_lock);
 
     return now + v->arch.hvm_vcpu.stime_offset;
diff -r 08f77df14cba xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/time.c	Wed Jul 02 16:34:33 2008 -0600
@@ -69,6 +69,9 @@ static DEFINE_PER_CPU(struct cpu_time, c
 
 /* TSC is invariant on C state entry? */
 static bool_t tsc_invariant;
+
+/* record maximum skew to report with debug-key t */
+u64 max_stime_skew = 0;
 
 /*
  * We simulate a 32-bit platform timer from the 16-bit PIT ch2 counter.
@@ -845,6 +848,23 @@ static void local_time_calibration(void 
     rdtscll(curr_tsc);
     local_irq_enable();
 
+    /*
+     * Record maximum stime skew from master processor.  Note that
+     * in the case of a fast local clock, skew reflects the post-adjusted
+     * skew (see below and get_s_time()), not the actual skew.  Also
+     * note that some processors may skew positive and others negative
+     * relative to master so skew between ANY pair of processors may be
+     * as much as 2x recorded max
+     */
+    if ( smp_processor_id() )
+    {
+        s64 curr_stime_skew = curr_master_stime - curr_local_stime;
+        if ( curr_stime_skew < 0 )
+            curr_stime_skew = - curr_stime_skew;
+        if ( curr_stime_skew > max_stime_skew )
+            max_stime_skew = curr_stime_skew;
+    }
+
 #if 0
     printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
            smp_processor_id(), prev_tsc, prev_local_stime, prev_master_stime);
diff -r 08f77df14cba xen/common/keyhandler.c
--- a/xen/common/keyhandler.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/common/keyhandler.c	Wed Jul 02 16:34:33 2008 -0600
@@ -251,6 +251,7 @@ static void read_clocks(unsigned char ke
     unsigned int cpu = smp_processor_id(), min_cpu, max_cpu;
     u64 min, max, dif, difus;
     static DEFINE_SPINLOCK(lock);
+    extern u64 max_stime_skew;
 
     spin_lock(&lock);
 
@@ -284,6 +285,7 @@ static void read_clocks(unsigned char ke
     printk("Min = %"PRIu64" ; Max = %"PRIu64" ; Diff = %"PRIu64
            " (%"PRIu64" microseconds)\n",
            min, max, dif, difus);
+    printk("Max recorded stime skew = %"PRIu64"ns\n", max_stime_skew);
 }
 
 extern void dump_runq(unsigned char key);

* Re: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-02 22:41     ` [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time) Dan Magenheimer
@ 2008-07-03  8:03       ` Keir Fraser
  2008-07-03 16:24         ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-03  8:03 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 2/7/08 23:41, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Attached is the patch.  Interestingly, on my single-socket
> two-core recent-vintage Intel processor, this patch reports
> a max skew of >13 usec, much higher than the values I'm
> seeing from "xm debug-key t".  I wonder if this is due to
> a mistake in my patch (though I don't see it) or if the
> various stime error corrections are not converging as
> expected, resulting in a broader stime skew between
> processors than expected?

Perhaps this relatively large skew happens at start of day, before the
periodic calibration has 'locked on'?

 -- Keir

* RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-03  8:03       ` Keir Fraser
@ 2008-07-03 16:24         ` Dan Magenheimer
  2008-07-03 16:35           ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-03 16:24 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

[-- Attachment #1: Type: text/plain, Size: 844 bytes --]

> > Attached is the patch.  Interestingly, on my single-socket
> > two-core recent-vintage Intel processor, this patch reports
> > a max skew of >13 usec, much higher than the values I'm
> > seeing from "xm debug-key t".  I wonder if this is due to
> > a mistake in my patch (though I don't see it) or if the
> > various stime error corrections are not converging as
> > expected, resulting in a broader stime skew between
> > processors than expected?
> 
> Perhaps this relatively large skew happens at start of day, before the
> periodic calibration has 'locked on'?

Indeed you are correct.  This updated patch now reports zero skew
as expected.
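
The idea behind the update, sketched standalone: ignore the
first few calibration samples, then track the largest absolute
difference between master and local time.  The skew_tracker
struct is hypothetical; the patch itself uses file-scope
variables and a static countdown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracker; the patch uses file-scope variables instead. */
struct skew_tracker {
    int      warmup_left;   /* samples still to ignore */
    uint64_t max_skew_ns;   /* largest |master - local| seen so far */
};

/* Feed one calibration sample (both times in nsec) into the tracker. */
static void skew_tracker_sample(struct skew_tracker *t,
                                int64_t master_ns, int64_t local_ns)
{
    int64_t skew;

    assert(t != NULL);
    if ( t->warmup_left > 0 )
    {
        t->warmup_left--;       /* let calibration converge first */
        return;
    }
    skew = master_ns - local_ns;
    if ( skew < 0 )
        skew = -skew;
    if ( (uint64_t)skew > t->max_skew_ns )
        t->max_skew_ns = (uint64_t)skew;
}
```

Any large start-of-day skew that falls inside the warm-up
window never reaches max_skew_ns, so the reported maximum only
reflects samples taken after that window.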

IMHO, it would be nice to put this patch into the tree as it
will be good for helping to diagnose time skew problems
such as the one just reported on the list.

Thanks,
Dan

[-- Attachment #2: maxskew2.patch --]
[-- Type: application/octet-stream, Size: 3838 bytes --]

diff -r 08f77df14cba xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/hvm/vpt.c	Thu Jul 03 10:10:03 2008 -0600
@@ -25,6 +25,8 @@
 #define mode_is(d, name) \
     ((d)->arch.hvm_domain.params[HVM_PARAM_TIMER_MODE] == HVMPTM_##name)
 
+u64 max_guest_time_skew = 0;
+
 void hvm_init_guest_time(struct domain *d)
 {
     struct pl_time *pl = &d->arch.hvm_domain.pl_time;
@@ -38,16 +40,22 @@ u64 hvm_get_guest_time(struct vcpu *v)
 {
     struct pl_time *pl = &v->domain->arch.hvm_domain.pl_time;
     u64 now;
+    int64_t skew;
 
     /* Called from device models shared with PV guests. Be careful. */
     ASSERT(is_hvm_vcpu(v));
 
     spin_lock(&pl->pl_time_lock);
     now = get_s_time() + pl->stime_offset;
-    if ( (int64_t)(now - pl->last_guest_time) >= 0 )
+    if ( ( skew = (int64_t)(now - pl->last_guest_time) ) >= 0 )
         pl->last_guest_time = now;
     else
+    {
+        skew = -skew;
+        if ( skew > max_guest_time_skew )
+            max_guest_time_skew = skew;
         now = pl->last_guest_time;
+    }
     spin_unlock(&pl->pl_time_lock);
 
     return now + v->arch.hvm_vcpu.stime_offset;
diff -r 08f77df14cba xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/time.c	Thu Jul 03 10:10:03 2008 -0600
@@ -69,6 +69,9 @@ static DEFINE_PER_CPU(struct cpu_time, c
 
 /* TSC is invariant on C state entry? */
 static bool_t tsc_invariant;
+
+/* record maximum skew to report with debug-key t */
+u64 max_stime_skew = 0;
 
 /*
  * We simulate a 32-bit platform timer from the 16-bit PIT ch2 counter.
@@ -831,6 +834,9 @@ static void local_time_calibration(void 
     /* The overall calibration scale multiplier. */
     u32 calibration_mul_frac;
 
+    /* ignore max skew calculation on first few iterations */
+    static int skip_max_skew_calc = 1000;
+
     prev_tsc          = t->local_tsc_stamp;
     prev_local_stime  = t->stime_local_stamp;
     prev_master_stime = t->stime_master_stamp;
@@ -844,6 +850,25 @@ static void local_time_calibration(void 
     curr_local_stime  = get_s_time();
     rdtscll(curr_tsc);
     local_irq_enable();
+
+    /*
+     * Record maximum stime skew from master processor.  Note that
+     * in the case of a fast local clock, skew reflects the post-adjusted
+     * skew (see below and get_s_time()), not the actual skew.  Also
+     * note that some processors may skew positive and others negative
+     * relative to master so skew between ANY pair of processors may be
+     * as much as 2x recorded max
+     */
+    if ( skip_max_skew_calc )
+        skip_max_skew_calc--;
+    else if ( smp_processor_id() )
+    {
+        s64 curr_stime_skew = curr_master_stime - curr_local_stime;
+        if ( curr_stime_skew < 0 )
+            curr_stime_skew = - curr_stime_skew;
+        if ( curr_stime_skew > max_stime_skew )
+            max_stime_skew = curr_stime_skew;
+    }
 
 #if 0
     printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
diff -r 08f77df14cba xen/common/keyhandler.c
--- a/xen/common/keyhandler.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/common/keyhandler.c	Thu Jul 03 10:10:03 2008 -0600
@@ -251,6 +251,7 @@ static void read_clocks(unsigned char ke
     unsigned int cpu = smp_processor_id(), min_cpu, max_cpu;
     u64 min, max, dif, difus;
     static DEFINE_SPINLOCK(lock);
+    extern u64 max_stime_skew, max_guest_time_skew;
 
     spin_lock(&lock);
 
@@ -284,6 +285,8 @@ static void read_clocks(unsigned char ke
     printk("Min = %"PRIu64" ; Max = %"PRIu64" ; Diff = %"PRIu64
            " (%"PRIu64" microseconds)\n",
            min, max, dif, difus);
+    printk("Max stime skew = %"PRIu64"ns; Max guest stoppage = %"PRIu64"ns\n",
+           max_stime_skew, max_guest_time_skew);
 }
 
 extern void dump_runq(unsigned char key);

* RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-03 16:24         ` Dan Magenheimer
@ 2008-07-03 16:35           ` Dan Magenheimer
  2008-07-03 20:03             ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-03 16:35 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of Dan
> Magenheimer
> Subject: [Xen-devel] RE: [PATCH] record max stime skew (was 
> 
> > Perhaps this relatively large skew happens at start of day, 
> before the
> > periodic calibration has 'locked on'?
> 
> Indeed you are correct.  This updated patch now reports zero skew
> as expected.
> 
> IMHO, it would be nice to put this patch into the tree as it
> will be good for helping to diagnose time skew problems
> such as the one just reported on the list.

Oops!  Just after I sent the above email, I checked again and
the same machine (no reboots, no guests ever launched) now reports
a max stime skew of 4333ns!!  Methinks there might be some
periodic glitch in the calibration code?

Dan

* RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-03 16:35           ` Dan Magenheimer
@ 2008-07-03 20:03             ` Dan Magenheimer
  2008-07-03 23:00               ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-03 20:03 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

[-- Attachment #1: Type: text/plain, Size: 1437 bytes --]

> > IMHO, it would be nice to put this patch into the tree as it
> > will be good for helping to diagnose time skew problems
> > such as the one just reported on the list.
> 
> Oops!  Just after I sent the above email, I checked again and
> the same machine (no reboots, no guests ever launched) now reports
> a max stime skew of 4333ns!!  Methinks there might be some
> periodic glitch in the calibration code?

OK this version records not only max but also a distribution
of skew.  (The code is a bit ugly... I thought about doing
something fancy with log-binary but decided a few base-10
ranges were clearer for a human to read.)
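
For comparison, the same base-10 ranges can be expressed as a
bucket index computed in a loop.  This is only a sketch of an
alternative layout, not what the attached patch does (the
patch uses one counter variable per range), and the names
skew_bucket and skew_histogram_add are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKEW_BUCKETS 6

/*
 * Bucket index for a skew in nsec:
 *   0: exactly 0     1: 1-9          2: 10-99
 *   3: 100-999       4: 1000-9999    5: 10000 and up
 */
static unsigned int skew_bucket(uint64_t skew_ns)
{
    unsigned int idx = 0;
    uint64_t limit = 1;

    /* Step up one decade at a time until skew_ns falls below the limit. */
    while ( idx < SKEW_BUCKETS - 1 && skew_ns >= limit )
    {
        idx++;
        limit *= 10;
    }
    return idx;
}

/* Count one sample into a histogram with SKEW_BUCKETS entries. */
static void skew_histogram_add(uint64_t *counts, uint64_t skew_ns)
{
    assert(counts != NULL);
    counts[skew_bucket(skew_ns)]++;
}
```

The bucket boundaries match the if/else chain in the patch:
0, <10, <100, <1000, <10000, and everything else.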

With this, I use "watch -d 'xm debug-key t; xm dmesg | tail -3'"
and can observe that (on my single-socket two-core recent-vintage
Intel box) roughly three-quarters of the skew measurements are
between 10-100nsec, roughly one-quarter are between 100ns-1us,
a couple percent are between 1us-10us and a few are >10us.

This represents an approximate distribution of how long an hvm
guest might observe time to be stopped (if it is able to repeatedly
read time values quickly enough).

So on some machines, this might be substantially worse than the
old hvm-platform-timer-built-on-tsc mechanism (though we had
no monotonicity constraint built into that).

I wonder if the >1us outliers are occurring only if the
processor has been idle for a while, vs entirely random.

Dan

[-- Attachment #2: maxskew3.patch --]
[-- Type: application/octet-stream, Size: 5382 bytes --]

diff -r 08f77df14cba xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/hvm/vpt.c	Thu Jul 03 13:29:54 2008 -0600
@@ -25,6 +25,8 @@
 #define mode_is(d, name) \
     ((d)->arch.hvm_domain.params[HVM_PARAM_TIMER_MODE] == HVMPTM_##name)
 
+u64 max_guest_time_skew = 0;
+
 void hvm_init_guest_time(struct domain *d)
 {
     struct pl_time *pl = &d->arch.hvm_domain.pl_time;
@@ -38,16 +40,22 @@ u64 hvm_get_guest_time(struct vcpu *v)
 {
     struct pl_time *pl = &v->domain->arch.hvm_domain.pl_time;
     u64 now;
+    int64_t skew;
 
     /* Called from device models shared with PV guests. Be careful. */
     ASSERT(is_hvm_vcpu(v));
 
     spin_lock(&pl->pl_time_lock);
     now = get_s_time() + pl->stime_offset;
-    if ( (int64_t)(now - pl->last_guest_time) >= 0 )
+    if ( ( skew = (int64_t)(now - pl->last_guest_time) ) >= 0 )
         pl->last_guest_time = now;
     else
+    {
+        skew = -skew;
+        if ( skew > max_guest_time_skew )
+            max_guest_time_skew = skew;
         now = pl->last_guest_time;
+    }
     spin_unlock(&pl->pl_time_lock);
 
     return now + v->arch.hvm_vcpu.stime_offset;
diff -r 08f77df14cba xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/time.c	Thu Jul 03 13:29:54 2008 -0600
@@ -69,6 +69,11 @@ static DEFINE_PER_CPU(struct cpu_time, c
 
 /* TSC is invariant on C state entry? */
 static bool_t tsc_invariant;
+
+/* record maximum skew and range to report with debug-key t */
+u64 max_stime_skew = 0;
+u64 stime_skew_zero_cnt = 0,stime_skew_10_cnt = 0, stime_skew_100_cnt = 0;
+u64 stime_skew_1000_cnt = 0,stime_skew_10000_cnt = 0, stime_skew_big_cnt = 0;
 
 /*
  * We simulate a 32-bit platform timer from the 16-bit PIT ch2 counter.
@@ -808,6 +813,7 @@ static void local_time_calibration(void 
      */
     s_time_t prev_local_stime, curr_local_stime;
     s_time_t prev_master_stime, curr_master_stime;
+    s_time_t curr_stime_skew;
 
     /* TSC timestamps taken during this calibration and prev calibration. */
     u64 prev_tsc, curr_tsc;
@@ -831,6 +837,9 @@ static void local_time_calibration(void 
     /* The overall calibration scale multiplier. */
     u32 calibration_mul_frac;
 
+    /* ignore max skew calculation on first few iterations */
+    static int skip_max_skew_calc = 100;
+
     prev_tsc          = t->local_tsc_stamp;
     prev_local_stime  = t->stime_local_stamp;
     prev_master_stime = t->stime_master_stamp;
@@ -844,6 +853,40 @@ static void local_time_calibration(void 
     curr_local_stime  = get_s_time();
     rdtscll(curr_tsc);
     local_irq_enable();
+
+    /*
+     * Record maximum stime skew from master processor.  Note that
+     * in the case of a fast local clock, skew reflects the post-adjusted
+     * skew (see below and get_s_time()), not the actual skew.  Also
+     * note that some processors may skew positive and others negative
+     * relative to master so skew between ANY pair of processors may be
+     * as much as 2x recorded max
+     */
+    if ( smp_processor_id() )
+    {
+        if ( skip_max_skew_calc > 0)
+            skip_max_skew_calc--;  /* allow calibration to converge */
+        else
+        {
+            curr_stime_skew = curr_master_stime - curr_local_stime;
+            if ( (s64) curr_stime_skew < 0 )
+                curr_stime_skew = - curr_stime_skew;
+            if ( curr_stime_skew > max_stime_skew )
+                max_stime_skew = curr_stime_skew;
+            if ( !curr_stime_skew )
+                stime_skew_zero_cnt++;
+            else if ( curr_stime_skew < 10 )
+                stime_skew_10_cnt++;
+            else if ( curr_stime_skew < 100 )
+                stime_skew_100_cnt++;
+            else if ( curr_stime_skew < 1000 )
+                stime_skew_1000_cnt++;
+            else if ( curr_stime_skew < 10000 )
+                stime_skew_10000_cnt++;
+            else
+                stime_skew_big_cnt++;
+        }
+    }
 
 #if 0
     printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
diff -r 08f77df14cba xen/common/keyhandler.c
--- a/xen/common/keyhandler.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/common/keyhandler.c	Thu Jul 03 13:29:54 2008 -0600
@@ -251,6 +251,9 @@ static void read_clocks(unsigned char ke
     unsigned int cpu = smp_processor_id(), min_cpu, max_cpu;
     u64 min, max, dif, difus;
     static DEFINE_SPINLOCK(lock);
+    extern u64 max_stime_skew, max_guest_time_skew;
+    extern u64 stime_skew_zero_cnt, stime_skew_10_cnt, stime_skew_100_cnt;
+    extern u64 stime_skew_1000_cnt, stime_skew_10000_cnt, stime_skew_big_cnt;
 
     spin_lock(&lock);
 
@@ -284,6 +287,14 @@ static void read_clocks(unsigned char ke
     printk("Min = %"PRIu64" ; Max = %"PRIu64" ; Diff = %"PRIu64
            " (%"PRIu64" microseconds)\n",
            min, max, dif, difus);
+    printk("Max stime skew = %"PRIu64"ns; Max guest stoppage = %"PRIu64"ns\n",
+           max_stime_skew, max_guest_time_skew);
+    printk("stime skew counts: 0=%"PRIu64"; ",stime_skew_zero_cnt);
+    printk("-10=%"PRIu64"; ",stime_skew_10_cnt);
+    printk("-100=%"PRIu64"; ",stime_skew_100_cnt);
+    printk("-1000=%"PRIu64"; ",stime_skew_1000_cnt);
+    printk("-10000=%"PRIu64"; ",stime_skew_10000_cnt);
+    printk(">10000=%"PRIu64"\n",stime_skew_big_cnt);
 }
 
 extern void dump_runq(unsigned char key);

* Re: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-03 20:03             ` Dan Magenheimer
@ 2008-07-03 23:00               ` Keir Fraser
  2008-07-04 15:11                 ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-03 23:00 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

Skipping cpu0 makes no sense. It's not the 'master'. master_stime is time
calculated from the platform timer (hpet, pit, or whatever). All cpus are
equal peers. Apart from that looks plausible to me.
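
In other words, something along these lines, where every cpu
(index 0 included) is compared against platform-derived time.
This is a simplified sketch: it assumes one snapshot of local
stime per cpu against a single master value, whereas the real
calibration samples each cpu separately, and max_skew_all_cpus
is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Largest |master - local| over all CPUs, in nsec.  Index 0 is treated
 * like every other CPU, since master time comes from the platform timer
 * and not from any one processor.
 */
static uint64_t max_skew_all_cpus(const int64_t *local_ns, size_t ncpus,
                                  int64_t master_ns)
{
    uint64_t max = 0;
    size_t cpu;

    assert(local_ns != NULL);
    for ( cpu = 0; cpu < ncpus; cpu++ )
    {
        int64_t skew = master_ns - local_ns[cpu];

        if ( skew < 0 )
            skew = -skew;
        if ( (uint64_t)skew > max )
            max = (uint64_t)skew;
    }
    return max;
}
```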

  -- Keir

On 3/7/08 21:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>>> IMHO, it would be nice to put this patch into the tree as it
>>> will be good for helping to diagnose time skew problems
>>> such as the one just reported on the list.
>> 
>> Oops!  Just after I sent the above email, I checked again and
>> the same machine (no reboots, no guests ever launched) now reports
>> a max stime skew of 4333ns!!  Methinks there might be some
>> periodic glitch in the calibration code?
> 
> OK this version records not only max but also a distribution
> of skew.  (The code is a bit ugly... I thought about doing
> something fancy with log-binary but decided a few base-10
> ranges were clearer for a human to read.)
> 
> With this, I use "watch -d 'xm debug-key t; xm dmesg | tail -3'"
> and can observe that (on my single-socket two-core recent-vintage
> Intel box) roughly three-quarters of the skew measurements are
> between 10-100nsec, roughly one-quarter are between 100ns-1us,
> a couple percent are between 1us-10us and a few are >10us.
> 
> This represents an approximate distribution of how long an hvm
> guest might observe time to be stopped (if it is able to repeatedly
> read time values quickly enough).
> 
> So on some machines, this might be substantially worse than the
> old hvm-platform-timer-built-on-tsc mechanism (though we had
> no monotonicity constraint built into that).
> 
> I wonder if the >1us outliers are occurring only if the
> processor has been idle for awhile, vs entirely random.
> 
> Dan

* RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-03 23:00               ` Keir Fraser
@ 2008-07-04 15:11                 ` Dan Magenheimer
  2008-07-04 15:22                   ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-04 15:11 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> Skipping cpu0 makes no sense.

Oops, I misunderstood that for some reason.

Here's a fixed version.  I also now preserve the "Platform timer is"
line since that can get flushed out of the dmesg buffer.

Any idea why the skew can get so bad?

Dan

* Re: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-04 15:11                 ` Dan Magenheimer
@ 2008-07-04 15:22                   ` Keir Fraser
  2008-07-04 19:32                     ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-04 15:22 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell




On 4/7/08 16:11, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Oops, I misunderstood that for some reason.
> 
> Here's a fixed version.  I also now preserve the "Platform timer is"
> line since that can get flushed out of the dmesg buffer.
> 
> Any idea why the skew can get so bad?

Not really. We could check in this patch or similar and perhaps collect more
information.

 -- Keir

* RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-04 15:22                   ` Keir Fraser
@ 2008-07-04 19:32                     ` Dan Magenheimer
  2008-07-04 19:56                       ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-04 19:32 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

[-- Attachment #1: Type: text/plain, Size: 842 bytes --]

> On 4/7/08 16:11, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> 
> > Oops, I misunderstood that for some reason.
> >
> > Here's a fixed version.  I also now preserve the "Platform timer is"
> > line since that can get flushed out of the dmesg buffer.

OOPS, forgot the patch!  Attached this time.

> > Any idea why the skew can get so bad?
> 
> Not really. We could check in this patch or similar and 
> perhaps collect more
> information.
> 
>  -- Keir

Well one suspicion I had was that very long hpet reads were
getting serialized, but I tried clocksource=acpi and
clocksource=pit and get similar skew range results.
In fact pit shows a max of >17000 vs hpet and acpi closer
to 11000.  (OTOH, I suppose it IS possible that this is
roughly how long it takes to read each of these platform
timers.)

Dan

[-- Attachment #2: maxskew4.patch --]
[-- Type: application/octet-stream, Size: 5842 bytes --]

diff -r 08f77df14cba xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/hvm/vpt.c	Fri Jul 04 09:01:27 2008 -0600
@@ -25,6 +25,8 @@
 #define mode_is(d, name) \
     ((d)->arch.hvm_domain.params[HVM_PARAM_TIMER_MODE] == HVMPTM_##name)
 
+u64 max_guest_time_skew = 0;
+
 void hvm_init_guest_time(struct domain *d)
 {
     struct pl_time *pl = &d->arch.hvm_domain.pl_time;
@@ -38,16 +40,22 @@ u64 hvm_get_guest_time(struct vcpu *v)
 {
     struct pl_time *pl = &v->domain->arch.hvm_domain.pl_time;
     u64 now;
+    int64_t skew;
 
     /* Called from device models shared with PV guests. Be careful. */
     ASSERT(is_hvm_vcpu(v));
 
     spin_lock(&pl->pl_time_lock);
     now = get_s_time() + pl->stime_offset;
-    if ( (int64_t)(now - pl->last_guest_time) >= 0 )
+    if ( ( skew = (int64_t)(now - pl->last_guest_time) ) >= 0 )
         pl->last_guest_time = now;
     else
+    {
+        skew = -skew;
+        if ( skew > max_guest_time_skew )
+            max_guest_time_skew = skew;
         now = pl->last_guest_time;
+    }
     spin_unlock(&pl->pl_time_lock);
 
     return now + v->arch.hvm_vcpu.stime_offset;
diff -r 08f77df14cba xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/arch/x86/time.c	Fri Jul 04 09:01:27 2008 -0600
@@ -69,6 +69,12 @@ static DEFINE_PER_CPU(struct cpu_time, c
 
 /* TSC is invariant on C state entry? */
 static bool_t tsc_invariant;
+
+/* global variables exported to report with debug-key t */
+u64 max_stime_skew = 0;
+u64 stime_skew_zero_cnt = 0,stime_skew_10_cnt = 0, stime_skew_100_cnt = 0;
+u64 stime_skew_1000_cnt = 0,stime_skew_10000_cnt = 0, stime_skew_big_cnt = 0;
+char platform_timer_info[80];
 
 /*
  * We simulate a 32-bit platform timer from the 16-bit PIT ch2 counter.
@@ -560,8 +566,12 @@ static void init_platform_timer(void)
 
     platform_timer_stamp = plt_stamp64;
 
-    printk("Platform timer is %s %s\n",
+    /* preserve for xm debug-key 't' */
+    snprintf(platform_timer_info, sizeof(platform_timer_info),
+           "Platform timer is %s %s",
            freq_string(pts->frequency), pts->name);
+
+    printk("%s\n",platform_timer_info);
 }
 
 void cstate_save_tsc(void)
@@ -808,6 +818,7 @@ static void local_time_calibration(void 
      */
     s_time_t prev_local_stime, curr_local_stime;
     s_time_t prev_master_stime, curr_master_stime;
+    s_time_t curr_stime_skew;
 
     /* TSC timestamps taken during this calibration and prev calibration. */
     u64 prev_tsc, curr_tsc;
@@ -831,6 +842,9 @@ static void local_time_calibration(void 
     /* The overall calibration scale multiplier. */
     u32 calibration_mul_frac;
 
+    /* ignore max skew calculation on first few iterations */
+    static int skip_max_skew_calc = 100;
+
     prev_tsc          = t->local_tsc_stamp;
     prev_local_stime  = t->stime_local_stamp;
     prev_master_stime = t->stime_master_stamp;
@@ -844,6 +858,37 @@ static void local_time_calibration(void 
     curr_local_stime  = get_s_time();
     rdtscll(curr_tsc);
     local_irq_enable();
+
+    /*
+     * Record maximum stime skew from platform timer.  Note that
+     * in the case of a fast local clock, skew reflects the post-adjusted
+     * skew (see below and get_s_time()), not the actual skew.  Also
+     * note that some processors may skew positive and others negative
+     * relative to platform timer so skew between ANY pair of processors may be
+     * as much as 2x recorded max
+     */
+    if ( skip_max_skew_calc > 0)
+        skip_max_skew_calc--;  /* allow calibration to converge */
+    else
+    {
+        curr_stime_skew = curr_master_stime - curr_local_stime;
+        if ( (s64) curr_stime_skew < 0 )
+            curr_stime_skew = - curr_stime_skew;
+        if ( curr_stime_skew > max_stime_skew )
+            max_stime_skew = curr_stime_skew;
+        if ( !curr_stime_skew )
+            stime_skew_zero_cnt++;
+        else if ( curr_stime_skew < 10 )
+            stime_skew_10_cnt++;
+        else if ( curr_stime_skew < 100 )
+            stime_skew_100_cnt++;
+        else if ( curr_stime_skew < 1000 )
+            stime_skew_1000_cnt++;
+        else if ( curr_stime_skew < 10000 )
+            stime_skew_10000_cnt++;
+        else
+            stime_skew_big_cnt++;
+    }
 
 #if 0
     printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
diff -r 08f77df14cba xen/common/keyhandler.c
--- a/xen/common/keyhandler.c	Wed Jul 02 11:30:37 2008 +0900
+++ b/xen/common/keyhandler.c	Fri Jul 04 09:01:27 2008 -0600
@@ -251,6 +251,10 @@ static void read_clocks(unsigned char ke
     unsigned int cpu = smp_processor_id(), min_cpu, max_cpu;
     u64 min, max, dif, difus;
     static DEFINE_SPINLOCK(lock);
+    extern u64 max_stime_skew, max_guest_time_skew;
+    extern u64 stime_skew_zero_cnt, stime_skew_10_cnt, stime_skew_100_cnt;
+    extern u64 stime_skew_1000_cnt, stime_skew_10000_cnt, stime_skew_big_cnt;
+    extern char platform_timer_info[80];
 
     spin_lock(&lock);
 
@@ -281,9 +285,18 @@ static void read_clocks(unsigned char ke
 
     dif = difus = max - min;
     do_div(difus, 1000);
+    printk("%s\n",platform_timer_info);
     printk("Min = %"PRIu64" ; Max = %"PRIu64" ; Diff = %"PRIu64
            " (%"PRIu64" microseconds)\n",
            min, max, dif, difus);
+    printk("Max stime skew = %"PRIu64"ns; Max guest stoppage = %"PRIu64"ns\n",
+           max_stime_skew, max_guest_time_skew);
+    printk("stime skew counts: 0=%"PRIu64"; ",stime_skew_zero_cnt);
+    printk("-10=%"PRIu64"; ",stime_skew_10_cnt);
+    printk("-100=%"PRIu64"; ",stime_skew_100_cnt);
+    printk("-1000=%"PRIu64"; ",stime_skew_1000_cnt);
+    printk("-10000=%"PRIu64"; ",stime_skew_10000_cnt);
+    printk(">10000=%"PRIu64"\n",stime_skew_big_cnt);
 }
 
 extern void dump_runq(unsigned char key);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-04 19:32                     ` Dan Magenheimer
@ 2008-07-04 19:56                       ` Keir Fraser
  2008-07-10  0:24                         ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-04 19:56 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 4/7/08 20:32, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Well one suspicion I had was that very long hpet reads were
> getting serialized, but I tried clocksource=acpi and
> clocksource=pit and get similar skew range results.
> In fact pit shows a max of >17000 vs hpet and acpi closer
> to 11000.  (OTOH, I suppose it IS possible that this is
> roughly how long it takes to read each of these platform
> timers.)

That ought to be easy to check. I would expect that the PIT, for example,
could take a couple of microseconds to access.

 -- Keir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-04 19:56                       ` Keir Fraser
@ 2008-07-10  0:24                         ` Dan Magenheimer
  2008-07-10  7:40                           ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-10  0:24 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> > Well one suspicion I had was that very long hpet reads were
> > getting serialized, but I tried clocksource=acpi and
> > clocksource=pit and get similar skew range results.
> > In fact pit shows a max of >17000 vs hpet and acpi closer
> > to 11000.  (OTOH, I suppose it IS possible that this is
> > roughly how long it takes to read each of these platform
> > timers.)
> 
> That ought to be easy to check. I would expect that the PIT, 
> for example,
> could take a couple of microseconds to access.
> 
>  -- Keir

(I haven't seen the patch applied... since it just collects
data, it would be nice if it was applied so others could
try it.)

To follow up on this, I tried a number of tests but wasn't
able to identify the problem and have given up (for now).
In case someone else starts looking at this (or if any of
my tests suggest a solution to someone), I thought I'd
document what I tried.

PROBLEM: Xen system time skew between each processor's local time
and platform time is generally "small" but "sometimes" gets
quite "large".  This is important because the larger the
skew, the more likely an hvm guest will experience time
stopping or (in some cases) time going backwards.
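
The way skew turns into a stalled guest clock can be sketched in
isolation.  This is a simplified, single-threaded model of the clamp
in hvm_get_guest_time(); the struct and function names are
hypothetical, and the real code holds pl_time_lock around the update.

```c
#include <stdint.h>

/* Hypothetical per-domain state: last value handed out, worst stall seen. */
struct guest_clock {
    uint64_t last_ns;
    uint64_t max_stall_ns;
};

/*
 * Return a non-decreasing guest time.  If the local reading "now_ns" is
 * behind the last value returned (e.g. this CPU's clock lags another's),
 * hand out the previous value again and record how far behind we were.
 */
static uint64_t guest_time_read(struct guest_clock *c, uint64_t now_ns)
{
    int64_t delta = (int64_t)(now_ns - c->last_ns);

    if (delta >= 0) {
        c->last_ns = now_ns;
    } else {
        uint64_t behind = 0 - (uint64_t)delta;   /* magnitude of the skew */

        if (behind > c->max_stall_ns)
            c->max_stall_ns = behind;
        now_ns = c->last_ns;                     /* guest sees time stop */
    }
    return now_ns;
}
```

A reading of 90 after a reading of 100 returns 100 again and records
a stall of 10; the guest sees no progress until some CPU reads a
value above 100.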

On my box, "small" is under 1 usec, "large" is 9-18 usec,
and "sometimes" is about one out of 500 measurements.  Note
that my box is a recent vintage Intel single-socket dual-core
("Conroe").

I suspect that periodically some lock is being waited on for
a long time, or maybe an unexpected interrupt is occurring,
but I didn't find anything through code reading or
experiments.

TEST METHOD: The patch I sent on this thread collects data
whenever local_time_calibration() is run (which is 1Hz on
each processor) and "xm debug-key t" prints this data
so it can be seen with "xm dmesg".  To see the problem,
one need only boot dom0 and run xm debug-key and xm dmesg.
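
As a rough standalone sketch of the bookkeeping the patch performs on
each calibration (the struct and function names here are illustrative
and are not the ones used in the patch):

```c
#include <stdint.h>

/* Illustrative container for the counters the patch keeps as globals. */
struct skew_stats {
    uint64_t max_ns;
    uint64_t zero, lt10, lt100, lt1000, lt10000, big;
};

/* Record |master - local| in nanoseconds: track the max and bucket it. */
static void skew_record(struct skew_stats *s, int64_t master_ns,
                        int64_t local_ns)
{
    int64_t d = master_ns - local_ns;
    uint64_t skew = d < 0 ? 0 - (uint64_t)d : (uint64_t)d;

    if (skew > s->max_ns)
        s->max_ns = skew;

    if (skew == 0)
        s->zero++;
    else if (skew < 10)
        s->lt10++;
    else if (skew < 100)
        s->lt100++;
    else if (skew < 1000)
        s->lt1000++;
    else if (skew < 10000)
        s->lt10000++;
    else
        s->big++;
}
```

Because the absolute value is taken, a CPU running ahead of platform
time and one running behind land in the same bucket, which is why the
skew between any pair of CPUs can be up to twice the recorded max.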

1) CONJECTURE: Related to how long it takes to read the
   platform timer

The max skew (and distribution) are definitely different
depending on whether clocksource=hpet or clocksource=pit.
For hpet, I am almost always seeing a max skew of 11000+
and with pit 17000+.  ONCE (over many hours of runs) I saw
a skew with hpet of 15000.  However, I added code in the
platform timer read routine (inside all locks but NOT with
interrupts off) to artificially lengthen a platform timer
read and it made no difference in the measurements.

2) CONJECTURE: Max skew only occurs on some processors (e.g.
   not on the one that does the platform calibration)

Nope, if you wait long enough max skew is fairly close
on all processors (though in some cases, it seems to
take a long time... perhaps because of unbalanced load?)

3) CONJECTURE: Max skew occurs on platform timer overflow.

Possibly, but there is certainly not a 1-1 correspondence.
Sometimes there are more large skews than overflows and
sometimes fewer.

4) CONJECTURE: Artifact of ntpd running

Nope, same skews whether ntpd is running on dom0 or not.

5) CONJECTURE: Related to frequency changes or suspends

Nope, none of these happening on my box.

6)  CONJECTURE: "Weirdness can happen" comment in time.c

Nope, this path isn't getting executed.

7) CONJECTURE: Result of natural skews between platform
timer and tsc, plus jitter.  Unfixable.

Possible, untested, not sure how.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing  hvm guest time)
  2008-07-10  0:24                         ` Dan Magenheimer
@ 2008-07-10  7:40                           ` Keir Fraser
  2008-07-10 22:42                             ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)) Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-10  7:40 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 10/7/08 01:24, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 7) CONJECTURE: Result of natural skews between platform
> timer and tsc, plus jitter.  Unfixable.
> 
> Possible, untested, not sure how.

I ended up suspecting this on one of the test platforms I originally did the
Xen-system-time implementation on. It was an old AMD white box iirc. On that
system, TSC and platform time seemed to have a significant and inexplicable
jitter at around 1Hz. The jitter was 100s of ppm, which was totally
unexpected for what should be crystal-based oscillators. And the test code
was simple enough that it was hard to suspect that either (I think I was
just dumping the counters every second or two after reading them as close
together as I could).

 -- Keir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-10  7:40                           ` Keir Fraser
@ 2008-07-10 22:42                             ` Dan Magenheimer
  2008-07-11  8:27                               ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-10 22:42 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> > 7) CONJECTURE: Result of natural skews between platform
> > timer and tsc, plus jitter.  Unfixable.
> >
> > Possible, untested, not sure how.
> 
> I ended up suspecting this on one of the test platforms I 
> originally did the
> Xen-system-time implementation on. It was an old AMD white 
> box iirc. On that
> system, TSC and platform time seemed to have a significant 
> and inexplicable
> jitter at around 1Hz. The jitter was 100s of ppm, which was totally
> unexpected for what should be crystal-based oscillators. And 
> the test code
> was simple enough that it was hard to suspect that either (I 
> think I was
> just dumping the counters every second or two after reading 
> them as close
> together as I could).

Is this the code in read_clocks() in keyhandler.c?  If so,
I just did an experiment there with some interesting results:

I modified that code to record the "max dif" and then executed
it >10000 times.  The result shows maxdif ~11usec which
corresponds with my earlier measurements.  Next, I replaced the
calls to NOW() in read_clocks() and read_clocks_slave() with
rdtscll().  Guess what?  The result is a maxdif of 11000 "ticks"
but now on a 3GHz clock, which is about 3.3usec.  Next, I disabled
interrupts in read_clocks_slave() around the while loop plus
the rdtscll() so that I ensure I'm not accidentally counting any
interrupts.  Now I'm seeing maxdif<330nsec (>6000 measurements).
Next, I go back to NOW(), but with interrupts disabled as above.
So far maxdif is about 10.7usec (>6000 measurements).

SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!

Looks to me like there's still something algorithmically wrong
and it's not just natural skew and jitter.  Maybe some corner
case in the scale-delta code?  Also, should interrupts be turned
off during the calibration part of init_pit_and_calibrate_tsc()
(which might cause different scaling factors for each CPU)?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-10 22:42                             ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)) Dan Magenheimer
@ 2008-07-11  8:27                               ` Keir Fraser
  2008-07-11 20:53                                 ` Dan Magenheimer
  2008-07-19 17:51                                 ` Dan Magenheimer
  0 siblings, 2 replies; 29+ messages in thread
From: Keir Fraser @ 2008-07-11  8:27 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 10/7/08 23:42, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!
> 
> Looks to me like there's still something algorithmically wrong
> and it's not just natural skew and jitter.  Maybe some corner
> case in the scale-delta code?  Also, should interrupts be turned
> off during the calibration part of init_pit_and_calibrate_tsc()
> (which might cause different scaling factors for each CPU)?

I didn't measure skew across CPUs. I measured jitter between one local TSC
and the chosen platform timer for calibration (in my case I think this was
the HPET). I did this because getting a consistent tick rate from the
platform timer, and from each local TSC, is the basis for the calibration
algorithm. The more jitter there is between them, the less well it will
work.

I implemented a user-space program to collect the required stats. It used
CLI/STI to prevent getting interrupted when reading the timer pair.

 -- Keir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-11  8:27                               ` Keir Fraser
@ 2008-07-11 20:53                                 ` Dan Magenheimer
  2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
  2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm " Keir Fraser
  2008-07-19 17:51                                 ` Dan Magenheimer
  1 sibling, 2 replies; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-11 20:53 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> I didn't measure skew across CPUs. I measured jitter between 
> one local TSC
> and the chosen platform timer for calibration (in my case I 
> think this was
> the HPET). I did this because getting a consistent tick rate from the
> platform timer, and from each local TSC, is the basis for the 
> calibration
> algorithm. The more jitter there is between them, the less 
> well it will
> work.
> 
> I implemented a user-space program to collect the required 
> stats. It used
> CLI/STI to prevent getting interrupted when reading the timer pair.

Hmmm... if the TSC is known to be stable*, is there any reason to
do the calibration vs the platform timer?  If TSC is stable,
could we instead just do essentially a divide by cpu_ghz in
get_s_time() and be done, no periodic local_time_calibration()
necessary?  Since TSC is stable on many newer platforms, it
would be nice to use this feature to decrease skew for guests
(both PV and HV).

* stable is the term used by Linux to mean that there's no
skew between the different TSCs in an SMP system

I gave this a try and it seems to work so far.  (Fortunately,
my CPU is 3GHz so I just had to divide by 3... I'm not sure
how to divide by a non-integer.)  Max skew for stime is holding
steady at 270nsec, >40x better than periodic calibration w/hpet.
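
For a clock rate that is not a whole number of GHz, one option is to
express the rate in kHz and stay in integer arithmetic.  The sketch
below is only an assumption about how that could look; tsc_to_ns and
tsc_khz are hypothetical names, and it is not the scale-delta code
the existing calibration path uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a TSC tick count to nanoseconds for a TSC running at tsc_khz.
 * ns = ticks * 1e6 / tsc_khz, split into quotient and remainder so the
 * intermediate product cannot overflow for realistic rates (the
 * remainder is below tsc_khz, so remainder * 1e6 fits in 64 bits for
 * any rate under roughly 18 THz).
 */
static uint64_t tsc_to_ns(uint64_t ticks, uint64_t tsc_khz)
{
    uint64_t whole, rem;

    assert(tsc_khz != 0);
    whole = ticks / tsc_khz;          /* full milliseconds' worth of ticks */
    rem   = ticks % tsc_khz;          /* leftover ticks, less than 1 ms */
    return whole * 1000000u + rem * 1000000u / tsc_khz;
}
```

With tsc_khz = 3000000 this reduces to the divide-by-3 case; with
tsc_khz = 2660000 (a 2.66GHz part) the same routine applies unchanged.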

If this sounds good, a design question:  Should this be
controlled:

1) by a boot option, or
2) by the TSC_CONSTANT cpu flag, or
3) when determined dynamically to be safe using code similar
   to arch/x86/tsc_sync.c in recent Linux kernels

(1) is by far the easiest (perhaps not too late for 3.3?)
while (3) is clearly the best for users but adds lots of
code (bloat/untested)

Thanks,
Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm guest time))
  2008-07-11 20:53                                 ` Dan Magenheimer
@ 2008-07-11 21:27                                   ` Ian Pratt
  2008-07-12 21:05                                     ` Dan Magenheimer
  2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm " Keir Fraser
  1 sibling, 1 reply; 29+ messages in thread
From: Ian Pratt @ 2008-07-11 21:27 UTC (permalink / raw)
  To: Dan Magenheimer, Keir Fraser, Xen-Devel (E-mail); +Cc: Ian Pratt, Dave Winchell

> Hmmm... if the TSC is known to be stable*, is there any reason to
> do the calibration vs the platform timer?  If TSC is stable,
> could we instead just do essentially a divide by cpu_ghz in
> get_s_time() and be done, no periodic local_time_calibration()
> necessary?  Since TSC is stable on many newer platforms, it
> would be nice to use this feature to decrease skew for guests
> (both PV and HV).
> 
> * stable is the term used by Linux to mean that there's no
> skew between the different TSCs in an SMP system

Some NUMA systems have different oscillators on each node so you can't
rely on the frequency being identical. Such systems are fairly rare
(though their common use case is server virtualization). I guess a
command line option to enable independent calibration for these systems
would be OK, though it would obviously be better to start off assuming
the frequencies are identical, and then detect rate differences. 

Ian

 
> I gave this a try and it seems to work so far.  (Fortunately,
> my CPU is 3GHz so I just had to divide by 3... I'm not sure
> how to divide by a non-integer.)  Max skew for stime is holding
> steady at 270nsec, >40x better than periodic calibration w/hpet.
> 
> If this sounds good, a design question:  Should this be
> controlled:
> 
> 1) by a boot option, or
> 2) by the TSC_CONSTANT cpu flag, or
> 3) when determined dynamically to be safe using code similar
>    to arch/x86/tsc_sync.c in recent Linux kernels
> 
> (1) is by far the easiest (perhaps not too late for 3.3?)
> while (3) is clearly the best for users but adds lots of
> code (bloat/untested)
> 
> Thanks,
> Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-11 20:53                                 ` Dan Magenheimer
  2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
@ 2008-07-11 21:27                                   ` Keir Fraser
  2008-07-12 21:07                                     ` Dan Magenheimer
  1 sibling, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-11 21:27 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 11/7/08 21:53, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 1) by a boot option, or
> 2) by the TSC_CONSTANT cpu flag, or
> 3) when determined dynamically to be safe using code similar
>    to arch/x86/tsc_sync.c in recent Linux kernels
> 
> (1) is by far the easiest (perhaps not too late for 3.3?)
> while (3) is clearly the best for users but adds lots of
> code (bloat/untested)

(1) is perhaps fine.

How does (2) work? The individual CPUs do not know whether they are
synchronised across the mainboard. I think constant-tsc is necessary
(individual CPUs must not vary their multiplier of the input clock rate) but
may not be sufficient.

I don't know how much code is involved in (3).

 -- Keir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm guest time))
  2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
@ 2008-07-12 21:05                                     ` Dan Magenheimer
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-12 21:05 UTC (permalink / raw)
  To: Ian Pratt, Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> Some NUMA systems have different oscillators on each node so you can't
> rely on the frequency being identical. Such systems are fairly rare
> (though their common use case is server virtualization). I guess a
> command line option to enable independent calibration for 
> these systems
> would be OK, though it would obviously be better to start off assuming
> the frequencies are identical, and then detect rate differences. 
> 
> Ian

Good point.  This is the way that Linux does it too, I think.

Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm " Keir Fraser
@ 2008-07-12 21:07                                     ` Dan Magenheimer
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-12 21:07 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> > 1) by a boot option, or
> > 2) by the TSC_CONSTANT cpu flag, or
> > 3) when determined dynamically to be safe using code similar
> >    to arch/x86/tsc_sync.c in recent Linux kernels
> >
> > (1) is by far the easiest (perhaps not too late for 3.3?)
> > while (3) is clearly the best for users but adds lots of
> > code (bloat/untested)
> 
> (1) is perhaps fine.

OK, patch to follow.  I've used "clocksource=tsc"
 
> How does (2) work? The individual CPUs do not know whether they are
> synchronised across the mainboard. I think constant-tsc is necessary
> (individual CPUs must not vary their multiplier of the input 
> clock rate) but
> may not be sufficient.

Good point.

> I don't know how much code is involved in (3).

It's enough that I will take the "easy way" for now (boot option)
and look at submitting a dynamically-evaluated patch later.

Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-11  8:27                               ` Keir Fraser
  2008-07-11 20:53                                 ` Dan Magenheimer
@ 2008-07-19 17:51                                 ` Dan Magenheimer
  2008-07-21  8:32                                   ` Keir Fraser
  1 sibling, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-19 17:51 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> > SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!
> >
> > Looks to me like there's still something algorithmically wrong
> > and it's not just natural skew and jitter.  Maybe some corner
> > case in the scale-delta code?  Also, should interrupts be turned
> > off during the calibration part of init_pit_and_calibrate_tsc()
> > (which might cause different scaling factors for each CPU)?
> 
> I didn't measure skew across CPUs. I measured jitter between 
> one local TSC
> and the chosen platform timer for calibration (in my case I 
> think this was
> the HPET). I did this because getting a consistent tick rate from the
> platform timer, and from each local TSC, is the basis for the 
> calibration
> algorithm. The more jitter there is between them, the less 
> well it will
> work.
> 
> I implemented a user-space program to collect the required 
> stats. It used
> CLI/STI to prevent getting interrupted when reading the timer pair.

Hi Keir -

I'm still looking at whether all of the inter-processor stime
skew I'm seeing is due to jitter or is algorithmic.

Would you expect system load to impact stime skew between
processors (using hpet as a system timer)?  I can repeatably
watch skew get worse when I am launching an hvm domain.  It is
MUCH worse when the new domain is in its early stages of booting.
CPU load on domain0 has little or no impact but I/O load
on dom0 seems to make skew get worse.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-19 17:51                                 ` Dan Magenheimer
@ 2008-07-21  8:32                                   ` Keir Fraser
  2008-07-22 22:27                                     ` Dan Magenheimer
  0 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2008-07-21  8:32 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Xen-Devel (E-mail); +Cc: Dave Winchell

On 19/7/08 18:51, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Would you expect system load to impact stime skew between
> processors (using hpet as a system timer)?  I can repeatably
> watch skew get worse when I am launching an hvm domain.  It is
> MUCH worse when the new domain is in its early stages of booting.
> CPU load on domain0 has little or no impact but I/O load
> on dom0 seems to make skew get worse.

Perhaps it makes a difference if it takes each CPU a bit longer to execute
the calibration function in softirq context? That could be delayed by long
hypercalls, for example (although long hypercalls should mostly be
preemptible).

 -- Keir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-21  8:32                                   ` Keir Fraser
@ 2008-07-22 22:27                                     ` Dan Magenheimer
  2008-07-22 23:07                                       ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
  2008-07-23  6:11                                       ` Tian, Kevin
  0 siblings, 2 replies; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-22 22:27 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> > Would you expect system load to impact stime skew between
> > processors (using hpet as a system timer)?  I can repeatably
> > watch skew get worse when I am launching an hvm domain.  It is
> > MUCH worse when the new domain is in its early stages of booting.
> > CPU load on domain0 has little or no impact but I/O load
> > on dom0 seems to make skew get worse.
> 
> Perhaps it makes a difference if it takes each CPU a bit 
> longer to execute
> the calibration function in softirq context? That could be 
> delayed by long
> hypercalls, for example (although long hypercalls should mostly be
> preemptible).

I'm not positive yet, but I think I have an explanation for
this.  The issue is not HOW LONG it takes to execute the
calibration function but WHEN relative to other processors
the calibration function executes.  If jitter on the platform
timer occurs and the (e.g. two) calibration functions are triggered
"temporally maximally distant" (e.g. cpu0 at 1.0, 2.0, 3.0
and cpu1 at 1.5, 2.5, 3.5), their differing slope during the
interim partial-second could result in greater skew.  Since activity
on a processor will result in different locks held, interrupts
on/off, etc, system load differences between processors are more
likely to cause distance to vary between the scheduled calibration
functions on each processor.
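
To put rough numbers on this theory, here is a small worked sketch.
It assumes each CPU extrapolates linearly from its own last
calibration with a slope that is off by some number of ppm; the
function names and the ppm values in the example are made up for
illustration, not measured.

```c
#include <stdint.h>

/* Error built up by extrapolating for elapsed_ns with a slope that is
 * off by rate_error_ppm parts per million. */
static int64_t extrapolation_error_ns(int64_t elapsed_ns,
                                      int64_t rate_error_ppm)
{
    return elapsed_ns * rate_error_ppm / 1000000;
}

/* Skew between two CPUs at time now_ns, given when each one last
 * calibrated and how far off each slope is. */
static int64_t pair_skew_ns(int64_t now_ns,
                            int64_t calib0_ns, int64_t ppm0,
                            int64_t calib1_ns, int64_t ppm1)
{
    return extrapolation_error_ns(now_ns - calib0_ns, ppm0) -
           extrapolation_error_ns(now_ns - calib1_ns, ppm1);
}
```

With cpu0 calibrated at 2.0s and running +20ppm fast, and cpu1
calibrated at 2.5s and running 20ppm slow, the skew at 2.75s is
15000ns - (-5000ns) = 20000ns.  If both had calibrated at 2.5s with
the same +20ppm error, the two errors would cancel and the skew would
be zero.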

(Worse, could maximal distance maybe result in harmonic
resonance?  The fact that I can observe the effect seems to
imply that it stays bad for a while.)

This is all still theoretical... I still have to figure out how to
measure this.  But does the theory make sense?

Perhaps some form of the proposed "deferrable timers" can
be used to ensure per-cpu calibration happens on different
processors at roughly the same moment?

Thanks,
Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-22 22:27                                     ` Dan Magenheimer
@ 2008-07-22 23:07                                       ` Ian Pratt
  2008-07-23  0:40                                         ` Dan Magenheimer
  2008-07-23  6:11                                       ` Tian, Kevin
  1 sibling, 1 reply; 29+ messages in thread
From: Ian Pratt @ 2008-07-22 23:07 UTC (permalink / raw)
  To: Dan Magenheimer, Keir Fraser, Xen-Devel (E-mail); +Cc: Ian Pratt, Dave Winchell

> 
> I'm not positive yet, but I think I have an explanation for
> this.  The issue is not HOW LONG it takes to execute the
> calibration function but WHEN relative to other processors
> the calibration function executes.  If jitter on the platform
> timer occurs and the (e.g. two) calibration functions are triggered
> "temporally maximally distant" (e.g. cpu0 at 1.0, 2.0, 3.0
> and cpu1 at 1.5, 2.5, 3.5), their differing slope during the
> interim partial-second could result in greater skew.  Since activity
> on a processor will result in different locks held, interrupts
> on/off, etc, system load differences between processors is more
> likely to cause distance to vary between the scheduled calibration
> functions on each processor.

If you want to test this theory, you can easily get all the CPUs to
recalibrate at the same instant, though it's a bit expensive:

Get one CPU to issue an smp_call_function on all CPUs (including
itself). The called function should atomic_inc a variable and then spin,
re-reading the count, until all CPUs have reached this point. When
this happens, turn interrupts off, atomic_dec the same counter, spin
until it hits zero, then read the TSC, re-enable interrupts, finish.

The TSC reads should all happen very close to each other. One of the
CPUs could read the platform timer after the TSC to tie everything
together.

The only thing that could mess this up would be NMI's or SMI's. You
could at least detect that by reading the TSC after all CPUs have
incremented the counter, and check that only a "reasonable" amount of
time had elapsed. If not, set a flag to indicate that a recalibration is
required (you'd need to add another gather loop to enable all CPUs to
vote on whether they're happy).

Ian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-22 23:07                                       ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm " Ian Pratt
@ 2008-07-23  0:40                                         ` Dan Magenheimer
  2008-07-23  1:16                                           ` Ian Pratt
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Magenheimer @ 2008-07-23  0:40 UTC (permalink / raw)
  To: Ian Pratt, Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

> If you want to test this theory, you can easily get all the CPUs to
> recalibrate at the same instant, though it's a bit expensive:
> 
> Get one CPU to issue an smp_call_function on all CPUs (including
> itself). The called function should atomic_inc a variable and then spin
> waiting reading the count until all CPUs have reached this point. When
> this happens, turn interrupts off, atomic_dec the same counter, spin
> until it hits zero, then read the TSC, re-enable interrupts, finish.
> The TSC reads should all happen very close to each other. 

The code invoked by "xm debug-key t" does exactly that and I've been
using it (as one way) to measure skew.  Any idea how expensive it is?
Is it too expensive to do once/second?  If it's not more expensive
than the (1Hz per processor) local_time_calibration(), perhaps we
should just use it to set TSC on all processors once/second and dispense
with the existing (beautiful but one additional frequency to resonate)
platform-timer-interpolated-by-tsc approach?

On the other hand, I'll bet the bigger the system, the more difficult
it is to rendezvous them... and the more natural skew there will be
between the sockets.

> The only thing that could mess this up would be NMI's or SMI's. You
> could at least detect that by reading the TSC after all CPUs have
> incremented the counter, and check that only a "reasonable" amount of
> time had elapsed. If not, set a flag to indicate that a recalibration is
> required (you'd need to add another gather loop to enable all CPUs to
> vote on whether they're happy).

I think I've seen this code in recent Linux.

But assuming we stay with the existing approach, I'm not sure
the processors need to be calibrated at "exactly" the same time,
just "close".  Something similar to "round jiffies" (see
http://lkml.org/lkml/2006/10/10/189) may be enough... though
I guess that depends on the character of the timesource jitter.
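
A minimal sketch of that idea, assuming deadlines are kept in nanoseconds: each CPU pushes its calibration deadline to the next whole-second boundary so that all CPUs land "close" together. The helper name is hypothetical and this is not Linux's round_jiffies(), which additionally staggers CPUs by a few jiffies so they do not all fire at exactly the same tick.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/*
 * Hypothetical helper: round a deadline up to the next whole-second
 * boundary. A deadline already on a boundary is returned unchanged.
 */
static uint64_t round_up_to_second(uint64_t deadline_ns)
{
    uint64_t rem = deadline_ns % NSEC_PER_SEC;

    return rem ? deadline_ns + (NSEC_PER_SEC - rem) : deadline_ns;
}
```

Two CPUs whose timers were armed for 1.2s and 1.5s would then both calibrate at 2.0s, at the cost of running up to one second late.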

Thanks,
Dan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-23  0:40                                         ` Dan Magenheimer
@ 2008-07-23  1:16                                           ` Ian Pratt
  0 siblings, 0 replies; 29+ messages in thread
From: Ian Pratt @ 2008-07-23  1:16 UTC (permalink / raw)
  To: Dan Magenheimer, Keir Fraser, Xen-Devel (E-mail); +Cc: Ian Pratt, Dave Winchell

> Is it too expensive to do once/second?  If it's not more expensive
> than the (1Hz per processor) local_time_calibration(), perhaps we
> should just use it to set TSC on all processors once/second and dispense
> with the existing (beautiful but one additional frequency to resonate)
> platform-timer-interpolated-by-tsc approach?

It doesn't need to be done very frequently, e.g. every 10-30s -- anytime
before the TSC wraps should work.

> On the other hand, I'll bet the bigger the system, the more difficult
> it is to rendezvous them... 

Yes, but it shouldn't be too horrendous -- we have to do stuff like this
for some (rare) synchronous TLB flushes anyhow. 

> and the more natural skew there will be between the sockets.

This skew will still be tiny, sub microsecond.

> > The only thing that could mess this up would be NMI's or SMI's. You
> > could at least detect that by reading the TSC after all CPUs have
> > incremented the counter, and check that only a "reasonable" amount of
> > time had elapsed. If not, set a flag to indicate that a
> > recalibration is
> > required (you'd need to add another gather loop to enable all CPUs to
> > vote on whether they're happy).
> 
> I think I've seen this code in recent Linux.

It's worth implementing this just to see how good a job we could do.

Ian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
  2008-07-22 22:27                                     ` Dan Magenheimer
  2008-07-22 23:07                                       ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
@ 2008-07-23  6:11                                       ` Tian, Kevin
  1 sibling, 0 replies; 29+ messages in thread
From: Tian, Kevin @ 2008-07-23  6:11 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, Xen-Devel (E-mail); +Cc: Dave Winchell

>From: Dan Magenheimer
>Sent: July 23, 2008 6:27
>
>Perhaps some form of the proposed "deferrable timers" can
>be used to ensure per-cpu calibration happens on different
>processors at roughly the same moment?
>

It can't. A deferrable timer is a per-cpu concept: it batches up
whatever can be deferred on the local cpu. Nothing in it coordinates
activity across cpus; for that you have to use some form of IPI
plus a self-defined sync process, as Ian suggested.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-07-23  6:11 UTC | newest]

Thread overview: 29+ messages
2008-07-02 16:03 [PATCH] strictly increasing hvm guest time Dan Magenheimer
2008-07-02 16:07 ` Keir Fraser
2008-07-02 21:50   ` Dan Magenheimer
2008-07-02 22:41     ` [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time) Dan Magenheimer
2008-07-03  8:03       ` Keir Fraser
2008-07-03 16:24         ` Dan Magenheimer
2008-07-03 16:35           ` Dan Magenheimer
2008-07-03 20:03             ` Dan Magenheimer
2008-07-03 23:00               ` Keir Fraser
2008-07-04 15:11                 ` Dan Magenheimer
2008-07-04 15:22                   ` Keir Fraser
2008-07-04 19:32                     ` Dan Magenheimer
2008-07-04 19:56                       ` Keir Fraser
2008-07-10  0:24                         ` Dan Magenheimer
2008-07-10  7:40                           ` Keir Fraser
2008-07-10 22:42                             ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)) Dan Magenheimer
2008-07-11  8:27                               ` Keir Fraser
2008-07-11 20:53                                 ` Dan Magenheimer
2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
2008-07-12 21:05                                     ` Dan Magenheimer
2008-07-11 21:27                                   ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm " Keir Fraser
2008-07-12 21:07                                     ` Dan Magenheimer
2008-07-19 17:51                                 ` Dan Magenheimer
2008-07-21  8:32                                   ` Keir Fraser
2008-07-22 22:27                                     ` Dan Magenheimer
2008-07-22 23:07                                       ` Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasinghvm " Ian Pratt
2008-07-23  0:40                                         ` Dan Magenheimer
2008-07-23  1:16                                           ` Ian Pratt
2008-07-23  6:11                                       ` Tian, Kevin
