* Time skew on HP DL785 (and possibly other boxes)
@ 2009-03-27 20:49 Dan Magenheimer
  2009-03-27 22:36 ` Jeremy Fitzhardinge
  2009-03-28  2:29 ` Tian, Kevin
  0 siblings, 2 replies; 18+ messages in thread
From: Dan Magenheimer @ 2009-03-27 20:49 UTC
  To: Xen-Devel (E-mail); +Cc: john.v.morris

(Raising a yellow flag because this could turn into
a serious issue for Xen and it may take quite a bit
of work to come up with a solution.)

We recently measured Xen system time skew on an HP DL785
and found it to be horrible... nearly a quarter millisecond
worst case (with only about 10000 samples so it may get worse).

This box uses 8 quad-core AMD chips connected via
hypertransport.  BUT each chip is on a separate motherboard.
On this system hypertransport is fast and cross-node
memory accesses are fast enough so that these NUMA systems
need not behave like NUMA systems from a memory access
perspective.  So Xen just views the system as a 32-cpu box
(other than some code in the memory allocator that tries
to allocate near-memory where possible, but silently falls
back to far-memory if necessary) and guest vcpus migrate
freely between the nodes.  (Correct?)

However, I'm told that it's not possible to route a clocksource
over hypertransport, so TSCs on processors on different
motherboards may be VERY different and apparently the
mechanisms for synchronizing Xen system time across
motherboards may not be up to the challenge.  As a result,
OSes and apps sensitive to time that are running on PV
domains may be in for a rough ride on systems like this.
(HVM domains may run into other problems because time will
apparently stop for a "long time".)

Since systems like this are targeted for consolidation
and virtualization, I see this as a potentially big problem
as it may appear to real Xen customers as bizarre
non-reproducible problems, such as "make" failing,
leading to questions about the stability and viability
of using Xen.

Comments?

Dan

* Re: Time skew on HP DL785 (and possibly other boxes)
  2009-03-27 20:49 Time skew on HP DL785 (and possibly other boxes) Dan Magenheimer
@ 2009-03-27 22:36 ` Jeremy Fitzhardinge
  2009-04-03 22:23   ` Dan Magenheimer
  2009-03-28  2:29 ` Tian, Kevin
  1 sibling, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-27 22:36 UTC
  To: Dan Magenheimer; +Cc: john.v.morris, Xen-Devel (E-mail)

Dan Magenheimer wrote:
> However, I'm told that it's not possible to route a clocksource
> over hypertransport, so TSCs on processors on different
> motherboards may be VERY different and apparently the
> mechanisms for synchronizing Xen system time across
> motherboards may not be up to the challenge.  As a result,
> OSes and apps sensitive to time that are running on PV
> domains may be in for a rough ride on systems like this.
> (HVM domains may run into other problems because time will
> apparently stop for a "long time".)
>   

I don't see what the problem is.  If each individual cpu has well known
tsc parameters (rate and offset), then a PV client will get those timing
parameters and use them to compute its time.  It doesn't matter if they're
synchronized between cpus or nodes.
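
A minimal sketch of that per-cpu computation, assuming each cpu publishes a record holding a TSC snapshot, the system time at that snapshot, and a fixed-point scale factor. The struct, field, and function names are hypothetical (they follow the general idea of a per-vcpu time record and are not copied from Xen's ABI):

```c
#include <stdint.h>

/* Hypothetical per-cpu timing record; names are illustrative only. */
struct cpu_time_params {
    uint64_t tsc_timestamp;   /* this cpu's TSC when the record was taken */
    uint64_t system_time_ns;  /* system time (ns) at that same instant */
    uint32_t tsc_to_ns_mul;   /* 0.32 fixed-point multiplier (ns per tick) */
    int8_t   tsc_shift;       /* power-of-two pre-scaling of the tick delta */
};

/* Convert a TSC tick delta to nanoseconds: (delta << shift) * mul / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;

    /* Split the 64x32-bit multiply so the intermediate fits in 64 bits. */
    return ((delta >> 32) * mul) + (((delta & 0xffffffffULL) * mul) >> 32);
}

/* Time as seen by a guest running on the cpu that owns record p. */
uint64_t guest_time_ns(const struct cpu_time_params *p, uint64_t tsc_now)
{
    return p->system_time_ns +
           scale_delta(tsc_now - p->tsc_timestamp,
                       p->tsc_to_ns_mul, p->tsc_shift);
}
```

Two cpus with different rates and unrelated TSC offsets give the same answer as long as each record is accurate, e.g. a 2 GHz cpu using (mul 0x80000000, shift 0) and a 1 GHz cpu using (mul 0x80000000, shift 1).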

Xen will need to calibrate each of them against a good reference
(hpet?), but that's no different from now.  I guess it's possible that
this system has more variation and latency for hpet access, which may
mean that the calibration algorithm needs tweaking.
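
A rough sketch of what such a calibration boils down to, assuming the TSC and the reference counter are both sampled at the start and end of the same interval. The function names, and the idea of expressing read jitter as a ppm rate error, are illustrative assumptions rather than Xen's actual algorithm:

```c
#include <stdint.h>

/*
 * Estimate a cpu's TSC frequency in kHz from tick deltas measured over the
 * same interval on the TSC and on a reference counter running at ref_hz.
 * Returns 0 if the reference did not advance. tsc_delta * ref_hz must fit
 * in 64 bits, which holds for sub-second intervals at GHz rates.
 */
uint64_t calibrate_tsc_khz(uint64_t tsc_delta, uint64_t ref_delta,
                           uint64_t ref_hz)
{
    if (ref_delta == 0)
        return 0;
    return tsc_delta * ref_hz / ref_delta / 1000;
}

/*
 * Rate error (in ppm) caused by jitter_ns of uncertainty about when the
 * reference was actually read, over a calibration interval of interval_ns.
 */
uint64_t rate_error_ppm(uint64_t jitter_ns, uint64_t interval_ns)
{
    if (interval_ns == 0)
        return 0;
    return jitter_ns * 1000000ULL / interval_ns;
}
```

For example, 1 us of extra uncertainty in the reference read over a 50 ms interval is a 20 ppm rate error, which by itself would accumulate to 20 us of drift per second until the next calibration.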

Of course, if the tsc rates on each cpu are changing in some
unpredictable way then that's a whole other barrel of problems.  Guests
rely on Xen maintaining accurate tsc timing parameters.

> Since systems like this are targeted for consolidation
> and virtualization, I see this as a potentially big problem
> as it may appear to real Xen customers as bizarre
> non-reproducible problems, such as "make" failing,
> leading to questions about the stability and viability
> of using Xen.
>
> Comments?
>   

In Linux there's this function:

/*
 * apic_is_clustered_box() -- Check if we can expect good TSC
 *
 * Thus far, the major user of this is IBM's Summit2 series:
 *
 * Clustered boxes may have unsynced TSC problems if they are
 * multi-chassis. Use available data to take a good guess.
 * If in doubt, go HPET.
 */
__cpuinit int apic_is_clustered_box(void)
{...}


Which deals with Summit2 and ScaleMP vSMP systems, which also have
unsynchronized tscs across nodes.  At the moment it assumes that no
non-VSMP AMD system has unsynchronized tscs; sounds like it will need
updating for this system.

    J

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-03-27 20:49 Time skew on HP DL785 (and possibly other boxes) Dan Magenheimer
  2009-03-27 22:36 ` Jeremy Fitzhardinge
@ 2009-03-28  2:29 ` Tian, Kevin
  2009-03-31 22:08   ` Dan Magenheimer
  1 sibling, 1 reply; 18+ messages in thread
From: Tian, Kevin @ 2009-03-28  2:29 UTC
  To: Dan Magenheimer, Xen-Devel (E-mail); +Cc: john.v.morris@hp.com

>From: Dan Magenheimer
>Sent: March 28, 2009 4:50
>
>(Raising a yellow flag because this could turn into
>a serious issue for Xen and it may take quite a bit
>of work to come up with a solution.)
>
>We recently measured Xen system time skew on an HP DL785
>and found it to be horrible... nearly a quarter millisecond
>worst case (with only about 10000 samples so it may get worse).
>
>This box uses 8 quad-core AMD chips connected via
>hypertransport.  BUT each chip is on a separate motherboard.
>On this system hypertransport is fast and cross-node
>memory accesses are fast enough so that these NUMA systems
>need not behave like NUMA systems from a memory access
>perspective.  So Xen just views the system as a 32-cpu box
>(other than some code in the memory allocator that tries
>to allocate near-memory where possible, but silently falls
>back to far-memory if necessary) and guest vcpus migrate
>freely between the nodes.  (Correct?)

In that case the user had better enable the NUMA aware bits in Xen, which
impose some affinity limitations but look like a reasonable model on
large-scale systems.

Thanks,
Kevin

>
>However, I'm told that it's not possible to route a clocksource
>over hypertransport, so TSCs on processors on different
>motherboards may be VERY different and apparently the
>mechanisms for synchronizing Xen system time across
>motherboards may not be up to the challenge.  As a result,
>OSes and apps sensitive to time that are running on PV
>domains may be in for a rough ride on systems like this.
>(HVM domains may run into other problems because time will
>apparently stop for a "long time".)
>
>Since systems like this are targeted for consolidation
>and virtualization, I see this as a potentially big problem
>as it may appear to real Xen customers as bizarre
>non-reproducible problems, such as "make" failing,
>leading to questions about the stability and viability
>of using Xen.
>
>Comments?
>
>Dan
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xensource.com
>http://lists.xensource.com/xen-devel
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-03-28  2:29 ` Tian, Kevin
@ 2009-03-31 22:08   ` Dan Magenheimer
  2009-03-31 22:48     ` Tian, Kevin
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Magenheimer @ 2009-03-31 22:08 UTC
  To: Tian, Kevin, Xen-Devel (E-mail); +Cc: john.v.morris

> >This box uses 8 quad-core AMD chips connected via
> >hypertransport.  BUT each chip is on a separate motherboard.
> >On this system hypertransport is fast and cross-node
> >memory accesses are fast enough so that these NUMA systems
> >need not behave like NUMA systems from a memory access
> >perspective.  So Xen just views the system as a 32-cpu box
> >(other than some code in the memory allocator that tries
> >to allocate near-memory where possible, but silently falls
> >back to far-memory if necessary) and guest vcpus migrate
> >freely between the nodes.  (Correct?)
>
> In that case the user had better enable the NUMA aware bits in Xen, which
> impose some affinity limitations but look like a reasonable model
> on large-scale systems.
>
> Thanks,
> Kevin

Hi Kevin --

Are you suggesting that only NUMA-aware guests should be
run on systems like this?  If not, what do you mean by
"NUMA aware bits"?

Thanks,
Dan

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-03-31 22:08   ` Dan Magenheimer
@ 2009-03-31 22:48     ` Tian, Kevin
  2009-03-31 23:21       ` Dan Magenheimer
  0 siblings, 1 reply; 18+ messages in thread
From: Tian, Kevin @ 2009-03-31 22:48 UTC
  To: Dan Magenheimer, Xen-Devel (E-mail); +Cc: john.v.morris@hp.com

>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] 
>Sent: April 1, 2009 6:08
>
>> >This box uses 8 quad-core AMD chips connected via
>> >hypertransport.  BUT each chip is on a separate motherboard.
>> >On this system hypertransport is fast and cross-node
>> >memory accesses are fast enough so that these NUMA systems
>> >need not behave like NUMA systems from a memory access
>> >perspective.  So Xen just views the system as a 32-cpu box
>> >(other than some code in the memory allocator that tries
>> >to allocate near-memory where possible, but silently falls
>> >back to far-memory if necessary) and guest vcpus migrate
>> >freely between the nodes.  (Correct?)
>>
>> In that case the user had better enable the NUMA aware bits in Xen, which
>> impose some affinity limitations but look like a reasonable model
>> on large-scale systems.
>>
>> Thanks,
>> Kevin
>
>Hi Kevin --
>
>Are you suggesting that only NUMA-aware guests should be
>run on systems like this?  If not, what do you mean by
>"NUMA aware bits"?
>

No, I meant the physical NUMA features in Xen. IIRC, once NUMA support is
turned on in Xen (by the "numa" boot option), each guest is confined to one
node automatically: its cpu affinity matches only that node, and its memory
is allocated locally within that node.

Thanks,
Kevin

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-03-31 22:48     ` Tian, Kevin
@ 2009-03-31 23:21       ` Dan Magenheimer
  0 siblings, 0 replies; 18+ messages in thread
From: Dan Magenheimer @ 2009-03-31 23:21 UTC
  To: Tian, Kevin, Xen-Devel (E-mail); +Cc: john.v.morris

> >> In that case the user had better enable the NUMA aware bits in Xen, which
> >> impose some affinity limitations but look like a reasonable model
> >> on large-scale systems.
> >
> >Are you suggesting that only NUMA-aware guests should be
> >run on systems like this?  If not, what do you mean by
> >"NUMA aware bits"?
> >
>
> No, I meant the physical NUMA features in Xen. IIRC, once NUMA support is
> turned on in Xen (by the "numa" boot option), each guest is confined to one
> node automatically: its cpu affinity matches only that node, and its memory
> is allocated locally within that node.
>
> Thanks,
> Kevin

OK, I see.  That seems too restrictive when the
interprocessor link is very fast like HT or QPI.
I hope there is a solution that will allow xen system
time to be fairly accurate and synchronized in this
kind of system without depending on tsc.

Dan

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-03-27 22:36 ` Jeremy Fitzhardinge
@ 2009-04-03 22:23   ` Dan Magenheimer
  2009-04-05  7:56     ` Keir Fraser
  2009-04-05 12:59     ` Tian, Kevin
  0 siblings, 2 replies; 18+ messages in thread
From: Dan Magenheimer @ 2009-04-03 22:23 UTC
  To: Jeremy Fitzhardinge; +Cc: john.v.morris, Tian, Kevin, Xen-Devel (E-mail)

I think I still have a real concern here.  Let me see if
I can explain.

The goal for Xen timekeeping is to ensure that if a guest
could somehow magically read any of its virtual clocks
(tsc, pit, hpet, pmtimer, ??) on all its virtual processors
simultaneously, the values read must always obey this
"virtual clock law":

   max - min < delta

We can argue how large that delta can reasonably be and it
may vary depending on what the workload is, but
it's certainly under a millisecond, ten microseconds
might not be a bad starting point, and it is getting
smaller as processors get faster.
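
As a sketch, the law can be written as a check over one reading per virtual processor. The function name is hypothetical, and the readings are assumed to have been taken at the same instant and already converted to nanoseconds:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if simultaneous clock readings (one per vcpu, in ns) satisfy
 * max - min < delta_ns. An empty set of readings trivially obeys the law.
 */
bool obeys_virtual_clock_law(const uint64_t *readings_ns, size_t n,
                             uint64_t delta_ns)
{
    if (n == 0)
        return true;

    uint64_t min = readings_ns[0];
    uint64_t max = readings_ns[0];

    for (size_t i = 1; i < n; i++) {
        if (readings_ns[i] < min)
            min = readings_ns[i];
        if (readings_ns[i] > max)
            max = readings_ns[i];
    }
    return (max - min) < delta_ns;
}
```

With delta set to ten microseconds, the quarter-millisecond skew measured on the DL785 would fail this check.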

If xen can't guarantee that, then it must turn on "numa"
mode, which appears to me to be extremely restrictive
and no system vendor could honestly sell the true
promise of virtualization on such a box.  So we'd like
to avoid that if possible.

Now HP DL785-like designs are likely to become more common
because an HT/QPI interconnect makes it possible to build
a single model that is low cost but very expandable.
Such boxes use multiple motherboards because it's much easier
to expand by adding field replaceable units.

Unfortunately, the current Xen system time model (which I think
is also used by kvm?) may not be scalable to these boxes.
If the current Xen system time algorithm is scalable, great.
We are done.  If it can be tweaked to be scalable, great,
no problem.  But if the model needs to be changed substantially,
for example if everything needs to be built on a platform
timer because we just can't guarantee the "virtual clock law",
then we may have a real problem... and not just performance.

Why?  Because the "paravirtual clock" API is hard-coded
in every existing PV domain... and in current and future
versions of the linux kernel (and probably in Windows too?).
If the new model is unable to use the same API, every
prepackaged VM is broken.

So I think we need to be very sure that we either:

A) do not need to change the xen system time model to
   ensure the "virtual clock law" can be obeyed on
   such boxes, or
B) DO need to change the xen system time model, but the
   paravirtual clock API does NOT need to change, or
C) modify/augment the paravirtual clock API and start
   getting the updated version into guests/kernels asap, or
D) ensure that system vendors know that Xen will never run
   guests reliably on such a box, without restricting
   operation to NUMA mode

Note that the Linux approach doesn't work here
because: 1) a guest's clocks might obey the "virtual clock
law" at one moment on one set of physical processors
and not at the next moment; 2) a guest's access to all
clocks (except the tsc) is emulated so even if a guest
decides the tsc is unreliable, that just doesn't help
if the alternate clock it chooses (e.g. HPET) is silently
emulated on top of xen system time using the physical tsc.

Now does that make my concern more clear?

Thanks,
Dan

* Re: Time skew on HP DL785 (and possibly other boxes)
  2009-04-03 22:23   ` Dan Magenheimer
@ 2009-04-05  7:56     ` Keir Fraser
  2009-04-05 12:17       ` Tian, Kevin
                         ` (2 more replies)
  2009-04-05 12:59     ` Tian, Kevin
  1 sibling, 3 replies; 18+ messages in thread
From: Keir Fraser @ 2009-04-05  7:56 UTC
  To: Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Tian, Kevin, Xen-Devel (E-mail)

On 03/04/2009 23:23, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I think I still have a real concern here.  Let me see if
> I can explain.
> 
> The goal for Xen timekeeping is to ensure that if a guest
> could somehow magically read any of its virtual clocks
> (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
> simultaneously, the values read must always obey this
> "virtual clock law":

We can do this for all except TSC for HVM guests because there virtual TSC
is hardwired onto the physical TSC (plus a configurable offset). If TSCs run
at significantly different rates then that will be hard to hide from the
guest. Luckily Windows is pretty robust to iffy timers, and no doubt
particularly suspicious of TSCs in multiprocessor environments.

Everything else builds on Xen system time, and Xen system time should just
require each CPU's TSC to be individually stable. This is true even with
your 3.3 patch to rendezvous and snapshot all TSCs at the same instant in
time. This doesn't rely on all TSCs running at the same rate! The approach
should work just as well if they run at their own separate stable rates off
separate crystals. I think the benefit of your patch was in sync'ing system
time across all CPUs at the same time, which significantly reduced maximum
divergence.
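
To illustrate the argument above, a small sketch, assuming every cpu records its own TSC at a common rendezvous instant and knows its own stable rate. The names are hypothetical and the arithmetic is simplified to integer kHz math:

```c
#include <stdint.h>

/* Hypothetical per-cpu stamp taken at a rendezvous of all cpus. */
struct cpu_stamp {
    uint64_t tsc;      /* this cpu's TSC at the rendezvous */
    uint64_t ns;       /* system time agreed on at the rendezvous */
    uint64_t tsc_khz;  /* this cpu's own calibrated rate; must be non-zero */
};

/* Extrapolate system time on one cpu from its own stamp and its own rate. */
uint64_t stamp_to_ns(const struct cpu_stamp *s, uint64_t tsc_now)
{
    return s->ns + (tsc_now - s->tsc) * 1000000ULL / s->tsc_khz;
}

/* Absolute difference between the system times two cpus would report. */
uint64_t skew_ns(const struct cpu_stamp *a, uint64_t tsc_a,
                 const struct cpu_stamp *b, uint64_t tsc_b)
{
    uint64_t ta = stamp_to_ns(a, tsc_a);
    uint64_t tb = stamp_to_ns(b, tsc_b);

    return ta > tb ? ta - tb : tb - ta;
}
```

If one cpu ticks at 2.0 GHz and another at 2.3 GHz, both extrapolate to the same system time provided each uses its own rate. If the second cpu is wrongly assumed to run at 2.0 GHz as well, the two diverge by 150 us after only 1 ms.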

One concern I have however, is Intel's X86_FEATURE_CONSTANT_TSC logic. This
was added by them to prevent TSCs from diverging due to Cx deep sleep
states, by observing that usually all TSCs will tick at the same exact rate,
so all that needs to be done is to rewrite all AP TSCs to that of the BP
periodically. This seems to work well on small systems, but the trigger for
this mode is rather suspicious. CONSTANT_TSC feature means that a CPU's TSC
is invariant across frequency/voltage changes -- it *doesn't* mean that all
TSCs across a large MP box are at matched frequency! I wonder whether this
optimisation will bite us on big iron? Probably it ought to disable itself
if it detects significant TSC divergence, or at the very least maybe we
should add a command-line option to disable (or enable?) it.
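
A sketch of the kind of self-check suggested above, assuming the BP and an AP each record how far their TSC advanced between the same two rendezvous points. The function names and the ppm threshold are hypothetical; this is not the logic of any existing Xen changeset:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Relative rate difference of an AP's TSC versus the BP's, in parts per
 * million, from the tick deltas both observed over the same interval.
 * Positive means the AP runs fast. Returns 0 if the BP delta is zero.
 */
int64_t tsc_rate_diff_ppm(uint64_t bp_delta, uint64_t ap_delta)
{
    if (bp_delta == 0)
        return 0;

    int64_t diff = (int64_t)ap_delta - (int64_t)bp_delta;

    return diff * 1000000 / (int64_t)bp_delta;
}

/*
 * Rewriting AP TSCs from the BP only makes sense if both tick at
 * (nearly) the same rate; otherwise the optimisation should turn itself off.
 */
bool tsc_rewrite_sync_is_safe(uint64_t bp_delta, uint64_t ap_delta,
                              int64_t max_ppm)
{
    int64_t ppm = tsc_rate_diff_ppm(bp_delta, ap_delta);

    if (ppm < 0)
        ppm = -ppm;
    return ppm <= max_ppm;
}
```

With a 10 ppm threshold, an AP that gains 100000 ticks on the BP over 2000000000 BP ticks (50 ppm) would cause the rewrite mode to be rejected.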

 -- Keir

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05  7:56     ` Keir Fraser
@ 2009-04-05 12:17       ` Tian, Kevin
  2009-04-05 13:27         ` Keir Fraser
  2009-04-05 12:41       ` Tian, Kevin
  2009-04-06 14:34       ` Dan Magenheimer
  2 siblings, 1 reply; 18+ messages in thread
From: Tian, Kevin @ 2009-04-05 12:17 UTC
  To: Keir Fraser, Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail)

>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>Sent: April 5, 2009 15:56
>
>One concern I have however, is Intel's 
>X86_FEATURE_CONSTANT_TSC logic. This
>was added by them to prevent TSCs from diverging due to Cx deep sleep
>states, by observing that usually all TSCs will tick at the 
>same exact rate,

One correction here: the constant-TSC logic was introduced for
P-states rather than C-states, so that the TSC always steps at a
constant pace on a given processor, regardless of which
operating point the cpufreq governor requests. It says nothing
about all TSCs ticking at the same rate, however.

>so all that needs to be done is to rewrite all AP TSCs to that 
>of the BP
>periodically. This seems to work well on small systems, but 
>the trigger for
>this mode is rather suspicious. CONSTANT_TSC feature means 
>that a CPU's TSC
>is invariant across frequency/voltage changes -- it *doesn't* 
>mean that all
>TSCs across a large MP box are at matched frequency! I wonder 

You're exactly right here. Using it does require that all cpus
are driven by a single crystal, which is not true for a large
system with multiple crystals. So this approach (syncing all TSCs
to minimize skew caused by TSCs stopping in deep C-states)
doesn't work in all cases.

>whether this
>optimisation will bite us on big iron? Probably it ought to 
>disable itself
>if it detects significant TSC divergence, or at the very least maybe we
>should add a command-line option to disable (or enable?) it.

I guess things won't be that bad in this C-state-specific
area. Large systems based on Intel Core i7 processors or later
always have the invariant TSC feature (non-stop TSC) integrated, so
no software recovery is required; meanwhile, IIRC, earlier
large-scale servers don't implement deep C-states (>=C3),
so it's not an issue there either. :-)

Thanks,
Kevin

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05  7:56     ` Keir Fraser
  2009-04-05 12:17       ` Tian, Kevin
@ 2009-04-05 12:41       ` Tian, Kevin
  2009-04-05 12:43         ` Tian, Kevin
  2009-04-06 14:34       ` Dan Magenheimer
  2 siblings, 1 reply; 18+ messages in thread
From: Tian, Kevin @ 2009-04-05 12:41 UTC
  To: Keir Fraser, Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail)

>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>Sent: April 5, 2009 15:56
>
>On 03/04/2009 23:23, "Dan Magenheimer" 
><dan.magenheimer@oracle.com> wrote:
>
>> I think I still have a real concern here.  Let me see if
>> I can explain.
>> 
>> The goal for Xen timekeeping is to ensure that if a guest
>> could somehow magically read any of its virtual clocks
>> (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
>> simultaneously, the values read must always obey this
>> "virtual clock law":
>
>We can do this for all except TSC for HVM guests because there 
>virtual TSC
>is hardwired onto the physical TSC (plus a configurable 
>offset). If TSCs run
>at significantly different rates then that will be hard to 
>hide from the
>guest. Luckily Windows is pretty robust to iffy timers, and no doubt
>particularly suspicious of TSCs in multiprocessor environments.
>

In that case Xen had better provide some hints so that an
HVM guest recognizes the TSC as an unreliable timer source and
falls back to other virtual platform timers (since even keeping
the tsc would now require emulation for every access, which would
give the guest a wrong illusion and would also be harder to emulate
accurately due to the assumed high frequency). Although extra
overhead could be incurred, that's the fact if HVM can be
assured of affinity to one node or to several nodes with a known
same frequency...

Thanks,
Kevin

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05 12:41       ` Tian, Kevin
@ 2009-04-05 12:43         ` Tian, Kevin
  0 siblings, 0 replies; 18+ messages in thread
From: Tian, Kevin @ 2009-04-05 12:43 UTC
  To: Tian, Kevin, Keir Fraser, Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail)

>From: Tian, Kevin
>Sent: April 5, 2009 20:41
>
>>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>>Sent: April 5, 2009 15:56
>>
>>On 03/04/2009 23:23, "Dan Magenheimer" 
>><dan.magenheimer@oracle.com> wrote:
>>
>>> I think I still have a real concern here.  Let me see if
>>> I can explain.
>>> 
>>> The goal for Xen timekeeping is to ensure that if a guest
>>> could somehow magically read any of its virtual clocks
>>> (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
>>> simultaneously, the values read must always obey this
>>> "virtual clock law":
>>
>>We can do this for all except TSC for HVM guests because there 
>>virtual TSC
>>is hardwired onto the physical TSC (plus a configurable 
>>offset). If TSCs run
>>at significantly different rates then that will be hard to 
>>hide from the
>>guest. Luckily Windows is pretty robust to iffy timers, and no doubt
>>particularly suspicious of TSCs in multiprocessor environments.
>>
>
>In that case then Xen'd better figure out some hints to have
>HVM guest recognize TSC as unreliable timer source, and 
>then fall back to other virtual platform timers (since even keeping
>tsc still require emulation for every access now, which would
>give wrong illusion to guest and also be harder to be accurately
>emulated due to assumed high frequency). Although extra
>overhead could be incurred, that's the fact if HVM can be
                                                                         ^^^^
I meant 'can't be' here.

>assured with affinity to one node or several nodes with known
>same frequency...

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-03 22:23   ` Dan Magenheimer
  2009-04-05  7:56     ` Keir Fraser
@ 2009-04-05 12:59     ` Tian, Kevin
  2009-04-06 14:41       ` Dan Magenheimer
  1 sibling, 1 reply; 18+ messages in thread
From: Tian, Kevin @ 2009-04-05 12:59 UTC
  To: Dan Magenheimer, Jeremy Fitzhardinge
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail)

>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] 
>Sent: April 4, 2009 6:23
>
>I think I still have a real concern here.  Let me see if
>I can explain.
>
>The goal for Xen timekeeping is to ensure that if a guest
>could somehow magically read any of its virtual clocks
>(tsc, pit, hpet, pmtimer, ??) on all its virtual processors
>simultaneously, the values read must always obey this
>"virtual clock law":
>
>   max - min < delta
>
>We can argue how large that delta can reasonably be and it
>may vary depending on what the workload is, but
>it's certainly under a millisecond, ten microseconds
>might not be a bad starting point, and it is getting
>smaller as processors get faster.
>
>If xen can't guarantee that, then it must turn on "numa"
>mode, which appears to me to be extremely restrictive
>and no system vendor could honestly sell the true
>promise of virtualization on such a box.  So we'd like
>to avoid that if possible.

I have also heard the concern that completely random load balancing
may work suboptimally on large-scale systems because of fierce
contention on shared data structures, so some coarse-grained soft
partitioning or limitation is welcome to ensure accurate control
over the resources assigned to a given VM and to avoid cross-node
traffic where possible. In that case enabling 'numa' could serve
the purpose to some extent: it simply confines a given VM's
activity to one node, while still allowing administrative tools
to move the VM across nodes at their discretion. I once heard that
typical deployed VMs nowadays are provisioned with 1-4 vcpus,
which normally fit in one node. But this may not be true in all
cases.

Well, my point is a bit off topic here. Of course your
concern about cross-node TSC variance still makes sense
whether or not node affinity is enforced, as long as a VM
may be migrated across nodes. My point is just that turning
on 'numa' is really not an 'extremely restrictive' thing. :-)

>
>Note that the Linux approach doesn't work here
>because: 1) a guest's clocks might obey the "virtual clock
>law" at one moment on one set of physical processors
>and not at the next moment; 2) a guest's access to all
>clocks (except the tsc) is emulated so even if a guest
>decides the tsc is unreliable, that just doesn't help
>if the alternate clock it chooses (e.g. HPET) is silently
>emulated on top of xen system time using the physical tsc.

As Keir said, Xen system time itself is implemented in
a stable way, so as long as HVM timer virtualization
ultimately goes through the emulation path, it should be stable too,
at the cost of some overhead on top of the current tsc virtualization path.

Thanks,
Kevin

* Re: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05 12:17       ` Tian, Kevin
@ 2009-04-05 13:27         ` Keir Fraser
  2009-04-05 13:37           ` Tian, Kevin
  0 siblings, 1 reply; 18+ messages in thread
From: Keir Fraser @ 2009-04-05 13:27 UTC
  To: Tian, Kevin, Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail)

On 05/04/2009 13:17, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> One concern I have however, is Intel's
>> X86_FEATURE_CONSTANT_TSC logic. This
>> was added by them to prevent TSCs from diverging due to Cx deep sleep
>> states, by observing that usually all TSCs will tick at the
>> same exact rate,
> 
> One correction here: the constant-TSC logic was introduced for
> P-states rather than C-states, so that the TSC always steps at a
> constant pace on a given processor, regardless of which
> operating point the cpufreq governor requests. It says nothing
> about all TSCs ticking at the same rate, however.

Then changeset 18923 is indeed broken and should be reverted? The problem is
this changeset doesn't just affect the cases it is meant to 'fix' (usage of
C states for CPUs without no-stop TSC). Apart from the fact it can be broken
for systems with that type of CPU as well, it's actually enabled for any
modern CPU (anything advertising the constant-tsc feature). Probably I
shouldn't have checked in that patch in the first place.
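
For reference, on a Linux host the two properties discussed here show up
as separate flags in /proc/cpuinfo: constant_tsc (rate independent of
P-state) and nonstop_tsc (keeps counting in deep C-states). Below is a
minimal sketch that reports them per CPU, assuming the usual
/proc/cpuinfo text layout. The function name tsc_flags is made up for
illustration and is not part of Xen.

```python
def tsc_flags(cpuinfo_text):
    """Return, per logical CPU, which TSC-related flags Linux reports.

    Parses text in the format of Linux /proc/cpuinfo. The result maps
    processor number -> set of the flags of interest that are present.
    """
    wanted = {"constant_tsc", "nonstop_tsc"}
    result = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key = key.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "flags" and cpu is not None:
            result[cpu] = wanted & set(value.split())
    return result
```

It could be called as tsc_flags(open("/proc/cpuinfo").read()). Note that
neither flag claims that TSCs on different sockets tick at the same rate
or hold the same value, which is the point of the correction above.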

 -- Keir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05 13:27         ` Keir Fraser
@ 2009-04-05 13:37           ` Tian, Kevin
  0 siblings, 0 replies; 18+ messages in thread
From: Tian, Kevin @ 2009-04-05 13:37 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail), Wei, Gang

>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>Sent: April 5, 2009 21:28
>
>On 05/04/2009 13:17, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> One concern I have however, is Intel's
>>> X86_FEATURE_CONSTANT_TSC logic. This
>>> was added by them to prevent TSCs from diverging due to Cx deep sleep
>>> states, by observing that usually all TSCs will tick at the
>>> same exact rate,
>> 
>> Here one correction is, that constant tsc logic is introduced for
>> P-states instead of C-states, to have TSC always stepping in
>> constant pace on a given processor, regardless of whatever
>> operating point is being requested by cpufreq governor. It
>> doesn't say anything that all TSCs tick at same rate however.
>
>Then changeset 18923 is indeed broken and should be reverted? The problem is
>this changeset doesn't just affect the cases it is meant to 'fix' (usage of
>C states for CPUs without no-stop TSC). Apart from the fact it can be broken
>for systems with that type of CPU as well, it's actually enabled for any
>modern CPU (anything advertising the constant-tsc feature). Probably I
>shouldn't have checked in that patch in the first place.
>

How about making it a selectable option, instead of reverting
it completely?

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05  7:56     ` Keir Fraser
  2009-04-05 12:17       ` Tian, Kevin
  2009-04-05 12:41       ` Tian, Kevin
@ 2009-04-06 14:34       ` Dan Magenheimer
  2009-04-06 14:48         ` Keir Fraser
  2 siblings, 1 reply; 18+ messages in thread
From: Dan Magenheimer @ 2009-04-06 14:34 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt; +Cc: john.v.morris, Tian, Kevin, Xen-Devel (E-mail)

> > The goal for Xen timekeeping is to ensure that if a guest
> > could somehow magically read any of its virtual clocks
> > (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
> > simultaneously, the values read must always obey this
> > "virtual clock law":
> 
> We can do this for all except TSC for HVM guests because 

I understand that this is true IFF Xen system time itself
obeys the virtual clock law.  I am concerned that maybe
it cannot on machines such as this.  If not, NO HVM guest
clock will obey the law, correct?

> Everything else builds on Xen system time, and Xen system time
> should just require each CPU's TSC to be individually stable.
> ...I think the benefit of your patch was in sync'ing system
> time across all CPUs at the same time, which significantly
> reduced maximum divergence.

The problem was, in our testing on this DL785, the maximum
divergence was not reduced enough!  This was tested with
xen-unstable (not sure what c/s).
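
To make "maximum divergence" concrete, here is a rough sketch. It
assumes each sample is a set of system-time readings (in ns) taken on
all CPUs at nominally the same instant. The sample layout and the
function name are illustrative only; this is not the test harness that
was used on the DL785.

```python
def max_divergence_ns(samples):
    """Worst-case spread between CPUs across all samples.

    samples: iterable of sequences; each sequence holds one system-time
    reading (ns) per CPU, all taken at nominally the same instant.
    Returns the largest (max - min) seen in any single sample, or 0 if
    there are no samples.
    """
    worst = 0
    for readings in samples:
        spread = max(readings) - min(readings)
        worst = max(worst, spread)
    return worst
```

With a small number of samples this is only a lower bound on the true
worst case, so the figure can grow as more samples are collected.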

> One concern I have however, is Intel's 
> X86_FEATURE_CONSTANT_TSC logic.

It's possible that this (or some other problem) has resulted
in the divergence on the DL785.  So more testing is in
order.

Dan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-05 12:59     ` Tian, Kevin
@ 2009-04-06 14:41       ` Dan Magenheimer
  2009-04-06 22:48         ` Tian, Kevin
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Magenheimer @ 2009-04-06 14:41 UTC (permalink / raw)
  To: Tian, Kevin, Jeremy Fitzhardinge; +Cc: john.v.morris, Xen-Devel (E-mail)

> Well, my point is a bit off topic here. Of course your
> concern about cross-node TSC variance still makes sense
> whether or not node affinity is enforced, as long as VM is
> possibly migrated cross-nodes. My point is just that turning
> on 'numa' itself is really not an 'extremely restrictive'
> thing. :-)

Hi Kevin --

I think numa-mode is extremely restrictive because
it makes a 32-way box work like eight 4-way blades.

I think the whole point of HT/QPI is to reduce the
memory latency enough so that a NUMA box does not
look like a NUMA box.  If time synchronization fails
so that this type of box is forced to be partitioned,
the value of HT/QPI is greatly diminished (at least
in a virtualization environment).

Dan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Time skew on HP DL785 (and possibly other boxes)
  2009-04-06 14:34       ` Dan Magenheimer
@ 2009-04-06 14:48         ` Keir Fraser
  0 siblings, 0 replies; 18+ messages in thread
From: Keir Fraser @ 2009-04-06 14:48 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt
  Cc: john.v.morris@hp.com, Tian, Kevin, Xen-Devel (E-mail)

On 06/04/2009 15:34, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> One concern I have however, is Intel's
>> X86_FEATURE_CONSTANT_TSC logic.
> 
> It's possible that this (or some other problem) has resulted
> in the divergence on the DL785.  So more testing is in
> order.

The Intel patch is enabled only via a command-line option as of c/s 19506.

 -- Keir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Time skew on HP DL785 (and possibly other boxes)
  2009-04-06 14:41       ` Dan Magenheimer
@ 2009-04-06 22:48         ` Tian, Kevin
  0 siblings, 0 replies; 18+ messages in thread
From: Tian, Kevin @ 2009-04-06 22:48 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge
  Cc: john.v.morris@hp.com, Xen-Devel (E-mail)

>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] 
>Sent: April 6, 2009 22:41
>
>> Well, my point is a bit off topic here. Of course your
>> concern about cross-node TSC variance still makes sense
>> whether or not node affinity is enforced, as long as VM is
>> possibly migrated cross-nodes. My point is just that turning
>> on 'numa' itself is really not an 'extremely restrictive'
>> thing. :-)
>
>Hi Kevin --
>
>I think numa-mode is extremely restrictive because
>it makes a 32-way box work like eight 4-way blades.

Virtualization is itself a form of partitioning, with each VM
representing one working set. Most VMs deployed so far do not
need more than a virtual 4-way blade, so the above restriction
is less of a concern. It is then natural to place them within
nodes instead of across nodes.

>
>I think the whole point of HT/QPI is to reduce the
>memory latency enough so that a NUMA box does not
>look like a NUMA box.  If time synchronization fails
>so that this type of box is forced to be partitioned,
>the value of HT/QPI is greatly diminished (at least
>in a virtualization environment).
>

It's orthogonal. The effort to keep reducing memory latency
on a NUMA box doesn't mean there is no observable latency
difference between local and remote memory.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 18+ messages in thread

Thread overview: 18+ messages
2009-03-27 20:49 Time skew on HP DL785 (and possibly other boxes) Dan Magenheimer
2009-03-27 22:36 ` Jeremy Fitzhardinge
2009-04-03 22:23   ` Dan Magenheimer
2009-04-05  7:56     ` Keir Fraser
2009-04-05 12:17       ` Tian, Kevin
2009-04-05 13:27         ` Keir Fraser
2009-04-05 13:37           ` Tian, Kevin
2009-04-05 12:41       ` Tian, Kevin
2009-04-05 12:43         ` Tian, Kevin
2009-04-06 14:34       ` Dan Magenheimer
2009-04-06 14:48         ` Keir Fraser
2009-04-05 12:59     ` Tian, Kevin
2009-04-06 14:41       ` Dan Magenheimer
2009-04-06 22:48         ` Tian, Kevin
2009-03-28  2:29 ` Tian, Kevin
2009-03-31 22:08   ` Dan Magenheimer
2009-03-31 22:48     ` Tian, Kevin
2009-03-31 23:21       ` Dan Magenheimer
