[RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC

* [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
@ 2009-10-02 17:51 Dan Magenheimer
  2009-10-07 21:07 ` Dan Magenheimer
  0 siblings, 1 reply; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-02 17:51 UTC (permalink / raw)
  To: Xen-Devel (E-mail); +Cc: kurt.hackel, Ian Pratt, Keir Fraser

=============
Premise 1:  A large and growing percentage of servers
running Xen have a "reliable" TSC and Xen can determine
conclusively whether a server does or does not have a
reliable TSC.
=============

The truth of this statement has been vociferously
challenged in other threads, so I'd LOVE TO GET
FEEDBACK OR CONFIRMATION FROM PROCESSOR AND SERVER
VENDORS.

The rest of this is long though hopefully educational,
but if you have no interest in the rdtsc instruction
or timestamping, please move on to [2 of 4].

Since my overall premise is a bit vague, I need to
first very clearly define my terms.  And to define
those terms clearly, I need to provide some more
background.  As far as I can find, there is no
publication which clearly describes all of these
concepts.

The rdtsc instruction was at one time the easiest
and cheapest and most precise method for "approximating
the passage of time"; as such rdtsc was widely
used by x86 performance practitioners and high-end
apps that needed to provide extensive metrics.  When
commodity SMP x86 systems emerged, rdtsc fell into
disfavor because: (a) it was difficult for
different CPU packages to share a crystal or
ensure different crystals were synchronized or
increasing at precisely the same rate, and
(b) SMP apps were oblivious to which CPU their
thread(s) were running on so two rdtsc instructions
in the same thread might execute on different
CPUs and thus unwittingly use different crystals,
resulting in strange things like the appearance that
time went backwards (sometimes by a large amount)
or events appearing to take different amounts of
time depending on whether they were running on
processor A or processor B.  We will call this
the "inconsistent TSC" problem.

Processor and system vendors attempted to fix the
inconsistent TSC problem by providing a new class
of "platform timers" (e.g. HPET), but these proved
to be slow and difficult to use, especially for
apps that required frequent fine metrics.

Processor and system vendors eventually figured out
how to synchronize TSC with the same crystal, but
then a new set of problems emerged: Power features
sometimes caused the clock on one processor to
slow down or even stop, thus destroying the synchrony
with other processors.  This was fixed first
by ensuring that the tick rate did not change
("constant TSC") and later that it did not stop
("nonstop TSC"), unless ALL of the TSCs on all of
the processors stopped.  Nearly all of the most recent
generations of server processors support these
capabilities, so as a result on most recent servers,
the TSC on all processors/cores/sockets is driven by
the same crystal, always ticks at the same rate,
and doesn't stop independently of other processors'
TSCs.  This is what we call a "reliable TSC".

But we're not done yet.  What does a reliable TSC
provide?  We need to define a few more terms.

A "perfect TSC" would be one where a magic logic
analyzer with a cesium clock could confirm that
the TSCs on every processor increment at precisely
the same femtosecond.  Both the speed of light
and the pricing models of commodity processors
make a perfect TSC unlikely :-)

How close is good enough?  We define two TSCs
as being "unobservably different" if code running
on the two processors can never see time going
backwards, because the difference between their
TSCs is smaller than the memory access overhead
due to cache synchronization. (This is sometimes
called a "cache bounce".) To wit, suppose processor
A does a rdtsc and writes the result into memory;
meanwhile processor B is spinning until it sees that the
memory location has changed, reads A's value
from memory and then does its own rdtsc.  If
B's rdtsc is NEVER less than OR equal to A's rdtsc,
we will call this an "optimal TSC".
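
To make the test above concrete, here is a minimal sketch in C.
It is not taken from Xen or Linux; the names read_counter,
writer_a and count_violations are made up for illustration.
It assumes an x86-64 compiler that provides <x86intrin.h> (a
clock_gettime stand-in is used elsewhere so it still builds),
and it does not pin the two threads, so for a meaningful result
they would have to be bound to different processors (for
example with sched_setaffinity on Linux).

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* lfence keeps rdtsc from executing ahead of the preceding load. */
static inline uint64_t read_counter(void)
{
    _mm_lfence();
    return __rdtsc();
}
#else
#include <time.h>
/* Stand-in (assumption) so the sketch builds on non-x86 hosts. */
static inline uint64_t read_counter(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

struct bounce {
    _Atomic uint64_t slot;  /* 0 = empty, otherwise A's latest stamp */
    unsigned rounds;
};

/* Processor A: take a stamp, publish it, wait until B has consumed it. */
static void *writer_a(void *arg)
{
    struct bounce *b = arg;
    for (unsigned i = 0; i < b->rounds; i++) {
        while (atomic_load(&b->slot) != 0)
            sched_yield();
        atomic_store(&b->slot, read_counter());
    }
    return NULL;
}

/*
 * Processor B (the caller): spin until A's stamp appears, then take
 * its own stamp. Returns the number of rounds in which B's stamp was
 * less than or equal to A's, or UINT64_MAX if A could not be started.
 */
uint64_t count_violations(unsigned rounds)
{
    struct bounce b = { .rounds = rounds };
    pthread_t a;
    uint64_t bad = 0;

    atomic_init(&b.slot, 0);
    if (pthread_create(&a, NULL, writer_a, &b) != 0)
        return UINT64_MAX;

    for (unsigned i = 0; i < rounds; i++) {
        uint64_t stamp_a, stamp_b;
        while ((stamp_a = atomic_load(&b.slot)) == 0)
            ;  /* spin: the cache line bounces from A to B */
        stamp_b = read_counter();
        if (stamp_b <= stamp_a)
            bad++;
        atomic_store(&b.slot, 0);
    }
    pthread_join(a, NULL);
    return bad;
}
```

A result of zero over many rounds is only evidence, not proof,
that the two TSCs are unobservably different; a nonzero result
shows that they are not.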

A reliable TSC is not guaranteed to be optimal;
it may just be very close to optimal, meaning
the difference between two TSCs may sometimes
be observable but it will always be very small.
(As far as I know, processor and server vendors
will not guarantee exactly how small.)  To simulate
an optimal TSC with a reliable TSC, a software
wrapper can be placed around the reads from a
reliable TSC to catch and "fix" the rare
circumstances where time goes backwards.
If this wrapper ensures that time never goes
backwards AND ensures that time always moves
forward, we call this a monotonically-increasing
wrapper.  If it instead ensures that time never
goes backwards AND may appear to stop, we call
this a monotonically-non-decreasing wrapper.
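
As a rough sketch of the two wrapper flavors (my own
illustration, not existing Xen or Linux code; struct mono_state
and both function names are hypothetical), the raw TSC reading
is passed in and compared against the last value handed out,
using a C11 compare-and-swap so that concurrent callers on
different processors agree on that last value:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Last value handed out, shared by all processors (hypothetical). */
struct mono_state {
    _Atomic uint64_t last;
};

/*
 * Monotonically-non-decreasing: a stale reading is clamped to the
 * last value returned, so time may appear to stop.
 */
uint64_t mono_nondecreasing(struct mono_state *s, uint64_t raw)
{
    uint64_t last = atomic_load(&s->last);
    while (raw > last) {
        /* On failure, 'last' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&s->last, &last, raw))
            return raw;
    }
    return last;
}

/*
 * Monotonically-increasing: every call returns a value strictly
 * greater than any value returned before it.
 */
uint64_t mono_increasing(struct mono_state *s, uint64_t raw)
{
    uint64_t last = atomic_load(&s->last);
    for (;;) {
        uint64_t next = raw > last ? raw : last + 1;
        if (atomic_compare_exchange_weak(&s->last, &last, next))
            return next;
    }
}
```

Note that the shared "last" word is itself a cache line that
bounces between processors, so a wrapper like this trades some
of rdtsc's speed for the ordering guarantee.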

Note also that a reliable TSC is not guaranteed
to never stop; it is just guaranteed that if
the TSC on one processor is stopped, the TSC on
all other processors will also be stopped.  As
a result, a reliable TSC cannot be used as
a wallclock, at least without other software
support that can properly adjust the TSC on all
processors when all processors awaken.

Last, there is the issue of whether or not Xen can
conclusively determine if the TSC is reliable.
This is still an open challenge.  There exists
a CPUID bit which purports to do this, but it
is not known with certainty if there are exceptions.
Notably, there is concern if certain newer
larger NUMA servers will truly provide a reliable
TSC across all system processors even if the
CPUID bit on each CPU package says the package
does provide a reliable TSC.  One large server vendor
claims that this is not a problem anymore, but
ideally we would like to test this dynamically
and there is GPL code available to do exactly
that.  This code is used in Linux in some
circumstances once at boot-time to test for
an "optimal TSC".  But in some cases the CPUID
bit defuses this test.  And in any case a boottime
test may not catch all problems, such as a
power event that doesn't handle TSC quite properly.
So without some form of ongoing post-boottime
test, we just don't know.

* RE: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-02 17:51 [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC Dan Magenheimer
@ 2009-10-07 21:07 ` Dan Magenheimer
  2009-10-08  6:45   ` Keir Fraser
  2009-10-08  9:13   ` Tim Deegan
  0 siblings, 2 replies; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-07 21:07 UTC (permalink / raw)
  To: Xen-Devel (E-mail)
  Cc: kurt.hackel, Ian Pratt, Keir Fraser, Jeremy Fitzhardinge

FYI, I finally found a published source describing
the TSC Invariant bit in Nehalem.  See 2.2.6 in:

http://www.intel.com/Assets/PDF/appnote/241618.pdf

"In the Core i7 AND FUTURE PROCESSOR GENERATIONS
[my emphasis] the TSC will continue to run in the
deepest C-states.  Therefore, the TSC will run at
a constant rate in all ACPI P-, C-, and T-states.
Support for this feature is indicated by
CPUID.0x8000_0007.EDX[8].  On processors with
invariant TSC support, the OS may use the TSC
for wall clock timer services (instead of ACPI
or HPET timers).  TSC reads are much more efficient
and do not incur the overhead associated with a
ring transition or access to a platform resource."

Linux upstream now does exactly that; if this
bit is set (on Intel processors), tsc is utilized
as the system clocksource and afaict there
is NO path that will test or revert this
decision.
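
For reference, testing the bit quoted above could look roughly
like the following sketch. It assumes a GCC- or Clang-style
<cpuid.h> with __get_cpuid(); the function and macro names are
my own and are not the ones Xen or Linux use. It only reports
what the executing package claims and says nothing about
whether TSCs are synchronized across sockets.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_LEAF_APM     0x80000007u  /* advanced power management leaf */
#define EDX_INVARIANT_TSC  (1u << 8)    /* CPUID.0x8000_0007.EDX[8] */

/* Decode the bit from an EDX value already read from that leaf. */
bool edx_has_invariant_tsc(uint32_t edx)
{
    return (edx & EDX_INVARIANT_TSC) != 0;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
/* Query the executing CPU; false if the leaf is not implemented. */
bool cpu_has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(CPUID_LEAF_APM, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_invariant_tsc(edx);
}
#else
/* Not an x86 build: there is no such CPUID leaf to query. */
bool cpu_has_invariant_tsc(void)
{
    return false;
}
#endif
```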

Admittedly, this doesn't guarantee that a multi-socket
platform obeys invariance, but apparently this
feature utilizes a crystal available externally
to the socket so it is easy to leverage in a
system design to ensure invariance across
multiple sockets, or even across multiple enclosures
that are all on a QPI link.  So system designers
(other than perhaps for the very largest superNUMA
machines) would be silly to not use it.

So, I'd recommend that:

1) On (Intel, maybe later AMD) systems where this
   bit is set, the mechanisms enabled by the
   Xen consistent_tscs boot option should be enabled
   automatically for Xen.
2) The time_calibration_tsc_rendezvous loop in
   timer.c could/should be rewritten or removed
   and certainly should NOT write_tsc().

Keir, I know you are very sensitive around
this code, so thought I'd check before messing
with it.  Or feel free to do it yourself.

Thanks,
Dan

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-07 21:07 ` Dan Magenheimer
@ 2009-10-08  6:45   ` Keir Fraser
  2009-10-08  6:54     ` Keir Fraser
  2009-10-08  9:13   ` Tim Deegan
  1 sibling, 1 reply; 17+ messages in thread
From: Keir Fraser @ 2009-10-08  6:45 UTC (permalink / raw)
  To: Dan Magenheimer, Xen-Devel (E-mail)
  Cc: kurt.hackel@oracle.com, Ian Pratt, Jeremy Fitzhardinge

On 07/10/2009 22:07, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> So, I'd recommend that:
> 
> 1) On (Intel, maybe later AMD) systems where this
>    bit is set, the mechanisms enabled by the
>    Xen consistent_tscs boot option should be enabled
>    automatically for Xen.
> 2) The time_calibration_tsc_rendezvous loop in
>    timer.c could/should be rewritten or removed
>    and certainly should NOT write_tsc().
> 
> Keir, I know you are very sensitive around
> this code, so thought I'd check before messing
> with it.  Or feel free to do it yourself.

Feel free to make a patch.

 K.

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-08  6:45   ` Keir Fraser
@ 2009-10-08  6:54     ` Keir Fraser
  0 siblings, 0 replies; 17+ messages in thread
From: Keir Fraser @ 2009-10-08  6:54 UTC (permalink / raw)
  To: Dan Magenheimer, Xen-Devel (E-mail)
  Cc: kurt.hackel@oracle.com, Ian Pratt, Jeremy Fitzhardinge

On 08/10/2009 07:45, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> 1) On (Intel, maybe later AMD) systems where this
>>    bit is set, the mechanisms enabled by the
>>    Xen consistent_tscs boot option should be enabled
>>    automatically for Xen.
>> 2) The time_calibration_tsc_rendezvous loop in
>>    timer.c could/should be rewritten or removed
>>    and certainly should NOT write_tsc().
>> 
>> Keir, I know you are very sensitive around
>> this code, so thought I'd check before messing
>> with it.  Or feel free to do it yourself.
> 
> Feel free to make a patch.

At least, make a patch for (1). I don't think (2) can be easily removed in
all cases. For example, Intel's method for rate-invariant TSC which stops on
deep sleeps does involve rewriting TSC values to forcibly keep them in sync.
Perhaps change code to never write_tsc() just in the case of TSC_RELIABLE,
or whatever you call it? Or perhaps just do (1) for now.

 -- Keir

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-07 21:07 ` Dan Magenheimer
  2009-10-08  6:45   ` Keir Fraser
@ 2009-10-08  9:13   ` Tim Deegan
  2009-10-08  9:22     ` Keir Fraser
  1 sibling, 1 reply; 17+ messages in thread
From: Tim Deegan @ 2009-10-08  9:13 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Fitzhardinge, Xen-Devel (E-mail), Jeremy, kurt.hackel@oracle.com,
	Ian Pratt, Keir Fraser

At 22:07 +0100 on 07 Oct (1254953275), Dan Magenheimer wrote:
> Admittedly, this doesn't guarantee that a multi-socket
> platform obeys invariance, but apparently this
> feature utilizes a crystal available externally
> to the socket so it is easy to leverage in a
> system design to ensure invariance across
> multiple sockets, or even across multiple enclosures
> that are all on a QPI link.  So system designers
> (other than perhaps for the very largest superNUMA
> machines) would be silly to not use it.

Oh, that's reassuring.  System designers would never do something that
silly.  :)

If linux relies on it, that's a good sign, but surely we shouldn't get
rid of any existing correction mechanisms. 

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-08  9:13   ` Tim Deegan
@ 2009-10-08  9:22     ` Keir Fraser
  2009-10-08 16:24       ` Dan Magenheimer
  0 siblings, 1 reply; 17+ messages in thread
From: Keir Fraser @ 2009-10-08  9:22 UTC (permalink / raw)
  To: Tim Deegan, Dan Magenheimer
  Cc: kurt.hackel@oracle.com, Ian Pratt, Xen-Devel (E-mail),
	Jeremy Fitzhardinge

On 08/10/2009 10:13, "Tim Deegan" <Tim.Deegan@eu.citrix.com> wrote:

> So system designers
>> (other than perhaps for the very largest superNUMA
>> machines) would be silly to not use it.
> 
> Oh, that's reassuring.  System designers would never do something that
> silly.  :)
> 
> If linux relies on it, that's a good sign, but surely we shouldn't get
> rid of any existing correction mechanisms.

I think at the very least this new 'reliable tsc' mode must be self
contained, not impact the existing modes, and continue to be switchable via
a boot parameter.

 -- Keir

* RE: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-08  9:22     ` Keir Fraser
@ 2009-10-08 16:24       ` Dan Magenheimer
  2009-10-09  9:34         ` Tim Deegan
  0 siblings, 1 reply; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-08 16:24 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan
  Cc: kurt.hackel, Ian Pratt, Xen-Devel (E-mail), Jeremy, Fitzhardinge

> On 08/10/2009 10:13, "Tim Deegan" <Tim.Deegan@eu.citrix.com> wrote:
> 
> > So system designers
> >> (other than perhaps for the very largest superNUMA
> >> machines) would be silly to not use it.
> > 
> > Oh, that's reassuring.  System designers would never do 
> > something that silly.  :)

Tongue-in-cheek noted. ;-)  But seriously, what I'm proposing
is that now that this is architected by the processor, poorly
designed systems (or extremely large systems) should be the rare
exception, not the rule.  Specifically I'm proposing that
(at least for Intel... AMD TBD) if the architectural bit is
set Xen should trust it by default, but provide a boot-time
parameter (e.g. "tsc_broken") to override the default for
any rare poorly-designed or superNUMA systems.

> > If linux relies on it, that's a good sign, but surely we
> > shouldn't get rid of any existing correction mechanisms.

Unfortunately, Xen has no existing detection mechanism so
also has no existing correction mechanism.  Xen currently
blindly assumes tsc is wrong and overwrites all tscs at
boottime, after deep C-state, and at 1Hz if the boottime
consistent_tscs option is set.

> I think at the very least this new 'reliable tsc' mode must be self
> contained, not impact the existing modes, and continue to be 
> switchable via a boot parameter.

OK, let me suggest the following taxonomy of tsc "safeness":

A) unsafe (neither constant nor power-invariant)
B) semi-safe (constant = P-,T-state invariant, C-state may stop)
C) safe (constant+non-stop = P-,T-,and C-state invariant)
D) false-positive safe (CPUs safe, system-wide is not)

Xen currently assumes A.  This is sufficient for Xen's needs,
and for the pvclock algorithm, but insufficient for my
plans to expose "TSC reliability" to usermode.

B (constant) is now determined in Xen by checking family ids
but only used to override consistent_tscs if constant is
NOT set.

C is architecturally-defined by a cpuid bit but Xen doesn't
currently use it.  Intel guarantees TSC invariance across
P-, T-, and C-states when it is set (AMD TBD).

I'm proposing that:
1) for case C, Xen shall never overwrite TSC
2) for case D, a new "tsc_broken" boot option must be specified
   when Xen is booted on a broken machine
3) for case B, always use it when the hardware supports it
   (unless overridden by "tsc_broken")
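
Expressed as code, the taxonomy and points 1) and 2) might
look like the sketch below. All identifiers are hypothetical
(none of this exists in Xen today), and two choices are my own
assumptions rather than part of the proposal: a processor that
is non-stop but not constant is treated as unsafe, and
"tsc_broken" on a constant-only machine demotes it to unsafe.

```c
#include <stdbool.h>

/* The four cases A..D from the taxonomy above. */
enum tsc_class {
    TSC_UNSAFE,          /* A: neither constant nor power-invariant */
    TSC_SEMI_SAFE,       /* B: constant, but may stop in C-states   */
    TSC_SAFE,            /* C: constant and non-stop                */
    TSC_FALSE_POSITIVE   /* D: CPUs claim safe, system-wide is not  */
};

/*
 * constant/nonstop come from the hardware (family checks or the
 * CPUID bit); tsc_broken is the proposed boot option.
 */
enum tsc_class classify_tsc(bool constant, bool nonstop, bool tsc_broken)
{
    if (constant && nonstop)
        return tsc_broken ? TSC_FALSE_POSITIVE : TSC_SAFE;
    if (constant && !tsc_broken)
        return TSC_SEMI_SAFE;
    return TSC_UNSAFE;   /* assumption: everything else is case A */
}

/* Point 1: in case C, Xen never overwrites the TSC. */
bool xen_may_write_tsc(enum tsc_class c)
{
    return c != TSC_SAFE;
}
```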

We are also investigating whether the write_tsc() in
the cstate recovery code obviates the need for the
write_tsc in time_calibration_tsc_rendezvous.

Comments?

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-08 16:24       ` Dan Magenheimer
@ 2009-10-09  9:34         ` Tim Deegan
  2009-10-09 14:38           ` Dan Magenheimer
  2009-10-09 20:28           ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 17+ messages in thread
From: Tim Deegan @ 2009-10-09  9:34 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Fitzhardinge, Xen-Devel (E-mail), Jeremy, kurt.hackel@oracle.com,
	Ian Pratt, Keir Fraser

At 17:24 +0100 on 08 Oct (1255022685), Dan Magenheimer wrote:
> Tongue-in-cheek noted. ;-)  But seriously, what I'm proposing
> is that now that this is architected by the processor, poorly
> designed systems (or extremely large systems) should be the rare
> exception, not the rule. 

That seems like unwarranted optimism, but we'll just have to wait and
see.  I've seen enough bugs that boiled down to reputable system
builders doing things that software engineers thought would surely never
happen.

> A) unsafe (neither constant nor power-invariant)
> B) semi-safe (constant = P-,T-state invariant, C-state may stop)
> C) safe (constant+non-stop = P-,T-,and C-state invariant)
> D) false-positive safe (CPUs safe, system-wide is not)

OK; for the record I believe C should be assumed to be D.  

> Xen currently assumes A. 

That's what I meant by detection and correction.

> This is sufficient for Xen's needs,
> and for the pvclock algorithm, but insufficient for my
> plans to expose "TSC reliability" to usermode.

Your plans for usermode<-->hypervisor direct TSC integration seem to me
to be an unpleasant hack.  I understand that you have good business
reasons for wanting it (even if you're not allowed to tell us explicitly
what they are) and we've seen the justifications enough times that we
don't need to cover them again here, but it's still a hack.

I'm unhappy with the idea of kicking around the Xen timekeeping code
(and introducing the usual bug-tail) to support introducing a usermode
TSC.  If there is to be a new mode for this, it should default to the
current (works for everyone except the engineering team of a
not-to-be-named enterprise application) behaviour.

> I'm proposing that:
> 1) for case C, Xen shall never overwrite TSC
> 2) for case D, a new "tsc_broken" boot option must be specified
>    when Xen is booted on a broken machine

Might as well call it "application_broken" and default it the other
way. :)  The system builders are entirely within their rights to have
separate clocks for separate sockets.

Cheers,

Tim.

> 3) for case B, always use it when the hardware supports it
>    (unless overridden by "tsc_broken")
> 
> We are also investigating whether the write_tsc() in
> the cstate recovery code obviates the need for the
> write_tsc in time_calibration_tsc_rendezvous.
> 
> Comments?
> 

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

* RE: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-09  9:34         ` Tim Deegan
@ 2009-10-09 14:38           ` Dan Magenheimer
  2009-10-12  9:51             ` Tim Deegan
  2009-10-09 20:28           ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-09 14:38 UTC (permalink / raw)
  To: Tim Deegan
  Cc: kurt.hackel, Fitzhardinge, Xen-Devel (E-mail), Ian Pratt,
	Keir Fraser

Hi Tim --

Thanks for your comments!

> At 17:24 +0100 on 08 Oct (1255022685), Dan Magenheimer wrote:
> > Tongue-in-cheek noted. ;-)  But seriously, what I'm proposing
> > is that now that this is architected by the processor, poorly
> > designed systems (or extremely large systems) should be the rare
> > exception, not the rule. 
> 
> That seems like unwarranted optimism, but we'll just have to wait and
> see.  I've seen enough bugs that boiled down to reputable system
> builders doing things that software engineers thought would 
> surely never happen.

Well, app providers have been beating up on processor and
system vendors for years to "fix the d*mn timestamp problem".
They finally have, and have even made it architectural.

I can think of one large enterprise software provider
that would gladly redlist systems that regress in this
area.

So color me optimistic that the problem is solved or
at least that system vendors will only sin for a very
good reason; and their indiscretions will be public enough
that their need for a special boottime Xen option will not
be a closely-guarded secret.

Now all I'm trying to do is ensure that Xen virtual machines
don't suffer their own "d*mn timestamp problem", especially
given that VMware doesn't have one.

> > A) unsafe (neither constant nor power-invariant)
> > B) semi-safe (constant = P-,T-state invariant, C-state may stop)
> > C) safe (constant+non-stop = P-,T-,and C-state invariant)
> > D) false-positive safe (CPUs safe, system-wide is not)
> 
> OK; for the record I believe C should be assumed to be D.  

What?!? And waste all that hard work by processor and
system vendors to finally fix the problem? ;-)

I admit that I have some reservations as well, so would
like Xen to verify "safeness" at each boot, and
preferably periodically for the life of the system.
Verification turns out to be quite ugly though,
and probably even more so for those superNUMA
systems that might be most likely to fail the test.

> > Xen currently assumes A. 
> 
> That's what I meant by detection and correction.

IMHO, the road to software performance hell is paved with
least-common-denominator solutions.  (And, yes, to
take the words right out of your mouth before you
say them, the road to software maintenance hell is paved
with never-used special cases.)

> > This is sufficient for Xen's needs,
> > and for the pvclock algorithm, but insufficient for my
> > plans to expose "TSC reliability" to usermode.
> 
> Your plans for usermode<-->hypervisor direct TSC integration 
> seem to me to be an unpleasant hack.

Yes, I admit it offends my aesthetics some.  But I defend
it to myself by believing that this is just a first step
in a long road of closer collaboration between hypervisor
and apps.  Really the whole point of paravirtualization
is to benefit from knowing that the underlying platform
is virtual.  Why should apps be excluded from the party?

> I understand that you have good business
> reasons for wanting it (even if you're not allowed to tell us
> explicitly what they are) and we've seen the justifications enough
> times that we don't need to cover them again here, but it's still
> a hack.

I think I've been very explicit: Some very large apps, both
Oracle and non-Oracle, need a way to get a timestamp
at a high frequency in a way that is both correct and
very fast and works across a range of hardware/software
environments, INCLUDING running under Xen.

I AM exposed to some other companies' confidential
information, so any appearance that I am hiding something
is due to my clumsy attempts to dance around that
in a public forum.

> I'm unhappy with the idea of kicking around the Xen timekeeping code
> (and introducing the usual bug-tail) to support introducing a usermode
> TSC.  If there is to be a new mode for this, it should default to the
> current (works for everyone except the engineering team of a
> not-to-be-named enterprise application) behaviour.

This isn't a new mode, it's a new (not-so-new for AMD)
hardware feature that Xen has yet to make proper use of.

And I'm not introducing a usermode TSC... Intel did that
years ago.

And if, by "new mode" you're referring to rdtsc emulation,
that's certainly not for Oracle's benefit.

> > I'm proposing that:
> > 1) for case C, Xen shall never overwrite TSC
> > 2) for case D, a new "tsc_broken" boot option must be specified
> >    when Xen is booted on a broken machine
> 
> Might as well call it "application_broken" and default it the other
> way. :)  The system builders are entirely within their rights to have
> separate clocks for separate sockets.

If you agree with Jeremy's opinion that "any app that uses
rdtsc is fundamentally broken", your syntax makes sense.
As you know, I disagree, especially as it applies to future
hardware and software.

Dan

P.S. I'll have infrequent access to email for the next week.

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-09  9:34         ` Tim Deegan
  2009-10-09 14:38           ` Dan Magenheimer
@ 2009-10-09 20:28           ` Jeremy Fitzhardinge
  2009-10-09 21:35             ` Dan Magenheimer
  1 sibling, 1 reply; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2009-10-09 20:28 UTC (permalink / raw)
  To: Tim Deegan
  Cc: kurt.hackel@oracle.com, Dan Magenheimer, Xen-Devel (E-mail),
	Ian Pratt, Keir Fraser

On 10/09/09 02:34, Tim Deegan wrote:
> Your plans for usermode<-->hypervisor direct TSC integration seem to me
> to be an unpleasant hack.  I understand that you have good business
> reasons for wanting it (even if you're not allowed to tell us explicitly
> what they are) and we've seen the justifications enough times that we
> don't need to cover them again here, but it's still a hack.
>
> I'm unhappy with the idea of kicking around the Xen timekeeping code
> (and introducing the usual bug-tail) to support introducing a usermode
> TSC.  If there is to be a new mode for this, it should default to the
> current (works for everyone except the engineering team of a
> not-to-be-named enterprise application) behaviour.
>   

I'm seeing an approx 12x performance improvement with gettimeofday() and
clock_gettime() on systems with my vsyscall support patches
(~1200ns/call -> ~100ns[*]).  I think that should go a long way towards
mitigating the performance concerns using standard APIs.  

There's probably some scope for improving those numbers on systems with
better-than-baseline tsc support (ie rdtscp and/or guaranteed synced
tscs), but I think it's enough to get started with, especially given the
broad applicability and relatively simple engineering.

[*] With native tsc; emulated tsc makes that 1700 -> 500, or only ~3.3x
improvement.

    J

* RE: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-09 20:28           ` Jeremy Fitzhardinge
@ 2009-10-09 21:35             ` Dan Magenheimer
  2009-10-10  0:22               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-09 21:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Tim Deegan
  Cc: kurt.hackel, Ian Pratt, Xen-Devel (E-mail), Keir Fraser

Excellent!  This is an extremely important piece
of the puzzle now filled in.

Just for completeness, on your machine, what is
the measurement for raw rdtsc?

(And if anybody believes this is the ONLY piece
of the puzzle that is necessary, I would be happy
to expand further.)

> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> On 10/09/09 02:34, Tim Deegan wrote:
> > Your plans for usermode<-->hypervisor direct TSC 
> integration seem to me
> > to be an unpleasant hack.  I understand that you have good business
> > reasons for wanting it (even if you're not allowed to tell 
> us explicitly
> > what they are) and we've seen the justifications enough 
> times that we
> > don't need to cover them again here, but it's still a hack.
> >
> > I'm unhappy with the idea of kicking around the Xen timekeeping code
> > (and introducing the usual bug-tail) to support introducing 
> a usermode
> > TSC.  If there is to be a new mode for this, it should 
> default to the
> > current (works for everyone except the engineering team of a
> > not-to-be-named enterprise application) behaviour.
> 
> I'm seeing an approx 12x performance improvement with 
> gettimeofday() and
> clock_gettime() on systems with my vsyscall support patches
> (~1200ns/call -> ~100ns[*]).  I think that should go a long 
> way towards
> mitigating the performance concerns using standard APIs.  
> 
> There's probably some scope for improving those numbers on 
> systems with
> better-than-baseline tsc support (ie rdtscp and/or guaranteed synced
> tscs), but I think it's enough to get started with, especially 
> given the
> broad applicability and relatively simple engineering.
> 
> [*] With native tsc; emulated tsc makes that 1700 -> 500, or 
> only ~3.3x
> improvement.
> 
>     J

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-09 21:35             ` Dan Magenheimer
@ 2009-10-10  0:22               ` Jeremy Fitzhardinge
  2009-10-10  2:36                 ` Dan Magenheimer
  0 siblings, 1 reply; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2009-10-10  0:22 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel, Ian Pratt, Xen-Devel (E-mail), Tim Deegan,
	Keir Fraser

On 10/09/09 14:35, Dan Magenheimer wrote:
> Excellent!  This is an extremely important piece
> of the puzzle now filled in.
>
> Just for completeness, on your machine, what is
> the measurement for raw rdtsc?
>   

A naked inline rdtsc is about 30ns, so only about a factor of 3 better. 
Which is a surprisingly small improvement given that the full
gettimeofday path has ~150 instructions, including a couple of
multiplies, quite a few jumps and two "lsl" instructions for vgetcpu
(which each cost about 10ns).  rdtsc is an expensive instruction...

    J

* RE: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-10  0:22               ` Jeremy Fitzhardinge
@ 2009-10-10  2:36                 ` Dan Magenheimer
  2009-10-10  5:55                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-10  2:36 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Ian Pratt, Xen-Devel (E-mail), Tim Deegan,
	Keir Fraser

> On 10/09/09 14:35, Dan Magenheimer wrote:
> > Excellent!  This is an extremely important piece
> > of the puzzle now filled in.
> >
> > Just for completeness, on your machine, what is
> > the measurement for raw rdtsc?
> >   
> 
> A naked inline rdtsc is about 30ns, so only about a factor of 
> 3 better. 
> Which is a surprisingly small improvement given that the full
> gettimeofday path has ~150 instructions, including a couple of
> multiplies, quite a few jumps and two "lsl" instructions for vgetcpu
> (which each cost about 10ns).  rdtsc is an expensive instruction...
> 
>     J

Very nice!

One more measurement if you haven't already torn down
your test environment:   If you are at xen-unstable tip,
with tsc emulation on, please try something like:

for i in {0..100}; do
xm debug-key s; xm dmesg | tail; sleep 1;
done

to get an idea of the number of rdtsc's being
done per second (and also divide by the number
of cores so we have rdtsc's/sec/core).  This is
of course unloaded, so if you have a favorite
load to throw on it, that would be very interesting
also.

(Note that the s debug-key may be slow because
xen is also now running check_tsc_warp each time.)
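
As a rough sketch of that per-core arithmetic (not anything Xen provides): assume two readings of a cumulative rdtsc count, copied by hand from successive dumps. The function name and the sample numbers are hypothetical, and parsing of the dmesg output is left out because its exact format is not shown here.

```python
def rdtsc_rate_per_core(count_start, count_end, interval_sec, cores):
    """Return rdtsc executions per second per core between two cumulative counts."""
    if interval_sec <= 0 or cores <= 0:
        raise ValueError("interval_sec and cores must be positive")
    return (count_end - count_start) / interval_sec / cores


# Hypothetical readings: 12,000 rdtscs over 10 seconds on a 2-core guest.
print(rdtsc_rate_per_core(50_000, 62_000, 10, 2))
```

Because the debug key itself may be slow, the interval is better measured between dumps than assumed to be the 1 second of the sleep.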

Thanks,
Dan

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-10  2:36                 ` Dan Magenheimer
@ 2009-10-10  5:55                   ` Jeremy Fitzhardinge
  2009-10-10  6:35                     ` Keir Fraser
  0 siblings, 1 reply; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2009-10-10  5:55 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel, Ian Pratt, Xen-Devel (E-mail), Tim Deegan,
	Keir Fraser

On 10/09/09 19:36, Dan Magenheimer wrote:
> Very nice!
>
> One more measurement if you haven't already torn down
> your test environment:   If you are at xen-unstable tip,
> with tsc emulation on, please try something like:
>
> for i in {0..100}; do
> xm debug-key s; xm dmesg | tail; sleep 1;
> done
>
> to get an idea of the number of rdtsc's being
> done per second (and also divide by the number
> of cores so we have rdtsc's/sec/core).  This is
> of course unloaded, so if you have a favorite
> load to throw on it, that would be very interesting
> also.
>   

The kernel does between 400k and 1.4M/sec, median around 600k,
for a git pull (which I think is single-threaded), and about
200k-500k/sec for a kernel compile (-j4 on 2 vcpus).  Usermode is a much
lower rate; around 1000/sec for the kernel compile.

Baseline idle is around 1000/sec kernel, 10/sec user.

Also, my inline naked rdtsc benchmark shows that the emulated rdtsc is
taking around 465ns (vs 30, a 15x slowdown).

    J

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-10  5:55                   ` Jeremy Fitzhardinge
@ 2009-10-10  6:35                     ` Keir Fraser
  2009-10-10 14:22                       ` Dan Magenheimer
  0 siblings, 1 reply; 17+ messages in thread
From: Keir Fraser @ 2009-10-10  6:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Dan Magenheimer
  Cc: Tim Deegan, kurt.hackel@oracle.com, Ian Pratt, Xen-Devel (E-mail)

On 10/10/2009 06:55, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> The kernel does between 400k and 1.4M/sec, median around 600k,
> for a git pull (which I think is single-threaded), and about
> 200k-500k/sec for a kernel compile (-j4 on 2 vcpus).  Usermode is a much
> lower rate; around 1000/sec for the kernel compile.
> 
> Baseline idle is around 1000/sec kernel, 10/sec user.
> 
> Also, my inline naked rdtsc benchmark shows that the emulated rdtsc is
> taking around 465ns (vs 30, a 15x slowdown).

Hmmm... So at 600k/sec, the kernel spends an appreciable amount of time
(1-2%) doing RDTSCs? And with emulation that'll be more like 25-30%.

It's quite a surprisingly high rate.
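
A minimal sketch of the arithmetic behind those percentages, using the 600k/sec median rate and the 30ns native and 465ns emulated per-call costs reported earlier in the thread. The function name is made up for illustration, and the model simply assumes every call costs the same.

```python
NS_PER_SEC = 1_000_000_000


def rdtsc_time_fraction(calls_per_sec, ns_per_call):
    """Fraction of each CPU-second spent inside rdtsc at the given rate and cost."""
    return calls_per_sec * ns_per_call / NS_PER_SEC


native = rdtsc_time_fraction(600_000, 30)     # native rdtsc, ~30ns per call
emulated = rdtsc_time_fraction(600_000, 465)  # emulated rdtsc, ~465ns per call

print(f"native:   {native:.1%}")
print(f"emulated: {emulated:.1%}")
```

This gives about 1.8% native and 27.9% emulated, in line with the 1-2% and 25-30% estimates.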

 -- Keir

* RE: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-10  6:35                     ` Keir Fraser
@ 2009-10-10 14:22                       ` Dan Magenheimer
  0 siblings, 0 replies; 17+ messages in thread
From: Dan Magenheimer @ 2009-10-10 14:22 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge
  Cc: Tim Deegan, kurt.hackel, Ian Pratt, Xen-Devel (E-mail)

> On 10/10/2009 06:55, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
> 
> > The kernel does between 400k and 1.4M/sec, median 
> around 600k,
> > for a git pull (which I think is single-threaded), and about
> > 200k-500k/sec for a kernel compile (-j4 on 2 vcpus).  
> Usermode is a much
> > lower rate; around 1000/sec for the kernel compile.
> > 
> > Baseline idle is around 1000/sec kernel, 10/sec user.
> > 
> > Also, my inline naked rdtsc benchmark shows that the 
> emulated rdtsc is
> > taking around 465ns (vs 30, a 15x slowdown).
> 
> Hmmm... So at 600k/sec, the kernel spends an appreciable 
> amount of time
> (1-2%) doing RDTSCs? And with emulation that'll be more like 25-30%.
> 
> It's quite a surprisingly high rate.
> 
>  -- Keir

I'm trying a kernel compile (-j4, 2 vcpus, 2 pcpus)
and seeing about 1300/sec kernel and 500/sec user.
My "idle" rate appears to be about 400/sec (kernel,
and every now and then a handful of user rdtscs).
That's with a cpu-only load.... 

# while true; do i=$((i+1)); done & while true; do i=$((i+1)); done

It seems to be about 100/sec for a truly idle domain.

With an NFS untar I am seeing higher numbers though
(~10K/sec).

(All these loads are on an EL5u2 32-bit PV guest.)

Jeremy, were you maybe measuring per hundred seconds, or
per minute?  Or, on the git pull, maybe your VNIC
throughput is much much higher than mine and there
is a getnstimeofday() call for each packet?

Another scary thought... what is gcc doing using
rdtsc?  Might it be randomly sensitive to the rdtsc
discontinuities one will encounter with migration
and we've just not seen it yet?  In other words,
is gcc a "fundamentally broken" app? ;-)  And what
is that other usermode app (service?) that seems to use
a handful of rdtsc's when the system is cpu-only loaded?
Is it using rdtsc safely?

The time ratio of emulated rdtsc to native rdtsc matches
what I measured on my machine (360 vs 22), so 15x
seems like a safe multiplier estimate to use.
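
A quick check of that multiplier from the two sets of figures in this thread (465 vs 30 on Jeremy's machine, 360 vs 22 on mine). The helper name is only illustrative; the ratio is unitless, so it does not matter what unit each pair is in as long as both numbers in a pair share it.

```python
def slowdown(emulated, native):
    """Ratio of emulated rdtsc cost to native rdtsc cost."""
    return emulated / native


print(round(slowdown(465, 30), 1))  # Jeremy's figures
print(round(slowdown(360, 22), 1))  # Dan's figures
```

Both come out between 15x and 17x, so 15x is a slightly conservative round number.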

Curiouser and curiouser...

Dan

* Re: [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
  2009-10-09 14:38           ` Dan Magenheimer
@ 2009-10-12  9:51             ` Tim Deegan
  0 siblings, 0 replies; 17+ messages in thread
From: Tim Deegan @ 2009-10-12  9:51 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel@oracle.com, Fitzhardinge, Xen-Devel (E-mail),
	Ian Pratt, Keir Fraser

Hi,

At 15:38 +0100 on 09 Oct (1255102732), Dan Magenheimer wrote:
> So color me optimistic that the problem is solved or
> at least that system vendors will only sin for a very
> good reason;

OK, we disagree, that's fine. 

> Now all I'm trying to do is ensure that Xen virtual machines
> don't suffer their own "d*mn timestamp problem", especially
> given that VMware doesn't have one.

Jeremy seems to be taking care of this, AFAICS, using the existing APIs.
That way everybody wins, not just people who are clued in enough to find
and use a new xen-specific API.  Also, AFAICS, without needing changes
to Xen's own timekeeping/TSC code.

There seem to be some other bugs on the HVM side but that's a separate
discussion, I think.

> Yes, I admit it offends my aesthetics some.  But I defend
> it to myself by believing that this is just a first step
> in a long road of closer collaboration between hypervisor
> and apps.  Really the whole point of paravirtualization
> is to benefit from knowing that the underlying platform
> is virtual.  Why should apps be excluded from the party?

In this case I don't think it helps.  The OS should and can provide
a fast and reliable time source to user space without needing a new
hypervisor-to-application API for it, with all the portability and
maintenance fun that that would bring.

> I think I've been very explicit: Some very large apps, both
> Oracle and non-Oracle, need a way to get a timestamp
> at a high frequency in a way that is both correct and
> very fast and works across a range of hardware/software
> environments, INCLUDING running under Xen.

There's a further requirement (which you have mentioned before) that
people are unwilling/unable to accept kernel changes.  I think that's
a bit unreasonable.

> I AM exposed to some other companies' confidential
> information, so any appearance that I am hiding something
> is due to my clumsy attempts to dance around that
> in a public forum.

Understood; I don't blame you for the situation you find yourself in.
But it doesn't change what you're asking for: an unpleasant hack and a
reshuffle of the core timekeeping code to support unnamed third parties.

> > Might as well call it "application_broken" and default it the other
> > way. :)  The system builders are entirely within their rights to have
> > separate clocks for separate sockets.
> 
> If you agree with Jeremy's opinion that "any app that uses
> rdtsc is fundamentally broken", your syntax makes sense.

I'll stick with "many apps that use rdtsc are broken": it's harder than
most people think and it doesn't do what some people want.  (But no, I
wouldn't seriously use that as the option name.)

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

Thread overview: 17+ messages
2009-10-02 17:51 [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC Dan Magenheimer
2009-10-07 21:07 ` Dan Magenheimer
2009-10-08  6:45   ` Keir Fraser
2009-10-08  6:54     ` Keir Fraser
2009-10-08  9:13   ` Tim Deegan
2009-10-08  9:22     ` Keir Fraser
2009-10-08 16:24       ` Dan Magenheimer
2009-10-09  9:34         ` Tim Deegan
2009-10-09 14:38           ` Dan Magenheimer
2009-10-12  9:51             ` Tim Deegan
2009-10-09 20:28           ` Jeremy Fitzhardinge
2009-10-09 21:35             ` Dan Magenheimer
2009-10-10  0:22               ` Jeremy Fitzhardinge
2009-10-10  2:36                 ` Dan Magenheimer
2009-10-10  5:55                   ` Jeremy Fitzhardinge
2009-10-10  6:35                     ` Keir Fraser
2009-10-10 14:22                       ` Dan Magenheimer
