[PATCH 0/2] Improve hpet accuracy

* [PATCH 0/2] Improve hpet accuracy
@ 2008-06-05 14:59 Ben Guthro
  2008-06-06  8:58 ` Keir Fraser
  2008-06-06 15:35 ` Steven Hand
  0 siblings, 2 replies; 51+ messages in thread
From: Ben Guthro @ 2008-06-05 14:59 UTC
  To: xen-devel

1. Introduction

This patch improves the hpet-based guest clock in terms of drift and monotonicity.
Prior to this work the drift with hpet was greater than 2%, far above the .05% limit
for ntp to synchronize. With this code, the drift ranges from .001% to .0033% depending
on guest and physical platform.

Using hpet allows guest operating systems to provide monotonic time to their
applications. Time sources other than hpet are not monotonic because
of their reliance on tsc, which is not synchronized across physical processors.

Windows 2k8-64 and many Linux guests are supported with two policies, one for guests
that handle missed clock interrupts and the other for guests that require the
correct number of interrupts.

Guests may use hpet for the timing source even if the physical platform has no visible
hpet. Migration is supported between physical machines which differ in physical
hpet visibility.

Most of the changes are in hpet.c. Two general facilities are added to track interrupt
progress. The ideas here and the facilities would be useful in vpt.c, for other time
sources, though no attempt is made here to improve vpt.c.

The following sections discuss hpet dependencies, interrupt delivery policies, live migration,
test results, and relation to recent work with monotonic time. 

2. Virtual Hpet dependencies

The virtual hpet depends on the ability to read the physical or simulated
(see discussion below) hpet.  For timekeeping, the virtual hpet also depends
on two new interrupt notification facilities to implement its policies for
interrupt delivery. 

2.1. Two modes of low-level hpet main counter reads.

In this implementation, the virtual hpet reads with read_64_main_counter(), exported by
time.c, either the real physical hpet main counter register directly or a "simulated"
hpet main counter.

The simulated mode uses a monotonic version of get_s_time() (NOW()), where the last
time value is returned whenever the current time value is less than the last time
value. In simulated mode, since it is layered on s_time, the underlying hardware
can be hpet or some other device. The frequency of the main counter in simulated
mode is the same as the standard physical hpet frequency, allowing live migration
between nodes that are configured differently.
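
A minimal sketch of the simulated-mode read described above, for illustration
only. The names monotonic_s_time() and ns_to_ticks() are hypothetical and are
not taken from the patch. The raw time is passed in by the caller so the sketch
can run outside the hypervisor, and the counter frequency is a parameter
because this message does not state its value.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000ULL

/* Hypothetical state for the simulated counter. */
struct sim_counter {
    int64_t last;   /* highest system time (ns) handed out so far */
};

/* Monotonic wrapper: if the raw reading went backwards, return the
 * previously returned value instead of the smaller one. */
static int64_t monotonic_s_time(struct sim_counter *c, int64_t raw_now)
{
    if (raw_now < c->last)
        return c->last;
    c->last = raw_now;
    return raw_now;
}

/* Convert nanoseconds to main-counter ticks at frequency hz.
 * ASSUMPTION: hz is whatever the "standard physical hpet frequency"
 * is on the platform; the value is not given in the source. */
static uint64_t ns_to_ticks(int64_t ns, uint64_t hz)
{
    assert(ns >= 0);
    /* Split into seconds and remainder to avoid 64-bit overflow. */
    uint64_t sec = (uint64_t)ns / NS_PER_SEC;
    uint64_t rem = (uint64_t)ns % NS_PER_SEC;
    return sec * hz + rem * hz / NS_PER_SEC;
}
```

Note that a backwards raw reading yields the same value twice in a row, which
is the repeated-value behaviour mentioned as a disadvantage of simulated mode.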

If the physical platform does not have an hpet device, or if xen is configured not
to use the device, then the simulated method is used. If there is a physical hpet device,
and xen has initialized it, then either simulated or physical mode can be used.
This is governed by a boot time option, hpet-avoid. Setting this option to 1 gives the
simulated mode and 0 the physical mode. The default is physical mode.
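
The selection rule in the paragraph above can be sketched as follows. This is
a model of the described behaviour, not code from the patch: the enum, the
function name select_mc_mode(), and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum mc_read_mode { MC_MODE_PHYSICAL, MC_MODE_SIMULATED };

/* hpet_present:     the platform exposes a physical hpet
 * hpet_initialized: xen is configured to use it and has initialized it
 * hpet_avoid:       value of the boot option (0 = physical, 1 = simulated) */
static enum mc_read_mode select_mc_mode(bool hpet_present,
                                        bool hpet_initialized,
                                        int hpet_avoid)
{
    assert(hpet_avoid == 0 || hpet_avoid == 1);

    /* Without a usable physical device only simulated mode is possible. */
    if (!hpet_present || !hpet_initialized)
        return MC_MODE_SIMULATED;

    /* Otherwise the boot option decides; the default (0) is physical. */
    return hpet_avoid ? MC_MODE_SIMULATED : MC_MODE_PHYSICAL;
}
```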

A disadvantage of the physical mode is that it may take longer to read the device
than in simulated mode. On some platforms the cost is about the same (less than 250 nsec) for
physical and simulated modes, while on others physical cost is much higher than simulated.
A disadvantage of the simulated mode is that it can return the same value
for the counter in consecutive calls.

2.2. Interrupt notification facilities.

Two interrupt notification facilities are introduced: hvm_isa_irq_assert_cb()
and hvm_register_intr_en_notif().

The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the vioapic.
hvm_isa_irq_assert_cb allows a callback to be passed along to vioapic_deliver()
and this callback is called with a mask of the vcpus which will get the
interrupt. This callback is made before any vcpus receive an interrupt.

Vhpet uses hvm_register_intr_en_notif() to register a handler for a particular
vector that will be called when that vector is injected in [vmx,svm]_intr_assist()
and also when the guest finishes handling the interrupt. Here finished is defined
as the point when the guest re-enables interrupts or lowers the tpr value.
EOI is not used as the end of interrupt as this is sometimes returned before
the interrupt handler has done its work. A flag is passed to the handler indicating
whether this is the injection point (post = 1) or the interrupt finished (post = 0) point.
The need for the finished point callback is discussed in the missed ticks policy section.

To prevent a possible early trigger of the finished callback, intr_en_notif logic
has a two-stage arm, the first at injection (hvm_intr_en_notif_arm()) and the second when
interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling
interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt
callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by
[vmx,svm]_intr_assist().
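
The two-stage arm can be pictured as a small state machine. The sketch below
is a simplified model under the assumption that the guest's interrupt-enable
state is sampled on each pass through the intr_assist path. The type and
function names (notif_arm(), notif_check()) are hypothetical and do not match
the patch's actual functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum notif_state {
    NOTIF_IDLE,            /* nothing outstanding */
    NOTIF_ARMED_INJECTED,  /* stage one: vector was injected */
    NOTIF_ARMED_FULL       /* stage two: interrupts were seen disabled */
};

struct intr_notif {
    enum notif_state state;
    int finished_calls;    /* number of "finished" (post = 0) callbacks made */
};

/* Stage one, taken at injection time. */
static void notif_arm(struct intr_notif *n)
{
    assert(n != NULL);
    n->state = NOTIF_ARMED_INJECTED;
}

/* Called with the guest's current interrupt-enable state. The finished
 * callback fires only after interrupts have first been seen disabled and
 * are then seen enabled again, which prevents an early trigger. */
static void notif_check(struct intr_notif *n, bool guest_intr_enabled)
{
    assert(n != NULL);
    switch (n->state) {
    case NOTIF_ARMED_INJECTED:
        if (!guest_intr_enabled)
            n->state = NOTIF_ARMED_FULL;
        break;
    case NOTIF_ARMED_FULL:
        if (guest_intr_enabled) {
            n->finished_calls++;   /* stands in for the real callback */
            n->state = NOTIF_IDLE;
        }
        break;
    case NOTIF_IDLE:
        break;
    }
}
```

With a single-stage arm, a check that happened to run before the guest
actually entered its handler would see interrupts enabled and report the
interrupt as finished too early.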

3. Interrupt delivery policies

The existing hpet interrupt delivery is preserved. This includes
vcpu round robin delivery used by Linux and broadcast delivery used by Windows.

There are two policies for interrupt delivery, one for Windows 2k8-64 and the other
for Linux. The Linux policy takes advantage of the (guest) Linux missed tick and offset
calculations and does not attempt to deliver the right number of interrupts.
The Windows policy delivers the correct number of interrupts, even if sometimes much
closer to each other than the period. The policies are similar to those in vpt.c, though
there are some important differences.

Policies are selected with an HVMOP_set_param hypercall with index HVM_PARAM_TIMER_MODE.
Two new values are added, HVM_HPET_guest_computes_missed_ticks and
HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that two new ones are added is that
in some guests (32bit Linux) a no-missed policy is needed for clock sources other than hpet
and a missed ticks policy for hpet. It was felt that there would be less confusion by simply
introducing the two hpet policies.

3.1. The missed ticks policy

The Linux clock interrupt handler for hpet calculates missed ticks and offset using the hpet
main counter. The algorithm works well when the time since the last interrupt is greater than
or equal to a period and poorly otherwise.

The missed ticks policy ensures that no two clock interrupts are delivered to the guest at
a time interval less than a period. A time stamp (hpet main counter value) is recorded (by a
callback registered with hvm_register_intr_en_notif) when Linux finishes handling the clock
interrupt. Then, ensuing interrupts are delivered to the vioapic only if the current main
counter value is a period greater than when the last interrupt was handled.
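
As an illustration of the rule above, here is a minimal sketch of the delivery
test. The structure and function names are hypothetical. The comparison relies
on the main counter being monotonic, so the unsigned subtraction never sees
mc_now behind the recorded stamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct missed_ticks_state {
    uint64_t period;        /* timer period in main-counter ticks */
    uint64_t last_handled;  /* main counter when the guest finished the
                               previous clock interrupt */
};

/* Finished-point callback (post = 0): record the end-of-interrupt stamp. */
static void on_interrupt_finished(struct missed_ticks_state *s,
                                  uint64_t mc_now)
{
    s->last_handled = mc_now;
}

/* Deliver to the vioapic only if at least one full period has elapsed
 * since the guest finished handling the previous interrupt. */
static bool should_deliver(const struct missed_ticks_state *s,
                           uint64_t mc_now)
{
    assert(s->period > 0);
    return mc_now - s->last_handled >= s->period;
}
```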

Tests showed a significant improvement in clock drift with end of interrupt time stamps
versus beginning of interrupt[1]. It is believed that the reason for the improvement
is that the clock interrupt handler goes for a spinlock and can therefore be delayed in its
processing. Furthermore, the main counter is read by the guest under the lock. The net
effect is that if we time stamp injection, we can get the difference in time
between successive interrupt handler lock acquisitions to be less than the period.

3.2. The no-missed ticks policy

Windows 2k8-64 keeps very poor time with the missed ticks policy, so the no-missed ticks policy
was developed. In the no-missed ticks policy we deliver the correct number of interrupts,
even if they are spaced less than a period apart (when catching up).

Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
all vcpus get the clock interrupt. The best Windows drift performance was achieved when the
policy code ensured that all the previous interrupts (on the various vcpus) had been injected
before injecting the next interrupt to the vioapic.

The policy code works as follows. It uses the hvm_isa_irq_assert_cb() to record
the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered
with hvm_register_intr_en_notif() at post=1 time it clears the current vcpu in the pending_mask.
When the pending_mask is clear it decrements hpet.intr_pending_nr and if intr_pending_nr is still
non-zero posts another interrupt to the ioapic with hvm_isa_irq_assert_cb().
Intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().
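
The bookkeeping just described can be modelled as below. This is a sketch
under assumptions, not the patch's code: the function names are hypothetical,
the ioapic is reduced to a counter, and the rule that the first interrupt is
posted when intr_pending_nr goes from zero to one is an assumption, since the
text above only says where the counter is incremented.

```c
#include <assert.h>
#include <stdint.h>

struct no_missed_state {
    uint64_t pending_mask;        /* vcpus still awaiting injection */
    unsigned int intr_pending_nr; /* interrupts still owed to the guest */
    unsigned int asserts;         /* times the model ioapic was asserted */
    uint64_t route_mask;          /* vcpus the interrupt is routed to */
};

/* Post one interrupt; the delivery callback would record the target
 * vcpus in pending_mask before any vcpu receives the interrupt. */
static void post_interrupt(struct no_missed_state *s)
{
    s->asserts++;
    s->pending_mask = s->route_mask;
}

/* Timer period elapsed: one more interrupt is owed.
 * ASSUMPTION: post immediately only if nothing is already in flight. */
static void on_timer_tick(struct no_missed_state *s)
{
    if (s->intr_pending_nr++ == 0)
        post_interrupt(s);
}

/* Injection-point callback (post = 1) for one vcpu. */
static void on_injected(struct no_missed_state *s, unsigned int vcpu)
{
    assert(vcpu < 64);
    uint64_t bit = UINT64_C(1) << vcpu;

    if (!(s->pending_mask & bit))
        return;                   /* not a vcpu we were waiting for */

    s->pending_mask &= ~bit;
    if (s->pending_mask == 0) {
        /* Every targeted vcpu has been injected: retire this interrupt
         * and post the next one if the guest is still owed ticks. */
        assert(s->intr_pending_nr > 0);
        if (--s->intr_pending_nr != 0)
            post_interrupt(s);
    }
}
```

In this model three ticks that pile up while the guest is descheduled still
produce exactly three asserts, each posted only after the previous one has
been injected on every targeted vcpu.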

The missed ticks policy intr_en_notif callback also uses the pending_mask method. So even though
Linux does not broadcast its interrupts, the code could handle it if it did.
In this case the end of interrupt time stamp is made when the pending_mask is clear.

4. Live Migration

Live migration with hpet preserves the current offset of the guest clock with respect
to ntp. This is accomplished by migrating all of the state in the h->hpet data structure
in the usual way. The hp->mc_offset is recalculated on the receiving node so that the
guest sees a continuous hpet main counter.

Code has been added to xc_domain_save.c to send a small message after the
domain context is sent. The content of the message is the physical tsc timestamp, last_tsc,
read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c,
another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain
structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext
is called so that hpet_load has access to these time stamps. Hpet_load uses the timestamps
to account for the time spent saving and loading the domain context. With this technique,
the only neglected time is the time spent sending a small network message.
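
A rough sketch of how the offset might be recomputed so the guest sees a
continuous main counter. It assumes the guest-visible counter is the host
main counter plus an offset, and it takes the time spent saving and loading
as an already-converted tick count (downtime_ticks). How the patch derives
that value from last_tsc_sender and first_tsc_receiver is not described here,
so that step is left out. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: guest-visible counter = host main counter + offset,
 * using wrapping unsigned arithmetic. */
static uint64_t guest_mc(uint64_t host_mc, uint64_t mc_offset)
{
    return host_mc + mc_offset;
}

/* Recompute the offset on the receiving node so that the guest counter
 * resumes at its saved value plus the time accounted to save and load. */
static uint64_t recompute_mc_offset(uint64_t guest_mc_at_save,
                                    uint64_t downtime_ticks,
                                    uint64_t host_mc_at_load)
{
    return guest_mc_at_save + downtime_ticks - host_mc_at_load;
}
```

Because the arithmetic wraps, the result is correct even when the receiving
host's counter is ahead of the guest's saved counter value.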

5. Test Results

Some recent test results are:

5.1 Linux 4u664 and Windows 2k8-64 load test.
      Duration: 70 hours.
      Test date: 6/2/08
      Loads: usex -b48 on Linux; burn-in on Windows
      Guest vcpus: 8 for Linux; 2 for Windows
      Hardware: 8 physical cpu AMD
      Clock drift : Linux: .0012% Windows: .009%

5.2 Linux 4u664, Linux 4u464, and Windows 2k8-64 no-load test
      Duration: 23 hours.
      Test date: 6/3/08
      Loads: none
      Guest vcpus: 8 for each Linux; 2 for Windows
      Hardware: 4 physical cpu AMD
      Clock drift : Linux: .033% Windows: .019%

6. Relation to recent work in xen-unstable

There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter()
in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no
"set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter()
itself. It uses no vcpu or domain state as it is a physical entity, in either mode. And it provides a real
physical mode for every read for those applications that desire this.

7. Conclusion

The virtual hpet is improved by this patch in terms of accuracy and monotonicity.
Tests performed to date verify this and more testing is under way.

8. Future Work

Testing with Windows Vista will be performed soon. The reason for accuracy variations
on different platforms using the physical hpet device will be investigated.
Additional overhead measurements on simulated vs physical hpet mode will be made.

Footnotes:

1. I don't recall the accuracy improvement with end of interrupt stamping, but it was
significant, perhaps better than a two-to-one improvement. It would be a very simple matter
to re-measure the improvement as the facility can call back at injection time as well.

Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
Signed-off-by: Ben Guthro <bguthro@virtualiron.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-05 14:59 [PATCH 0/2] Improve hpet accuracy Ben Guthro
@ 2008-06-06  8:58 ` Keir Fraser
  2008-06-06 10:45   ` Dave Winchell
  2008-06-06 15:35 ` Steven Hand
  1 sibling, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-06  8:58 UTC
  To: Ben Guthro, xen-devel; +Cc: dan.magenheimer@oracle.com


Are these patches needed now the timers are built on Xen system time rather
than host TSC? Dan has reported much better time-keeping with his patch
checked in, and it's for sure a lot less invasive than this patchset.

 -- Keir


* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-06  8:58 ` Keir Fraser
@ 2008-06-06 10:45   ` Dave Winchell
  2008-06-06 15:53     ` Dan Magenheimer
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-06 10:45 UTC
  To: Keir Fraser, Ben Guthro, xen-devel; +Cc: dan.magenheimer, Dave Winchell


Keir,

I think the changes are required. We'll run some tests today so
that we have some data to talk about.

-Dave


-----Original Message-----
From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
 
Are these patches needed now the timers are built on Xen system time rather
than host TSC? Dan has reported much better time-keeping with his patch
checked in, and it¹s for sure a lot less invasive than this patchset.


 -- Keir

On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:

> 
> 1. Introduction
> 
> This patch improves the hpet based guest clock in terms of drift and
> monotonicity.
> Prior to this work the drift with hpet was greater than 2%, far above the .05%
> limit
> for ntp to synchronize. With this code, the drift ranges from .001% to .0033%
> depending
> on guest and physical platform.
> 
> Using hpet allows guest operating systems to provide monotonic time to their
> applications. Time sources other than hpet are not monotonic because
> of their reliance on tsc, which is not synchronized across physical
> processors.
> 
> Windows 2k864 and many Linux guests are supported with two policies, one for
> guests
> that handle missed clock interrupts and the other for guests that require the
> correct number of interrupts.
> 
> Guests may use hpet for the timing source even if the physical platform has no
> visible
> hpet. Migration is supported between physical machines which differ in
> physical
> hpet visibility.
> 
> Most of the changes are in hpet.c. Two general facilities are added to track
> interrupt
> progress. The ideas here and the facilities would be useful in vpt.c, for
> other time
> sources, though no attempt is made here to improve vpt.c.
> 
> The following sections discuss hpet dependencies, interrupt delivery policies,
> live migration,
> test results, and relation to recent work with monotonic time.
> 
> 
> 2. Virtual Hpet dependencies
> 
> The virtual hpet depends on the ability to read the physical or simulated
> (see discussion below) hpet.  For timekeeping, the virtual hpet also depends
> on two new interrupt notification facilities to implement its policies for
> interrupt delivery.
> 
> 2.1. Two modes of low-level hpet main counter reads.
> 
> In this implementation, the virtual hpet reads with read_64_main_counter(),
> exported by
> time.c, either the real physical hpet main counter register directly or a
> "simulated"
> hpet main counter.
> 
> The simulated mode uses a monotonic version of get_s_time() (NOW()), where the
> last
> time value is returned whenever the current time value is less than the last
> time
> value. In simulated mode, since it is layered on s_time, the underlying
> hardware
> can be hpet or some other device. The frequency of the main counter in
> simulated
> mode is the same as the standard physical hpet frequency, allowing live
> migration
> between nodes that are configured differently.
> 
> If the physical platform does not have an hpet device, or if xen is configured
> not
> to use the device, then the simulated method is used. If there is a physical
> hpet device,
> and xen has initialized it, then either simulated or physical mode can be
> used.
> This is governed by a boot time option, hpet-avoid. Setting this option to 1
> gives the
> simulated mode and 0 the physical mode. The default is physical mode.
> 
> A disadvantage of the physical mode is that may take longer to read the device
> than in simulated mode. On some platforms the cost is about the same (less
> than 250 nsec) for
> physical and simulated modes, while on others physical cost is much higher
> than simulated.
> A disadvantage of the simulated mode is that it can return the same value
> for the counter in consecutive calls.
> 
> 2.2. Interrupt notification facilities.
> 
> Two interrupt notification facilities are introduced, one is
> hvm_isa_irq_assert_cb()
> and the other hvm_register_intr_en_notif().
> 
> The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the vioapic.
> hvm_isa_irq_assert_cb allows a callback to be passed along to
> vioapic_deliver()
> and this callback is called with a mask of the vcpus which will get the
> interrupt. This callback is made before any vcpus receive an interrupt.
> 
> Vhpet uses hvm_register_intr_en_notif() to register a handler for a particular
> vector that will be called when that vector is injected in
> [vmx,svm]_intr_assist()
> and also when the guest finishes handling the interrupt. Here finished is
> defined
> as the point when the guest re-enables interrupts or lowers the tpr value.
> EOI is not used as the end of interrupt as this is sometimes returned before
> the interrupt handler has done its work. A flag is passed to the handler
> indicating
> whether this is the injection point (post = 1) or the interrupt finished (post
> = 0) point.
> The need for the finished point callback is discussed in the missed ticks
> policy section.
> 
> To prevent a possible early trigger of the finished callback, intr_en_notif
> logic
> has a two stage arm, the first at injection (hvm_intr_en_notif_arm()) and the
> second when
> interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully
> armed, re-enabling
> interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt
> callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by
> [vmx,svm]_intr_assist().
> 
> 3. Interrupt delivery policies
> 
> The existing hpet interrupt delivery is preserved. This includes
> vcpu round robin delivery used by Linux and broadcast delivery used by
> Windows.
> 
> There are two policies for interrupt delivery, one for Windows 2k8-64 and
> the other for Linux. The Linux policy takes advantage of the (guest) Linux
> missed tick and offset calculations and does not attempt to deliver the
> right number of interrupts. The Windows policy delivers the correct number
> of interrupts, even if they are sometimes much closer to each other than
> the period. The policies are similar to those in vpt.c, though there are
> some important differences.
> 
> Policies are selected with an HVMOP_set_param hypercall with index
> HVM_PARAM_TIMER_MODE. Two new values are added,
> HVM_HPET_guest_computes_missed_ticks and
> HVM_HPET_guest_does_not_compute_missed_ticks. The reason for adding two
> new values is that some guests (32-bit Linux) need a no-missed policy for
> clock sources other than hpet and a missed ticks policy for hpet. It was
> felt that simply introducing the two hpet policies would cause less
> confusion.
> 
> 3.1. The missed ticks policy
> 
> The Linux clock interrupt handler for hpet calculates missed ticks and
> offset using the hpet main counter. The algorithm works well when the time
> since the last interrupt is greater than or equal to a period and poorly
> otherwise.
> 
> The missed ticks policy ensures that no two clock interrupts are delivered
> to the guest less than a period apart. A time stamp (hpet main counter
> value) is recorded (by a callback registered with
> hvm_register_intr_en_notif()) when Linux finishes handling the clock
> interrupt. Ensuing interrupts are then delivered to the vioapic only if
> the current main counter value is at least a period greater than the value
> recorded when the last interrupt was handled.
> 
> Tests showed a significant improvement in clock drift with end of
> interrupt time stamps versus beginning of interrupt[1]. It is believed
> that the reason for the improvement is that the clock interrupt handler
> takes a spinlock and can therefore be delayed in its processing.
> Furthermore, the main counter is read by the guest under the lock. The net
> effect is that if we time stamp injection, the difference in time between
> successive interrupt handler lock acquisitions can be less than the
> period.
> 
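The gating rule of the missed ticks policy in 3.1 can be sketched as follows. This is a standalone Python model under stated assumptions, not the patch's code: the class and method names are hypothetical, counter values are plain integers, and wrap-around of the main counter is ignored.

```python
class MissedTicksGate:
    """Hypothetical model: deliver a tick only a full period after the
    guest finished handling the previous one."""

    def __init__(self, period):
        self.period = period      # timer period in main counter ticks
        self.last_finish = None   # counter value at last finished callback

    def on_interrupt_finished(self, main_counter):
        # Would be driven by the post = 0 callback in the real facility.
        self.last_finish = main_counter

    def may_deliver(self, main_counter):
        if self.last_finish is None:
            return True           # nothing delivered yet
        return main_counter - self.last_finish >= self.period


gate = MissedTicksGate(period=100)
first = gate.may_deliver(0)       # first tick is always allowed
gate.on_interrupt_finished(130)   # guest handler finished late, at 130
too_soon = gate.may_deliver(200)  # only 70 ticks since finish: held back
in_time = gate.may_deliver(230)   # a full period since finish: delivered
```

Because the reference point is the end of handling (130) rather than the injection time (0), the tick at 200 is suppressed and the guest never sees two interrupts less than a period apart.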
> 3.2. The no-missed ticks policy
> 
> Windows 2k8-64 keeps very poor time with the missed ticks policy, so the
> no-missed ticks policy was developed. In the no-missed ticks policy we
> deliver the correct number of interrupts, even if they are spaced less
> than a period apart (when catching up).
> 
> Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
> all vcpus get the clock interrupt. The best Windows drift performance was
> achieved when the policy code ensured that all the previous interrupts (on
> the various vcpus) had been injected before injecting the next interrupt
> to the vioapic.
> 
> The policy code works as follows. It uses hvm_isa_irq_assert_cb() to
> record the vcpus to be interrupted in h->hpet.pending_mask. Then, in the
> callback registered with hvm_register_intr_en_notif(), at post=1 time it
> clears the current vcpu in the pending_mask. When the pending_mask is
> clear it decrements hpet.intr_pending_nr and, if intr_pending_nr is still
> non-zero, posts another interrupt to the ioapic with
> hvm_isa_irq_assert_cb(). intr_pending_nr is incremented in
> hpet_route_decision_not_missed_ticks().
> 
> The missed ticks policy intr_en_notif callback also uses the pending_mask
> method. So even though Linux does not broadcast its interrupts, the code
> could handle it if it did. In this case the end of interrupt time stamp is
> made when the pending_mask is clear.
> 
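The pending_mask and intr_pending_nr bookkeeping of 3.2 can be modelled like this. It is a standalone Python sketch, not the patch's code: the class and method names are hypothetical, a set stands in for the vcpu bit mask, and the rule that a new tick posts immediately only when nothing is in flight is an assumption made for the model.

```python
class NoMissedTicks:
    """Hypothetical model of the no-missed ticks bookkeeping."""

    def __init__(self, assert_irq):
        # assert_irq() stands in for hvm_isa_irq_assert_cb(); it returns
        # the set of vcpus that will receive the interrupt.
        self.assert_irq = assert_irq
        self.pending_mask = set()
        self.intr_pending_nr = 0

    def tick(self):
        # One more interrupt is owed to the guest.
        self.intr_pending_nr += 1
        if self.intr_pending_nr == 1:  # assumption: nothing in flight
            self._post()

    def _post(self):
        self.pending_mask = set(self.assert_irq())

    def on_injected(self, vcpu):
        # Would be driven by the post = 1 callback for this vcpu.
        if vcpu not in self.pending_mask:
            return
        self.pending_mask.discard(vcpu)
        if not self.pending_mask:
            self.intr_pending_nr -= 1
            if self.intr_pending_nr > 0:
                self._post()           # catch up with the next owed tick


posts = []

def broadcast():
    posts.append(len(posts))
    return {0, 1}                      # two vcpus, broadcast delivery

policy = NoMissedTicks(broadcast)
for _ in range(3):                     # three periods elapse back to back
    policy.tick()
posted_before_injection = len(posts)   # only the first one is posted
for _ in range(3):
    policy.on_injected(0)
    policy.on_injected(1)
```

All three owed interrupts are eventually posted, but each one only after every vcpu has taken the previous one.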
> 4. Live Migration
> 
> Live migration with hpet preserves the current offset of the guest clock
> with respect to ntp. This is accomplished by migrating all of the state in
> the h->hpet data structure in the usual way. The hp->mc_offset is
> recalculated on the receiving node so that the guest sees a continuous
> hpet main counter.
> 
> Code has been added to xc_domain_save.c to send a small message after the
> domain context is sent. The message contains the physical tsc timestamp,
> last_tsc, read just before the message is sent. When the last_tsc message
> is received in xc_domain_restore.c, another physical tsc timestamp,
> cur_tsc, is read. The two timestamps are loaded into the domain structure
> as last_tsc_sender and first_tsc_receiver with hypercalls. Then
> xc_domain_hvm_setcontext is called so that hpet_load has access to these
> time stamps. hpet_load uses the timestamps to account for the time spent
> saving and loading the domain context. With this technique, the only
> neglected time is the time spent sending a small network message.
> 
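The posting does not show the arithmetic hpet_load uses, so the following is only one way such an offset could be recomputed, written as a standalone Python sketch. Everything here is an assumption for illustration: the function and parameter names, the relation guest counter = physical counter + mc_offset, the 14.31818 MHz counter frequency, and the availability of a tsc reading at hpet save time on the sender and at hpet load time on the receiver.

```python
HPET_HZ = 14_318_180  # assumed main counter frequency, illustration only


def tsc_to_hpet_ticks(tsc_delta, tsc_hz, hpet_hz=HPET_HZ):
    """Convert elapsed tsc cycles into main counter ticks."""
    return tsc_delta * hpet_hz // tsc_hz


def recompute_mc_offset(saved_guest_counter, sender_gap_ticks,
                        receiver_gap_ticks, receiver_physical_counter):
    """Pick an offset so the guest counter continues where it left off,
    advanced by the time spent saving (sender) and loading (receiver)."""
    expected_guest_counter = (saved_guest_counter
                              + sender_gap_ticks
                              + receiver_gap_ticks)
    return expected_guest_counter - receiver_physical_counter


# Sender: 2,000,000 cycles at 2 GHz between hpet save and last_tsc_sender.
sender_gap = tsc_to_hpet_ticks(2_000_000, 2_000_000_000)
# Receiver: 3,000,000 cycles at 3 GHz between first_tsc_receiver and load.
receiver_gap = tsc_to_hpet_ticks(3_000_000, 3_000_000_000)

mc_offset = recompute_mc_offset(saved_guest_counter=1_000_000,
                                sender_gap_ticks=sender_gap,
                                receiver_gap_ticks=receiver_gap,
                                receiver_physical_counter=500)
guest_counter_after_load = 500 + mc_offset
```

Each tsc delta is converted with its own host's tsc frequency, since the two timestamps come from different machines and cannot be subtracted from each other directly.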
> 5. Test Results
> 
> Some recent test results are:
> 
> 5.1 Linux 4u664 and Windows 2k8-64 load test
>       Duration: 70 hours.
>       Test date: 6/2/08
>       Loads: usex -b48 on Linux; burn-in on Windows
>       Guest vcpus: 8 for Linux; 2 for Windows
>       Hardware: 8 physical cpu AMD
>       Clock drift : Linux: .0012% Windows: .009%
> 
> 5.2 Linux 4u664, Linux 4u464, and Windows 2k8-64 no-load test
>       Duration: 23 hours.
>       Test date: 6/3/08
>       Loads: none
>       Guest vcpus: 8 for each Linux; 2 for Windows
>       Hardware: 4 physical cpu AMD
>       Clock drift : Linux: .033% Windows: .019%
> 
> 6. Relation to recent work in xen-unstable
> 
> There is a similarity between hvm_get_guest_time() in xen-unstable and
> read_64_main_counter() in this code. However, read_64_main_counter() is
> more tuned to the needs of hpet.c. It has no "set" operation, only the
> get. It isolates the mode, physical or simulated, in
> read_64_main_counter() itself. It uses no vcpu or domain state as it is a
> physical entity, in either mode. And it provides a real physical mode for
> every read for those applications that desire this.
> 
> 7. Conclusion
> 
> The virtual hpet is improved by this patch in terms of accuracy and
> monotonicity. Tests performed to date verify this and more testing is
> under way.
> 
> 8. Future Work
> 
> Testing with Windows Vista will be performed soon. The reason for accuracy
> variations on different platforms using the physical hpet device will be
> investigated. Additional overhead measurements on simulated vs physical
> hpet mode will be made.
> 
> Footnotes:
> 
> 1. I don't recall the exact accuracy improvement with end of interrupt
> stamping, but it was significant, perhaps better than a two to one
> improvement. It would be a very simple matter to re-measure the
> improvement as the facility can call back at injection time as well.
> 
> 
> Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel





^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-05 14:59 [PATCH 0/2] Improve hpet accuracy Ben Guthro
  2008-06-06  8:58 ` Keir Fraser
@ 2008-06-06 15:35 ` Steven Hand
  2008-06-06 17:34   ` Dave Winchell
  1 sibling, 1 reply; 51+ messages in thread
From: Steven Hand @ 2008-06-06 15:35 UTC (permalink / raw)
  To: Ben Guthro; +Cc: xen-devel, Steven.Hand


This seems to break the save/restore format (in at least two places)...


S.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 10:45   ` Dave Winchell
@ 2008-06-06 15:53     ` Dan Magenheimer
  2008-06-06 17:54       ` Dave Winchell
  2008-06-06 19:33       ` Dave Winchell
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-06 15:53 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser, Ben Guthro, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 14624 bytes --]

Hi Dave and Ben --

When running tests on xen-unstable (without your patch), please ensure that hpet=1 is set in the hvm config. Also, I think that when hpet is the clocksource on RHEL4-32, the clock IS resilient to missed ticks, so timer_mode should be 2 (whereas when pit is the clocksource on RHEL4-32, all clock ticks must be delivered, so timer_mode should be 0).

Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's my intent to clean this up, but I won't get to it until next week.
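
For illustration, the two settings mentioned above might appear in an HVM guest configuration file as shown below. This is only a sketch: it assumes an xm-style config file (which uses Python syntax) and uses just the option names and values given in this message.

```python
# Fragment of an HVM guest config file (assumed xm-style, Python syntax).
hpet = 1        # enable the virtual hpet for the guest
timer_mode = 2  # value suggested above for hpet clocksource on RHEL4-32
```

With pit as the clocksource on the same guest, timer_mode would be 0 instead, per the reasoning above.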

Thanks,
Dan
  -----Original Message-----
  From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of Dave Winchell
  Sent: Friday, June 06, 2008 4:46 AM
  To: Keir Fraser; Ben Guthro; xen-devel
  Cc: dan.magenheimer@oracle.com; Dave Winchell
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


  Keir,

  I think the changes are required. We'll run some tests today so
  that we have some data to talk about.

  -Dave


  -----Original Message-----
  From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
  Sent: Fri 6/6/2008 4:58 AM
  To: Ben Guthro; xen-devel
  Cc: dan.magenheimer@oracle.com
  Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

  Are these patches needed now the timers are built on Xen system time rather
  than host TSC? Dan has reported much better time-keeping with his patch
  checked in, and it's for sure a lot less invasive than this patchset.


   -- Keir







^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 15:35 ` Steven Hand
@ 2008-06-06 17:34   ` Dave Winchell
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-06 17:34 UTC (permalink / raw)
  To: Steven Hand; +Cc: Dave Winchell, xen-devel, Ben Guthro

Steven Hand wrote:

>This seems to break the save/restore format (in at least two places)...
>  
>
Steven,

Can you give me more information on this? What sort of failures are you 
seeing?

thanks,
Dave

>
>S.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 15:53     ` Dan Magenheimer
@ 2008-06-06 17:54       ` Dave Winchell
  2008-06-06 19:33       ` Dave Winchell
  1 sibling, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-06 17:54 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com
  Cc: Dave Winchell, xen-devel, Keir Fraser, Ben Guthro

Hi Dan,

I am running with hpet=1 and timer_mode=2. I don't see where timer_mode
is checked for hpet timekeeping but I set it nevertheless.

thanks,
Dave


Dan Magenheimer wrote:

> Hi Dave and Ben --
>  
> When running tests on xen-unstable (without your patch), please ensure 
> that hpet=1 is set in the hvm config and also I think that when hpet 
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks 
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32, 
> all clock ticks must be delivered and so timer_mode should be 0).
>  
> Per 
> http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's 
> my intent to clean this up, but I won't get to it until next week.
>  
> Thanks,
> Dan
>
>     -----Original Message-----
>     *From:* xen-devel-bounces@lists.xensource.com
>     [mailto:xen-devel-bounces@lists.xensource.com]*On Behalf Of *Dave
>     Winchell
>     *Sent:* Friday, June 06, 2008 4:46 AM
>     *To:* Keir Fraser; Ben Guthro; xen-devel
>     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
>     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>     Keir,
>
>     I think the changes are required. We'll run some tests today so
>     that we have some data to talk about.
>
>     -Dave
>
>
>     -----Original Message-----
>     From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
>     Sent: Fri 6/6/2008 4:58 AM
>     To: Ben Guthro; xen-devel
>     Cc: dan.magenheimer@oracle.com
>     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>     Are these patches needed now the timers are built on Xen system
>     time rather
>     than host TSC? Dan has reported much better time-keeping with his
>     patch
>     checked in, and it's for sure a lot less invasive than this patchset.
>
>
>      -- Keir
>
>     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
>
>     >
>     > 1. Introduction
>     >
>     > This patch improves the hpet based guest clock in terms of drift and
>     > monotonicity.
>     > Prior to this work the drift with hpet was greater than 2%, far
>     above the .05%
>     > limit
>     > for ntp to synchronize. With this code, the drift ranges from
>     .001% to .0033%
>     > depending
>     > on guest and physical platform.
>     >
>     > Using hpet allows guest operating systems to provide monotonic
>     time to their
>     > applications. Time sources other than hpet are not monotonic because
>     > of their reliance on tsc, which is not synchronized across physical
>     > processors.
>     >
>     > Windows 2k864 and many Linux guests are supported with two
>     policies, one for
>     > guests
>     > that handle missed clock interrupts and the other for guests
>     that require the
>     > correct number of interrupts.
>     >
>     > Guests may use hpet for the timing source even if the physical
>     platform has no
>     > visible
>     > hpet. Migration is supported between physical machines which
>     differ in
>     > physical
>     > hpet visibility.
>     >
>     > Most of the changes are in hpet.c. Two general facilities are
>     added to track
>     > interrupt
>     > progress. The ideas here and the facilities would be useful in
>     vpt.c, for
>     > other time
>     > sources, though no attempt is made here to improve vpt.c.
>     >
>     > The following sections discuss hpet dependencies, interrupt
>     delivery policies,
>     > live migration,
>     > test results, and relation to recent work with monotonic time.
>     >
>     >
>     > 2. Virtual Hpet dependencies
>     >
>     > The virtual hpet depends on the ability to read the physical or
>     simulated
>     > (see discussion below) hpet.  For timekeeping, the virtual hpet
>     also depends
>     > on two new interrupt notification facilities to implement its
>     policies for
>     > interrupt delivery.
>     >
>     > 2.1. Two modes of low-level hpet main counter reads.
>     >
>     > In this implementation, the virtual hpet reads with
>     read_64_main_counter(),
>     > exported by
>     > time.c, either the real physical hpet main counter register
>     directly or a
>     > "simulated"
>     > hpet main counter.
>     >
>     > The simulated mode uses a monotonic version of get_s_time()
>     (NOW()), where the
>     > last
>     > time value is returned whenever the current time value is less
>     than the last
>     > time
>     > value. In simulated mode, since it is layered on s_time, the
>     underlying
>     > hardware
>     > can be hpet or some other device. The frequency of the main
>     counter in
>     > simulated
>     > mode is the same as the standard physical hpet frequency,
>     allowing live
>     > migration
>     > between nodes that are configured differently.
>     >
>     > If the physical platform does not have an hpet device, or if xen
>     is configured
>     > not
>     > to use the device, then the simulated method is used. If there
>     is a physical
>     > hpet device,
>     > and xen has initialized it, then either simulated or physical
>     mode can be
>     > used.
>     > This is governed by a boot time option, hpet-avoid. Setting this
>     option to 1
>     > gives the
>     > simulated mode and 0 the physical mode. The default is physical
>     mode.
>     >
>     > A disadvantage of the physical mode is that it may take longer to
>     read the device
>     > than in simulated mode. On some platforms the cost is about the
>     same (less
>     > than 250 nsec) for
>     > physical and simulated modes, while on others physical cost is
>     much higher
>     > than simulated.
>     > A disadvantage of the simulated mode is that it can return the
>     same value
>     > for the counter in consecutive calls.
>     >
>     > 2.2. Interrupt notification facilities.
>     >
>     > Two interrupt notification facilities are introduced, one is
>     > hvm_isa_irq_assert_cb()
>     > and the other hvm_register_intr_en_notif().
>     >
>     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
>     the vioapic.
>     > hvm_isa_irq_assert_cb allows a callback to be passed along to
>     > vioapic_deliver()
>     > and this callback is called with a mask of the vcpus which will
>     get the
>     > interrupt. This callback is made before any vcpus receive an
>     interrupt.
>     >
>     > Vhpet uses hvm_register_intr_en_notif() to register a handler
>     for a particular
>     > vector that will be called when that vector is injected in
>     > [vmx,svm]_intr_assist()
>     > and also when the guest finishes handling the interrupt. Here
>     finished is
>     > defined
>     > as the point when the guest re-enables interrupts or lowers the
>     tpr value.
>     > EOI is not used as the end of interrupt as this is sometimes
>     returned before
>     > the interrupt handler has done its work. A flag is passed to the
>     handler
>     > indicating
>     > whether this is the injection point (post = 1) or the interrupt
>     finished (post
>     > = 0) point.
>     > The need for the finished point callback is discussed in the
>     missed ticks
>     > policy section.
>     >
>     > To prevent a possible early trigger of the finished callback,
>     intr_en_notif
>     > logic
>     > has a two stage arm, the first at injection
>     (hvm_intr_en_notif_arm()) and the
>     > second when
>     > interrupts are seen to be disabled (hvm_intr_en_notif_disarm()).
>     Once fully
>     > armed, re-enabling
>     > interrupts will cause hvm_intr_en_notif_disarm() to make the end
>     of interrupt
>     > callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm()
>     are called by
>     > [vmx,svm]_intr_assist().
>     >
>     > 3. Interrupt delivery policies
>     >
>     > The existing hpet interrupt delivery is preserved. This includes
>     > vcpu round robin delivery used by Linux and broadcast delivery
>     used by
>     > Windows.
>     >
>     > There are two policies for interrupt delivery, one for Windows
>     2k8-64 and the
>     > other
>     > for Linux. The Linux policy takes advantage of the (guest) Linux
>     missed tick
>     > and offset
>     > calculations and does not attempt to deliver the right number of
>     interrupts.
>     > The Windows policy delivers the correct number of interrupts,
>     even if
>     > sometimes much
>     > closer to each other than the period. The policies are similar
>     to those in
>     > vpt.c, though
>     > there are some important differences.
>     >
>     > Policies are selected with an HVMOP_set_param hypercall with index
>     > HVM_PARAM_TIMER_MODE.
>     > Two new values are added, HVM_HPET_guest_computes_missed_ticks and
>     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
>     two new ones
>     > are added is that
>     > in some guests (32bit Linux) a no-missed policy is needed for
>     clock sources
>     > other than hpet
>     > and a missed ticks policy for hpet. It was felt that there would
>     be less
>     > confusion by simply
>     > introducing the two hpet policies.
>     >
>     > 3.1. The missed ticks policy
>     >
>     > The Linux clock interrupt handler for hpet calculates missed
>     ticks and offset
>     > using the hpet
>     > main counter. The algorithm works well when the time since the
>     last interrupt
>     > is greater than
>     > or equal to a period and poorly otherwise.
>     >
>     > The missed ticks policy ensures that no two clock interrupts are
>     delivered to
>     > the guest at
>     > a time interval less than a period. A time stamp (hpet main
>     counter value) is
>     > recorded (by a
>     > callback registered with hvm_register_intr_en_notif) when Linux
>     finishes
>     > handling the clock
>     > interrupt. Then, ensuing interrupts are delivered to the vioapic
>     only if the
>     > current main
>     > counter value is a period greater than when the last interrupt
>     was handled.
>     >
>     > Tests showed a significant improvement in clock drift with end
>     of interrupt
>     > time stamps
>     > versus beginning of interrupt[1]. It is believed that the reason
>     for the
>     > improvement
>     > is that the clock interrupt handler goes for a spinlock and can
>     be therefore
>     > delayed in its
>     > processing. Furthermore, the main counter is read by the guest
>     under the lock.
>     > The net
>     > effect is that if we time stamp injection, we can get the
>     difference in time
>     > between successive interrupt handler lock acquisitions to be
>     less than the
>     > period.
>     >
>     > 3.2. The no-missed ticks policy
>     >
>     > Windows 2k864 keeps very poor time with the missed ticks policy.
>     So the
>     > no-missed ticks policy
>     > was developed. In the no-missed ticks policy we deliver the
>     correct number of
>     > interrupts,
>     > even if they are spaced less than a period apart (when catching up).
>     >
>     > Windows 2k864 uses a broadcast mode in the interrupt routing
>     such that
>     > all vcpus get the clock interrupt. The best Windows drift
>     performance was
>     > achieved when the
>     > policy code ensured that all the previous interrupts (on the
>     various vcpus)
>     > had been injected
>     > before injecting the next interrupt to the vioapic.
>     >
>     > The policy code works as follows. It uses the
>     hvm_isa_irq_assert_cb() to
>     > record
>     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
>     the callback
>     > registered
>     > with hvm_register_intr_en_notif() at post=1 time it clears the
>     current vcpu in
>     > the pending_mask.
>     > When the pending_mask is clear it decrements
>     hpet.intr_pending_nr and if
>     > intr_pending_nr is still
>     > non-zero posts another interrupt to the ioapic with
>     hvm_isa_irq_assert_cb().
>     > Intr_pending_nr is incremented in
>     hpet_route_decision_not_missed_ticks().
>     >
>     > The missed ticks policy intr_en_notif callback also uses the
>     pending_mask
>     > method. So even though
>     > Linux does not broadcast its interrupts, the code could handle
>     it if it did.
>     > In this case the end of interrupt time stamp is made when the
>     pending_mask is
>     > clear.
>     >
>     > 4. Live Migration
>     >
>     > Live migration with hpet preserves the current offset of the
>     guest clock with
>     > respect
>     > to ntp. This is accomplished by migrating all of the state in
>     the h->hpet data
>     > structure
>     > in the usual way. The hp->mc_offset is recalculated on the
>     receiving node so
>     > that the
>     > guest sees a continuous hpet main counter.
>     >
>     > Code has been added to xc_domain_save.c to send a small message
>     after the
>     > domain context is sent. The contents of the message is the
>     physical tsc
>     > timestamp, last_tsc,
>     > read just before the message is sent. When the last_tsc message
>     is received in
>     > xc_domain_restore.c,
>     > another physical tsc timestamp, cur_tsc, is read. The two
>     timestamps are
>     > loaded into the domain
>     > structure as last_tsc_sender and first_tsc_receiver with
>     hypercalls. Then
>     > xc_domain_hvm_setcontext
>     > is called so that hpet_load has access to these time stamps.
>     Hpet_load uses
>     > the timestamps
>     > to account for the time spent saving and loading the domain
>     context. With this
>     > technique,
>     > the only neglected time is the time spent sending a small
>     network message.
>     >
>     > 5. Test Results
>     >
>     > Some recent test results are:
>     >
>     > 5.1 Linux 4u664 and Windows 2k864 load test.
>     >       Duration: 70 hours.
>     >       Test date: 6/2/08
>     >       Loads: usex -b48 on Linux; burn-in on Windows
>     >       Guest vcpus: 8 for Linux; 2 for Windows
>     >       Hardware: 8 physical cpu AMD
>     >       Clock drift : Linux: .0012% Windows: .009%
>     >
>     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
>     >       Duration: 23 hours.
>     >       Test date: 6/3/08
>     >       Loads: none
>     >       Guest vcpus: 8 for each Linux; 2 for Windows
>     >       Hardware: 4 physical cpu AMD
>     >       Clock drift : Linux: .033% Windows: .019%
>     >
>     > 6. Relation to recent work in xen-unstable
>     >
>     > There is a similarity between hvm_get_guest_time() in
>     xen-unstable and
>     > read_64_main_counter()
>     > in this code. However, read_64_main_counter() is more tuned to
>     the needs of
>     > hpet.c. It has no
>     > "set" operation, only the get. It isolates the mode, physical or
>     simulated, in
>     > read_64_main_counter()
>     > itself. It uses no vcpu or domain state as it is a physical
>     entity, in either
>     > mode. And it provides a real
>     > physical mode for every read for those applications that desire
>     this.
>     >
>     > 7. Conclusion
>     >
>     > The virtual hpet is improved by this patch in terms of accuracy and
>     > monotonicity.
>     > Tests performed to date verify this and more testing is under way.
>     >
>     > 8. Future Work
>     >
>     > Testing with Windows Vista will be performed soon. The reason
>     for accuracy
>     > variations
>     > on different platforms using the physical hpet device will be
>     investigated.
>     > Additional overhead measurements on simulated vs physical hpet
>     mode will be
>     > made.
>     >
>     > Footnotes:
>     >
>     > 1. I don't recall the accuracy improvement with end of interrupt
>     stamping, but
>     > it was
>     > significant, perhaps better than two to one improvement. It
>     would be a very
>     > simple matter
>     > to re-measure the improvement as the facility can call back at
>     injection time
>     > as well.
>     >
>     >
>     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
>     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
>     >
>     >
>     > _______________________________________________
>     > Xen-devel mailing list
>     > Xen-devel@lists.xensource.com
>     > http://lists.xensource.com/xen-devel
>
>
>


* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 15:53     ` Dan Magenheimer
  2008-06-06 17:54       ` Dave Winchell
@ 2008-06-06 19:33       ` Dave Winchell
  2008-06-06 20:29         ` Dan Magenheimer
  1 sibling, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-06 19:33 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Keir Fraser
  Cc: Dave Winchell, xen-devel, Ben Guthro

Dan, Keir:

Preliminary test results indicate an error of .1% for 64-bit Linux
guests configured for hpet with xen-unstable as is. As we have
discussed many times, the ntp requirement is .05%. Tests on the patch
we just submitted for hpet have indicated errors of .0012% on this
platform under similar test conditions and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits.
In an overnight test with our hpet patch, the Windows vista error was .008%.
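
To put the percentages above in more familiar units, here is a minimal
sketch that converts a drift percentage to parts per million and to
seconds per day, and compares it with the .05% (500 ppm) ntp limit. The
function names are illustrative only and are not part of the patch.

```c
#include <assert.h>

/* ntp tolerance as quoted in this thread: .05%, i.e. 500 ppm. */
#define NTP_LIMIT_PCT 0.05

/* Convert a drift given in percent to parts per million. */
double drift_pct_to_ppm(double pct)
{
    return pct * 10000.0;   /* 1% == 10,000 ppm */
}

/* Seconds gained or lost per day at the given drift percentage. */
double drift_pct_to_sec_per_day(double pct)
{
    return pct / 100.0 * 86400.0;
}

/* Nonzero if a drift of this size is within the ntp limit. */
int ntp_can_sync(double pct)
{
    assert(pct == pct);                    /* reject NaN */
    double mag = pct < 0.0 ? -pct : pct;   /* the sign does not matter here */
    return mag <= NTP_LIMIT_PCT;
}
```

With these definitions, .1% is 1000 ppm (about 86 seconds per day) and
falls outside the limit, while .0012% and .008% fall inside it.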

The tests are run with two or three guests on a physical node, all
under load, and with the ratio of vcpus to physical cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave


Dan Magenheimer wrote:

> Hi Dave and Ben --
>  
> When running tests on xen-unstable (without your patch), please ensure 
> that hpet=1 is set in the hvm config and also I think that when hpet 
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks 
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32, 
> all clock ticks must be delivered and so timer_mode should be 0).
>  
> Per 
> http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's 
> my intent to clean this up, but I won't get to it until next week.
>  
> Thanks,
> Dan
>
>     -----Original Message-----
>     From: xen-devel-bounces@lists.xensource.com
>     [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave
>     Winchell
>     Sent: Friday, June 06, 2008 4:46 AM
>     To: Keir Fraser; Ben Guthro; xen-devel
>     Cc: dan.magenheimer@oracle.com; Dave Winchell
>     Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>     Keir,
>
>     I think the changes are required. We'll run some tests today so
>     that we have some data to talk about.
>
>     -Dave
>
>
>     -----Original Message-----
>     From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
>     Sent: Fri 6/6/2008 4:58 AM
>     To: Ben Guthro; xen-devel
>     Cc: dan.magenheimer@oracle.com
>     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>     Are these patches needed now the timers are built on Xen system
>     time rather
>     than host TSC? Dan has reported much better time-keeping with his
>     patch
>     checked in, and it's for sure a lot less invasive than this patchset.
>
>
>      -- Keir
>
>     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
>
>     >
>     > 1. Introduction
>     >
>     > This patch improves the hpet based guest clock in terms of drift and
>     > monotonicity.
>     > Prior to this work the drift with hpet was greater than 2%, far
>     above the .05%
>     > limit
>     > for ntp to synchronize. With this code, the drift ranges from
>     .001% to .0033%
>     > depending
>     > on guest and physical platform.
>     >
>     > Using hpet allows guest operating systems to provide monotonic
>     time to their
>     > applications. Time sources other than hpet are not monotonic because
>     > of their reliance on tsc, which is not synchronized across physical
>     > processors.
>     >
>     > Windows 2k864 and many Linux guests are supported with two
>     policies, one for
>     > guests
>     > that handle missed clock interrupts and the other for guests
>     that require the
>     > correct number of interrupts.
>     >
>     > Guests may use hpet for the timing source even if the physical
>     platform has no
>     > visible
>     > hpet. Migration is supported between physical machines which
>     differ in
>     > physical
>     > hpet visibility.
>     >
>     > Most of the changes are in hpet.c. Two general facilities are
>     added to track
>     > interrupt
>     > progress. The ideas here and the facilities would be useful in
>     vpt.c, for
>     > other time
>     > sources, though no attempt is made here to improve vpt.c.
>     >
>     > The following sections discuss hpet dependencies, interrupt
>     delivery policies,
>     > live migration,
>     > test results, and relation to recent work with monotonic time.
>     >
>     >
>     > 2. Virtual Hpet dependencies
>     >
>     > The virtual hpet depends on the ability to read the physical or
>     simulated
>     > (see discussion below) hpet.  For timekeeping, the virtual hpet
>     also depends
>     > on two new interrupt notification facilities to implement its
>     policies for
>     > interrupt delivery.
>     >
>     > 2.1. Two modes of low-level hpet main counter reads.
>     >
>     > In this implementation, the virtual hpet reads with
>     read_64_main_counter(),
>     > exported by
>     > time.c, either the real physical hpet main counter register
>     directly or a
>     > "simulated"
>     > hpet main counter.
>     >
>     > The simulated mode uses a monotonic version of get_s_time()
>     (NOW()), where the
>     > last
>     > time value is returned whenever the current time value is less
>     than the last
>     > time
>     > value. In simulated mode, since it is layered on s_time, the
>     underlying
>     > hardware
>     > can be hpet or some other device. The frequency of the main
>     counter in
>     > simulated
>     > mode is the same as the standard physical hpet frequency,
>     allowing live
>     > migration
>     > between nodes that are configured differently.
>     >
>     > If the physical platform does not have an hpet device, or if xen
>     is configured
>     > not
>     > to use the device, then the simulated method is used. If there
>     is a physical
>     > hpet device,
>     > and xen has initialized it, then either simulated or physical
>     mode can be
>     > used.
>     > This is governed by a boot time option, hpet-avoid. Setting this
>     option to 1
>     > gives the
>     > simulated mode and 0 the physical mode. The default is physical
>     mode.
>     >
>     > A disadvantage of the physical mode is that it may take longer to
>     read the device
>     > than in simulated mode. On some platforms the cost is about the
>     same (less
>     > than 250 nsec) for
>     > physical and simulated modes, while on others physical cost is
>     much higher
>     > than simulated.
>     > A disadvantage of the simulated mode is that it can return the
>     same value
>     > for the counter in consecutive calls.
>     >
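
A minimal sketch of the clamping that the simulated mode is described
as doing, assuming a nanosecond time base. The function name and the
single global are hypothetical, and locking is left out. This is not
the code from the patch.

```c
#include <stdint.h>

/* Last value handed out. A real implementation would need a lock or an
 * atomic compare-and-swap, which this sketch omits. */
static int64_t last_time_ns;

/*
 * Clamp a raw system-time sample so that the returned sequence never
 * goes backwards: a sample that is behind the last value returned is
 * replaced by that last value.
 */
int64_t monotonic_time_ns(int64_t raw_now_ns)
{
    if (raw_now_ns < last_time_ns)
        return last_time_ns;
    last_time_ns = raw_now_ns;
    return raw_now_ns;
}
```

The clamp is also where the disadvantage mentioned above comes from:
two consecutive calls can return the same value.
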
>     > 2.2. Interrupt notification facilities.
>     >
>     > Two interrupt notification facilities are introduced, one is
>     > hvm_isa_irq_assert_cb()
>     > and the other hvm_register_intr_en_notif().
>     >
>     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
>     the vioapic.
>     > hvm_isa_irq_assert_cb allows a callback to be passed along to
>     > vioapic_deliver()
>     > and this callback is called with a mask of the vcpus which will
>     get the
>     > interrupt. This callback is made before any vcpus receive an
>     interrupt.
>     >
>     > Vhpet uses hvm_register_intr_en_notif() to register a handler
>     for a particular
>     > vector that will be called when that vector is injected in
>     > [vmx,svm]_intr_assist()
>     > and also when the guest finishes handling the interrupt. Here
>     finished is
>     > defined
>     > as the point when the guest re-enables interrupts or lowers the
>     tpr value.
>     > EOI is not used as the end of interrupt as this is sometimes
>     returned before
>     > the interrupt handler has done its work. A flag is passed to the
>     handler
>     > indicating
>     > whether this is the injection point (post = 1) or the interrupt
>     finished (post
>     > = 0) point.
>     > The need for the finished point callback is discussed in the
>     missed ticks
>     > policy section.
>     >
>     > To prevent a possible early trigger of the finished callback,
>     intr_en_notif
>     > logic
>     > has a two stage arm, the first at injection
>     (hvm_intr_en_notif_arm()) and the
>     > second when
>     > interrupts are seen to be disabled (hvm_intr_en_notif_disarm()).
>     Once fully
>     > armed, re-enabling
>     > interrupts will cause hvm_intr_en_notif_disarm() to make the end
>     of interrupt
>     > callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm()
>     are called by
>     > [vmx,svm]_intr_assist().
>     >
>     > 3. Interrupt delivery policies
>     >
>     > The existing hpet interrupt delivery is preserved. This includes
>     > vcpu round robin delivery used by Linux and broadcast delivery
>     used by
>     > Windows.
>     >
>     > There are two policies for interrupt delivery, one for Windows
>     2k8-64 and the
>     > other
>     > for Linux. The Linux policy takes advantage of the (guest) Linux
>     missed tick
>     > and offset
>     > calculations and does not attempt to deliver the right number of
>     interrupts.
>     > The Windows policy delivers the correct number of interrupts,
>     even if
>     > sometimes much
>     > closer to each other than the period. The policies are similar
>     to those in
>     > vpt.c, though
>     > there are some important differences.
>     >
>     > Policies are selected with an HVMOP_set_param hypercall with index
>     > HVM_PARAM_TIMER_MODE.
>     > Two new values are added, HVM_HPET_guest_computes_missed_ticks and
>     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
>     two new ones
>     > are added is that
>     > in some guests (32bit Linux) a no-missed policy is needed for
>     clock sources
>     > other than hpet
>     > and a missed ticks policy for hpet. It was felt that there would
>     be less
>     > confusion by simply
>     > introducing the two hpet policies.
>     >
>     > 3.1. The missed ticks policy
>     >
>     > The Linux clock interrupt handler for hpet calculates missed
>     ticks and offset
>     > using the hpet
>     > main counter. The algorithm works well when the time since the
>     last interrupt
>     > is greater than
>     > or equal to a period and poorly otherwise.
>     >
>     > The missed ticks policy ensures that no two clock interrupts are
>     delivered to
>     > the guest at
>     > a time interval less than a period. A time stamp (hpet main
>     counter value) is
>     > recorded (by a
>     > callback registered with hvm_register_intr_en_notif) when Linux
>     finishes
>     > handling the clock
>     > interrupt. Then, ensuing interrupts are delivered to the vioapic
>     only if the
>     > current main
>     > counter value is a period greater than when the last interrupt
>     was handled.
>     >
>     > Tests showed a significant improvement in clock drift with end
>     of interrupt
>     > time stamps
>     > versus beginning of interrupt[1]. It is believed that the reason
>     for the
>     > improvement
>     > is that the clock interrupt handler goes for a spinlock and can
>     be therefore
>     > delayed in its
>     > processing. Furthermore, the main counter is read by the guest
>     under the lock.
>     > The net
>     > effect is that if we time stamp injection, we can get the
>     difference in time
>     > between successive interrupt handler lock acquisitions to be
>     less than the
>     > period.
>     >
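
The gating rule of the missed ticks policy can be sketched roughly as
follows. The structure and function names are hypothetical stand-ins,
and only the rule stated above is assumed (deliver only once a full
period has passed since the guest finished the previous tick).

```c
#include <stdint.h>

/* Hypothetical per-timer state; not the structure used by the patch. */
struct tick_gate {
    uint64_t period;          /* timer period in main counter ticks */
    uint64_t last_handled_mc; /* main counter when the last tick was finished */
};

/* Called from the "interrupt finished" callback: stamp the end of handling. */
void tick_gate_finished(struct tick_gate *g, uint64_t mc_now)
{
    g->last_handled_mc = mc_now;
}

/* Nonzero if a new clock interrupt may be delivered at main counter mc_now. */
int tick_gate_may_deliver(const struct tick_gate *g, uint64_t mc_now)
{
    /* Unsigned subtraction stays correct across counter wrap-around. */
    return (mc_now - g->last_handled_mc) >= g->period;
}
```

Stamping in tick_gate_finished() rather than at injection corresponds
to the end of interrupt time stamp discussed above.
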
>     > 3.2. The no-missed ticks policy
>     >
>     > Windows 2k864 keeps very poor time with the missed ticks policy.
>     So the
>     > no-missed ticks policy
>     > was developed. In the no-missed ticks policy we deliver the
>     correct number of
>     > interrupts,
>     > even if they are spaced less than a period apart (when catching up).
>     >
>     > Windows 2k864 uses a broadcast mode in the interrupt routing
>     such that
>     > all vcpus get the clock interrupt. The best Windows drift
>     performance was
>     > achieved when the
>     > policy code ensured that all the previous interrupts (on the
>     various vcpus)
>     > had been injected
>     > before injecting the next interrupt to the vioapic.
>     >
>     > The policy code works as follows. It uses the
>     hvm_isa_irq_assert_cb() to
>     > record
>     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
>     the callback
>     > registered
>     > with hvm_register_intr_en_notif() at post=1 time it clears the
>     current vcpu in
>     > the pending_mask.
>     > When the pending_mask is clear it decrements
>     hpet.intr_pending_nr and if
>     > intr_pending_nr is still
>     > non-zero posts another interrupt to the ioapic with
>     hvm_isa_irq_assert_cb().
>     > Intr_pending_nr is incremented in
>     hpet_route_decision_not_missed_ticks().
>     >
>     > The missed ticks policy intr_en_notif callback also uses the
>     pending_mask
>     > method. So even though
>     > Linux does not broadcast its interrupts, the code could handle
>     it if it did.
>     > In this case the end of interrupt time stamp is made when the
>     pending_mask is
>     > clear.
>     >
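
The pending_mask and intr_pending_nr bookkeeping described in this
section might look roughly like the sketch below. The structure and
function names are hypothetical, at most 64 vcpus are assumed, and the
real code lives in the patch's hpet.c.

```c
#include <stdint.h>

/* Hypothetical stand-in for the state kept in h->hpet. */
struct catchup_state {
    uint64_t pending_mask;     /* vcpus still owed the current interrupt */
    unsigned intr_pending_nr;  /* interrupts owed to the guest */
};

/* A timer period elapsed: one more interrupt is owed. */
void tick_elapsed(struct catchup_state *s)
{
    s->intr_pending_nr++;
}

/* Delivery callback: remember which vcpus will receive the interrupt. */
void on_assert(struct catchup_state *s, uint64_t vcpu_mask)
{
    s->pending_mask = vcpu_mask;
}

/*
 * Injection (post = 1) callback for one vcpu.
 * Returns nonzero when the caller should post the next interrupt.
 */
int on_injected(struct catchup_state *s, unsigned vcpu)
{
    s->pending_mask &= ~(UINT64_C(1) << vcpu);
    if (s->pending_mask != 0)
        return 0;               /* other vcpus are still outstanding */
    if (s->intr_pending_nr > 0)
        s->intr_pending_nr--;
    return s->intr_pending_nr != 0;
}
```

With two ticks owed and a broadcast to vcpus 0 and 1, the next
interrupt is posted only after both vcpus have seen the injection.
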
>     > 4. Live Migration
>     >
>     > Live migration with hpet preserves the current offset of the
>     guest clock with
>     > respect
>     > to ntp. This is accomplished by migrating all of the state in
>     the h->hpet data
>     > structure
>     > in the usual way. The hp->mc_offset is recalculated on the
>     receiving node so
>     > that the
>     > guest sees a continuous hpet main counter.
>     >
>     > Code has been added to xc_domain_save.c to send a small message
>     after the
>     > domain context is sent. The contents of the message is the
>     physical tsc
>     > timestamp, last_tsc,
>     > read just before the message is sent. When the last_tsc message
>     is received in
>     > xc_domain_restore.c,
>     > another physical tsc timestamp, cur_tsc, is read. The two
>     timestamps are
>     > loaded into the domain
>     > structure as last_tsc_sender and first_tsc_receiver with
>     hypercalls. Then
>     > xc_domain_hvm_setcontext
>     > is called so that hpet_load has access to these time stamps.
>     Hpet_load uses
>     > the timestamps
>     > to account for the time spent saving and loading the domain
>     context. With this
>     > technique,
>     > the only neglected time is the time spent sending a small
>     network message.
>     >
>     > 5. Test Results
>     >
>     > Some recent test results are:
>     >
>     > 5.1 Linux 4u664 and Windows 2k864 load test.
>     >       Duration: 70 hours.
>     >       Test date: 6/2/08
>     >       Loads: usex -b48 on Linux; burn-in on Windows
>     >       Guest vcpus: 8 for Linux; 2 for Windows
>     >       Hardware: 8 physical cpu AMD
>     >       Clock drift : Linux: .0012% Windows: .009%
>     >
>     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
>     >       Duration: 23 hours.
>     >       Test date: 6/3/08
>     >       Loads: none
>     >       Guest vcpus: 8 for each Linux; 2 for Windows
>     >       Hardware: 4 physical cpu AMD
>     >       Clock drift : Linux: .033% Windows: .019%
>     >
>     > 6. Relation to recent work in xen-unstable
>     >
>     > There is a similarity between hvm_get_guest_time() in
>     xen-unstable and
>     > read_64_main_counter()
>     > in this code. However, read_64_main_counter() is more tuned to
>     the needs of
>     > hpet.c. It has no
>     > "set" operation, only the get. It isolates the mode, physical or
>     simulated, in
>     > read_64_main_counter()
>     > itself. It uses no vcpu or domain state as it is a physical
>     entity, in either
>     > mode. And it provides a real
>     > physical mode for every read for those applications that desire
>     this.
>     >
>     > 7. Conclusion
>     >
>     > The virtual hpet is improved by this patch in terms of accuracy and
>     > monotonicity.
>     > Tests performed to date verify this and more testing is under way.
>     >
>     > 8. Future Work
>     >
>     > Testing with Windows Vista will be performed soon. The reason
>     for accuracy
>     > variations
>     > on different platforms using the physical hpet device will be
>     investigated.
>     > Additional overhead measurements on simulated vs physical hpet
>     mode will be
>     > made.
>     >
>     > Footnotes:
>     >
>     > 1. I don't recall the accuracy improvement with end of interrupt
>     stamping, but
>     > it was
>     > significant, perhaps better than two to one improvement. It
>     would be a very
>     > simple matter
>     > to re-measure the improvement as the facility can call back at
>     injection time
>     > as well.
>     >
>     >
>     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
>     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
>     >
>     >
>     > _______________________________________________
>     > Xen-devel mailing list
>     > Xen-devel@lists.xensource.com
>     > http://lists.xensource.com/xen-devel
>
>
>


* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 19:33       ` Dave Winchell
@ 2008-06-06 20:29         ` Dan Magenheimer
  2008-06-06 22:34           ` Keir Fraser
  2008-06-08 20:32           ` Dave Winchell
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-06 20:29 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser; +Cc: xen-devel, Ben Guthro

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack".  Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.

I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%?  Also,
for the skew you are seeing (in both hvm guests and domain0),
I wonder whether time is moving too fast or too slow.
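
One way to answer the direction question is to sample the guest clock
and a trusted reference clock at the start and end of a run and keep
the sign of the difference. A minimal sketch, with a hypothetical
function name and all four samples assumed to be in the same unit:

```c
#include <stdint.h>

/*
 * Signed drift of a guest clock against a reference clock, in ppm.
 * Positive: the guest clock runs fast. Negative: it runs slow.
 * All four arguments are in the same unit (for example nanoseconds),
 * sampled in pairs at the start and at the end of the test.
 */
double signed_drift_ppm(int64_t ref_start, int64_t ref_end,
                        int64_t guest_start, int64_t guest_end)
{
    int64_t ref_elapsed = ref_end - ref_start;
    int64_t guest_elapsed = guest_end - guest_start;

    if (ref_elapsed <= 0)
        return 0.0;  /* no usable interval */

    return (double)(guest_elapsed - ref_elapsed) * 1e6 / (double)ref_elapsed;
}
```

A guest that counts 999000 units while the reference counts 1000000
comes out at -1000 ppm, that is, losing time at the .1% rate.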

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others.  So I think we will still need to track down
the poor accuracy when hwhpet=0.  And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.

One more thought... do you know the accuracy of the TSC crystals
on your test systems?  I posted a patch a while ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code?  Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> Dan, Keir:
> 
> Preliminary tests results indicate an error of .1% for Linux 64 bit
> guests configured
> for hpet with xen-unstable as is. As we have discussed many times, the
> ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors of
> .0012%
> on this platform under similar test conditions and .03% on 
> other platforms.
> 
> Windows vista64 has an error of 11% using hpet with the 
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista 
> error was .008%.
> 
> The tests are with two or three guests on a physical node, all under
> load, and with
> the ratio of vcpus to phys cpus > 1.
> 
> I will continue to run tests over the next few days.
> 
> thanks,
> Dave
> 
> 
> Dan Magenheimer wrote:
> 
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), 
> please ensure
> > that hpet=1 is set in the hvm config and also I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to 
> missed ticks
> > so timer_mode should be 2 (vs when pit is the clocksource 
> on RHEL4-32,
> > all clock ticks must be delivered and so timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's
> > my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com
> >     [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave
> >     Winchell
> >     Sent: Friday, June 06, 2008 4:46 AM
> >     To: Keir Fraser; Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com; Dave Winchell
> >     Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com on behalf 
> of Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather
> >     than host TSC? Dan has reported much better 
> time-keeping with his
> >     patch
> >     checked in, and it's for sure a lot less invasive than
> this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in 
> terms of drift and
> >     > monotonicity.
> >     > Prior to this work the drift with hpet was greater 
> than 2%, far
> >     above the .05%
> >     > limit
> >     > for ntp to synchronize. With this code, the drift ranges from
> >     .001% to .0033%
> >     > depending
> >     > on guest and physical platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic
> >     time to their
> >     > applications. Time sources other than hpet are not 
> monotonic because
> >     > of their reliance on tsc, which is not synchronized 
> across physical
> >     > processors.
> >     >
> >     > Windows 2k864 and many Linux guests are supported with two
> >     policies, one for
> >     > guests
> >     > that handle missed clock interrupts and the other for guests
> >     that require the
> >     > correct number of interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical
> >     platform has no
> >     > visible
> >     > hpet. Migration is supported between physical machines which
> >     differ in
> >     > physical
> >     > hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are
> >     added to track
> >     > interrupt
> >     > progress. The ideas here and the facilities would be useful in
> >     vpt.c, for
> >     > other time
> >     > sources, though no attempt is made here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt
> >     delivery policies,
> >     > live migration,
> >     > test results, and relation to recent work with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the 
> physical or
> >     simulated
> >     > (see discussion below) hpet.  For timekeeping, the 
> virtual hpet
> >     also depends
> >     > on two new interrupt notification facilities to implement its
> >     policies for
> >     > interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet reads with
> >     read_64_main_counter(),
> >     > exported by
> >     > time.c, either the real physical hpet main counter register
> >     directly or a
> >     > "simulated"
> >     > hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time()
> >     (NOW()), where the
> >     > last
> >     > time value is returned whenever the current time value is less
> >     than the last
> >     > time
> >     > value. In simulated mode, since it is layered on s_time, the
> >     underlying
> >     > hardware
> >     > can be hpet or some other device. The frequency of the main
> >     counter in
> >     > simulated
> >     > mode is the same as the standard physical hpet frequency,
> >     allowing live
> >     > migration
> >     > between nodes that are configured differently.
> >     >
> >     > If the physical platform does not have an hpet 
> device, or if xen
> >     is configured
> >     > not
> >     > to use the device, then the simulated method is used. If there
> >     is a physical
> >     > hpet device,
> >     > and xen has initialized it, then either simulated or physical
> >     mode can be
> >     > used.
> >     > This is governed by a boot time option, hpet-avoid. 
> Setting this
> >     option to 1
> >     > gives the
> >     > simulated mode and 0 the physical mode. The default 
> is physical
> >     mode.
> >     >
> >     > A disadvantage of the physical mode is that it may take longer to
> >     read the device
> >     > than in simulated mode. On some platforms the cost is 
> about the
> >     same (less
> >     > than 250 nsec) for
> >     > physical and simulated modes, while on others physical cost is
> >     much higher
> >     > than simulated.
> >     > A disadvantage of the simulated mode is that it can return the
> >     same value
> >     > for the counter in consecutive calls.
> >     >
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced, one is
> >     > hvm_isa_irq_assert_cb()
> >     > and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >     the vioapic.
> >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
> >     > vioapic_deliver()
> >     > and this callback is called with a mask of the vcpus 
> which will
> >     get the
> >     > interrupt. This callback is made before any vcpus receive an
> >     interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >     for a particular
> >     > vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist()
> >     > and also when the guest finishes handling the interrupt. Here
> >     finished is
> >     > defined
> >     > as the point when the guest re-enables interrupts or 
> lowers the
> >     tpr value.
> >     > EOI is not used as the end of interrupt as this is sometimes
> >     returned before
> >     > the interrupt handler has done its work. A flag is 
> passed to the
> >     handler
> >     > indicating
> >     > whether this is the injection point (post = 1) or the 
> interrupt
> >     finished (post
> >     > = 0) point.
> >     > The need for the finished point callback is discussed in the
> >     missed ticks
> >     > policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     intr_en_notif
> >     > logic
> >     > has a two stage arm, the first at injection
> >     (hvm_intr_en_notif_arm()) and the
> >     > second when
> >     > interrupts are seen to be disabled 
> (hvm_intr_en_notif_disarm()).
> >     Once fully
> >     > armed, re-enabling
> >     > interrupts will cause hvm_intr_en_notif_disarm() to 
> make the end
> >     of interrupt
> >     > callback. hvm_intr_en_notif_arm() and 
> hvm_intr_en_notif_disarm()
> >     are called by
> >     > [vmx,svm]_intr_assist().
> >     >
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved. 
> This includes
> >     > vcpu round robin delivery used by Linux and broadcast delivery
> >     used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows
> >     2k8-64 and the
> >     > other
> >     > for Linux. The Linux policy takes advantage of the 
> (guest) Linux
> >     missed tick
> >     > and offset
> >     > calculations and does not attempt to deliver the 
> right number of
> >     interrupts.
> >     > The Windows policy delivers the correct number of interrupts,
> >     even if
> >     > sometimes much
> >     > closer to each other than the period. The policies are similar
> >     to those in
> >     > vpt.c, though
> >     > there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param 
> hypercall with index
> >     > HVM_PARAM_TIMER_MODE.
> >     > Two new values are added, 
> HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
> >     two new ones
> >     > are added is that
> >     > in some guests (32bit Linux) a no-missed policy is needed for
> >     clock sources
> >     > other than hpet
> >     > and a missed ticks policy for hpet. It was felt that 
> there would
> >     be less
> >     > confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed
> >     ticks and offset
> >     > using the hpet
> >     > main counter. The algorithm works well when the time since the
> >     last interrupt
> >     > is greater than
> >     > or equal to a period and poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock 
> interrupts are
> >     delivered to
> >     > the guest at
> >     > a time interval less than a period. A time stamp (hpet main
> >     counter value) is
> >     > recorded (by a
> >     > callback registered with hvm_register_intr_en_notif) 
> when Linux
> >     finishes
> >     > handling the clock
> >     > interrupt. Then, ensuing interrupts are delivered to 
> the vioapic
> >     only if the
> >     > current main
> >     > counter value is a period greater than when the last interrupt
> >     was handled.
> >     >
> >     > Tests showed a significant improvement in clock drift with end
> >     of interrupt
> >     > time stamps
> >     > versus beginning of interrupt[1]. It is believed that 
> the reason
> >     for the
> >     > improvement
> >     > is that the clock interrupt handler goes for a 
> spinlock and can
> >     be therefore
> >     > delayed in its
> >     > processing. Furthermore, the main counter is read by the guest
> >     under the lock.
> >     > The net
> >     > effect is that if we time stamp injection, we can get the
> >     difference in time
> >     > between successive interrupt handler lock acquisitions to be
> >     less than the
> >     > period.
> >     >
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k864 keeps very poor time with the missed 
> ticks policy.
> >     So the
> >     > no-missed ticks policy
> >     > was developed. In the no-missed ticks policy we deliver the
> >     correct number of
> >     > interrupts,
> >     > even if they are spaced less than a period apart 
> (when catching up).
> >     >
> >     > Windows 2k864 uses a broadcast mode in the interrupt routing
> >     such that
> >     > all vcpus get the clock interrupt. The best Windows drift
> >     performance was
> >     > achieved when the
> >     > policy code ensured that all the previous interrupts (on the
> >     various vcpus)
> >     > had been injected
> >     > before injecting the next interrupt to the vioapic.
> >     >
> >     > The policy code works as follows. It uses the
> >     hvm_isa_irq_assert_cb() to
> >     > record
> >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> >     the callback
> >     > registered
> >     > with hvm_register_intr_en_notif() at post=1 time it clears the
> >     current vcpu in
> >     > the pending_mask.
> >     > When the pending_mask is clear it decrements
> >     hpet.intr_pending_nr and if
> >     > intr_pending_nr is still
> >     > non-zero posts another interrupt to the ioapic with
> >     hvm_isa_irq_assert_cb().
> >     > Intr_pending_nr is incremented in
> >     hpet_route_decision_not_missed_ticks().
> >     >
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     pending_mask
> >     > method. So even though
> >     > Linux does not broadcast its interrupts, the code could handle
> >     it if it did.
> >     > In this case the end of interrupt time stamp is made when the
> >     pending_mask is
> >     > clear.
> >     >
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the
> >     guest clock with
> >     > respect
> >     > to ntp. This is accomplished by migrating all of the state in
> >     the h->hpet data
> >     > structure
> >     > in the usual way. The hp->mc_offset is recalculated on the
> >     receiving node so
> >     > that the
> >     > guest sees a continuous hpet main counter.
> >     >
> >     > Code has been added to xc_domain_save.c to send a small message
> >     after the
> >     > domain context is sent. The content of the message is the
> >     physical tsc
> >     > timestamp, last_tsc,
> >     > read just before the message is sent. When the 
> last_tsc message
> >     is received in
> >     > xc_domain_restore.c,
> >     > another physical tsc timestamp, cur_tsc, is read. The two
> >     timestamps are
> >     > loaded into the domain
> >     > structure as last_tsc_sender and first_tsc_receiver with
> >     hypercalls. Then
> >     > xc_domain_hvm_setcontext
> >     > is called so that hpet_load has access to these time stamps.
> >     Hpet_load uses
> >     > the timestamps
> >     > to account for the time spent saving and loading the domain
> >     context. With this
> >     > technique,
> >     > the only neglected time is the time spent sending a small
> >     network message.
> >     >
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> >     >       Duration: 70 hours.
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift : Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> >     >       Duration: 23 hours.
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift : Linux: .033% Windows: .019%
> >     >
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in
> >     xen-unstable and
> >     > read_64_main_counter()
> >     > in this code. However, read_64_main_counter() is more tuned to
> >     the needs of
> >     > hpet.c. It has no
> >     > "set" operation, only the get. It isolates the mode, 
> physical or
> >     simulated, in
> >     > read_64_main_counter()
> >     > itself. It uses no vcpu or domain state as it is a physical
> >     entity, in either
> >     > mode. And it provides a real
> >     > physical mode for every read for those applications 
> that desire
> >     this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms 
> of accuracy and
> >     > monotonicity.
> >     > Tests performed to date verify this and more testing 
> is under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason
> >     for accuracy
> >     > variations
> >     > on different platforms using the physical hpet device will be
> >     investigated.
> >     > Additional overhead measurements on simulated vs physical hpet
> >     mode will be
> >     > made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end 
> of interrupt
> >     stamping, but
> >     > it was
> >     > significant, perhaps better than two to one improvement. It
> >     would be a very
> >     > simple matter
> >     > to re-measure the improvement as the facility can call back at
> >     injection time
> >     > as well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> >     > <mailto:dwinchell@virtualiron.com>
> >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> >     > <mailto:bguthro@virtualiron.com>
> >     >
> >     >
> >     > _______________________________________________
> >     > Xen-devel mailing list
> >     > Xen-devel@lists.xensource.com
> >     > http://lists.xensource.com/xen-devel
> >
> >
> >
> 
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 20:29         ` Dan Magenheimer
@ 2008-06-06 22:34           ` Keir Fraser
  2008-06-07 21:20             ` Dave Winchell
  2008-06-08 20:32           ` Dave Winchell
  1 sibling, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-06 22:34 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Dave Winchell; +Cc: xen-devel, Ben Guthro

On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.  And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time()
call to the platform time read function, and bypass TSC altogether. This
would be cleaner than having only the vHPET code punch through to the
physical HPET: instead we have the boot-time chosen platform timesource used
by all virtual timers.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...

Possibly there are bugs in the hpet device model which are fixed by Dave's
patch. If this is actually the case, it would be nice to break those out as
separate patches, as I think an 11% drift must largely be due to
device-model bugs rather than relatively insignificant differences between
hvm_get_guest_time() and physical HPET main counter.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 22:34           ` Keir Fraser
@ 2008-06-07 21:20             ` Dave Winchell
  2008-06-09 21:07               ` Dan Magenheimer
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-07 21:20 UTC (permalink / raw)
  To: Keir Fraser, dan.magenheimer; +Cc: Dave Winchell, xen-devel, Ben Guthro

> Possibly there are bugs in the hpet device model which are fixed by Dave's
> patch. If this is actually the case, it would be nice to break those out as
> separate patches, as I think an 11% drift must largely be due to
> device-model bugs rather than relatively insignificant differences between
> hvm_get_guest_time() and physical HPET main counter.

Hi Keir,

I tried an experiment on Friday where I short-circuited the missed ticks policy
code in the hpet.c patch, but used the physical hpet on each access. The result for Linux
was a drift of .1%, the same as with the xen-unstable bits.

Conversely, I get very good drift numbers, i.e., under .03%, when using the missed ticks
policy code and running in simulated mode (layered on stime) when stime uses hpet.

So clearly, the improvement from .1% to .03% is due to the policy code.
I haven't run the short-circuit test with the Windows policy, but I can do that
on Monday.

Note: For Windows and Linux I get < .03% drift using the policy code and running
simulated mode whether stime is using hpet or some other device.
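
As a rough illustration of the simulated mode mentioned above, where the last value is returned whenever the current reading is lower than the last one: the sketch below is not the code from the patch. `struct mono_clock` and `monotonic_read()` are hypothetical names, and the raw reading is passed in as a parameter instead of coming from get_s_time().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: the largest value handed out so far. */
struct mono_clock {
    uint64_t last;
};

/*
 * Return a value that never goes backwards.  raw_now stands in for a
 * get_s_time()-style reading; if it is lower than the last value
 * returned, the last value is returned again.
 */
uint64_t monotonic_read(struct mono_clock *c, uint64_t raw_now)
{
    assert(c != NULL);
    if (raw_now < c->last)
        return c->last;   /* reading stepped back: repeat the last value */
    c->last = raw_now;
    return raw_now;
}
```

This also shows why simulated mode can return the same counter value on consecutive calls. A real implementation would additionally have to serialize updates to the shared last value across physical cpus, which the sketch leaves out.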

regards,
Dave


^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-06 20:29         ` Dan Magenheimer
  2008-06-06 22:34           ` Keir Fraser
@ 2008-06-08 20:32           ` Dave Winchell
  2008-06-08 21:10             ` Dan Magenheimer
                               ` (2 more replies)
  1 sibling, 3 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-08 20:32 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser; +Cc: Dave Winchell, xen-devel, Ben Guthro


Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack".  Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The Windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.
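
For readers following along, the two delivery policies from the write-up can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the patch: `struct tick_state` and all four function names are hypothetical, and the per-vcpu pending_mask bookkeeping is reduced to a single "previous interrupt injected everywhere" call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state; field names are illustrative only. */
struct tick_state {
    uint64_t period;         /* timer period, in main counter ticks */
    uint64_t last_handled;   /* counter value when the guest finished the last tick */
    unsigned int pending_nr; /* interrupts still owed (no-missed ticks policy) */
};

/*
 * Missed ticks policy (Linux): deliver only if at least one full period
 * has passed since the guest finished handling the previous interrupt.
 */
bool missed_ticks_should_deliver(const struct tick_state *s, uint64_t now)
{
    assert(s->period > 0);
    return now >= s->last_handled && now - s->last_handled >= s->period;
}

/* Record the end-of-interrupt time stamp when the guest finishes its handler. */
void missed_ticks_on_finish(struct tick_state *s, uint64_t now)
{
    s->last_handled = now;
}

/* No-missed ticks policy (Windows): each elapsed period adds one owed interrupt. */
void no_missed_on_period_elapsed(struct tick_state *s)
{
    s->pending_nr++;
}

/*
 * Called once the previous interrupt has been injected on every target
 * vcpu.  Returns true if another interrupt should be posted immediately
 * to catch up.
 */
bool no_missed_on_injected(struct tick_state *s)
{
    if (s->pending_nr > 0)
        s->pending_nr--;
    return s->pending_nr > 0;
}
```

The key difference is what each policy protects: the first never lets two interrupts arrive less than a period apart and leaves the catch-up arithmetic to the guest, while the second always delivers the full count even when catching up means interrupts closer together than a period.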

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%?  Also
> I wonder for the skew you are seeing (in both hvm guests and
> domain0) is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.
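
To put the percentages in this thread on a common scale, the small sketch below converts a drift figure to parts per million and to time gained or lost per day. The function names are hypothetical and the input unit (thousandths of a percent) is chosen only to keep the arithmetic in integers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert drift given in thousandths of a percent to parts per million.
 * 1% is 10,000 PPM, so 0.001% is 10 PPM.
 */
uint64_t drift_ppm(uint64_t milli_percent)
{
    return milli_percent * 10;
}

/*
 * Milliseconds gained or lost per day at a given drift in PPM.
 * One day is 86,400,000 ms, so each PPM is 86.4 ms per day.
 */
uint64_t drift_ms_per_day(uint64_t ppm)
{
    assert(ppm <= 1000000);   /* more than 100% drift is outside the intended range */
    return ppm * 86400000ULL / 1000000ULL;
}
```

With these, .1% is 1000 PPM (the figure Dan quotes), or about 86 seconds per day; the .05% ntp limit is 500 PPM, or about 43 seconds per day; and .03% is 300 PPM, or about 26 seconds per day.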

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems?  I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...


Regards,
Dave


-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
 
Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack".  Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.


I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%?  Also
I wonder for the skew you are seeing (in both hvm guests and
domain0) is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others.  So I think we will still need to track down
the poor accuracy when hwhpet=0.  And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.

One more thought... do you know the accuracy of the TSC crystals
on your test systems?  I posted a patch awhile ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code?  Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> Dan, Keir:
> 
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors of
> .0012% on this platform under similar test conditions and .03% on
> other platforms.
> 
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista
> error was .008%.
> 
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
> 
> I will continue to run tests over the next few days.
> 
> thanks,
> Dave
> 
> 
> Dan Magenheimer wrote:
> 
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please ensure
> > that hpet=1 is set in the hvm config and also I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
> > so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> > all clock ticks must be delivered and so timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com
> >     [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave
> >     Winchell
> >     Sent: Friday, June 06, 2008 4:46 AM
> >     To: Keir Fraser; Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com; Dave Winchell
> >     Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather than host TSC? Dan has reported much better
> >     time-keeping with his patch checked in, and it's for sure a lot
> >     less invasive than this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in 
> terms of drift and
> >     > monotonicity.
> >     > Prior to this work the drift with hpet was greater 
> than 2%, far
> >     above the .05%
> >     > limit
> >     > for ntp to synchronize. With this code, the drift ranges from
> >     .001% to .0033%
> >     > depending
> >     > on guest and physical platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic
> >     time to their
> >     > applications. Time sources other than hpet are not 
> monotonic because
> >     > of their reliance on tsc, which is not synchronized 
> across physical
> >     > processors.
> >     >
> >     > Windows 2k864 and many Linux guests are supported with two
> >     policies, one for
> >     > guests
> >     > that handle missed clock interrupts and the other for guests
> >     that require the
> >     > correct number of interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical
> >     platform has no
> >     > visible
> >     > hpet. Migration is supported between physical machines which
> >     differ in
> >     > physical
> >     > hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are
> >     added to track
> >     > interrupt
> >     > progress. The ideas here and the facilities would be useful in
> >     vpt.c, for
> >     > other time
> >     > sources, though no attempt is made here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt
> >     delivery policies,
> >     > live migration,
> >     > test results, and relation to recent work with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the 
> physical or
> >     simulated
> >     > (see discussion below) hpet.  For timekeeping, the 
> virtual hpet
> >     also depends
> >     > on two new interrupt notification facilities to implement its
> >     policies for
> >     > interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet reads with
> >     read_64_main_counter(),
> >     > exported by
> >     > time.c, either the real physical hpet main counter register
> >     directly or a
> >     > "simulated"
> >     > hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time()
> >     (NOW()), where the
> >     > last
> >     > time value is returned whenever the current time value is less
> >     than the last
> >     > time
> >     > value. In simulated mode, since it is layered on s_time, the
> >     underlying
> >     > hardware
> >     > can be hpet or some other device. The frequency of the main
> >     counter in
> >     > simulated
> >     > mode is the same as the standard physical hpet frequency,
> >     allowing live
> >     > migration
> >     > between nodes that are configured differently.
> >     >
> >     > If the physical platform does not have an hpet 
> device, or if xen
> >     is configured
> >     > not
> >     > to use the device, then the simulated method is used. If there
> >     is a physical
> >     > hpet device,
> >     > and xen has initialized it, then either simulated or physical
> >     mode can be
> >     > used.
> >     > This is governed by a boot time option, hpet-avoid. 
> Setting this
> >     option to 1
> >     > gives the
> >     > simulated mode and 0 the physical mode. The default 
> is physical
> >     mode.
> >     >
> >     > A disadvantage of the physical mode is that it may take longer to
> >     read the device
> >     > than in simulated mode. On some platforms the cost is 
> about the
> >     same (less
> >     > than 250 nsec) for
> >     > physical and simulated modes, while on others physical cost is
> >     much higher
> >     > than simulated.
> >     > A disadvantage of the simulated mode is that it can return the
> >     same value
> >     > for the counter in consecutive calls.
> >     >
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced, one is
> >     > hvm_isa_irq_assert_cb()
> >     > and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >     the vioapic.
> >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
> >     > vioapic_deliver()
> >     > and this callback is called with a mask of the vcpus 
> which will
> >     get the
> >     > interrupt. This callback is made before any vcpus receive an
> >     interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >     for a particular
> >     > vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist()
> >     > and also when the guest finishes handling the interrupt. Here
> >     finished is
> >     > defined
> >     > as the point when the guest re-enables interrupts or 
> lowers the
> >     tpr value.
> >     > EOI is not used as the end of interrupt as this is sometimes
> >     returned before
> >     > the interrupt handler has done its work. A flag is 
> passed to the
> >     handler
> >     > indicating
> >     > whether this is the injection point (post = 1) or the 
> interrupt
> >     finished (post
> >     > = 0) point.
> >     > The need for the finished point callback is discussed in the
> >     missed ticks
> >     > policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     intr_en_notif
> >     > logic
> >     > has a two stage arm, the first at injection
> >     (hvm_intr_en_notif_arm()) and the
> >     > second when
> >     > interrupts are seen to be disabled 
> (hvm_intr_en_notif_disarm()).
> >     Once fully
> >     > armed, re-enabling
> >     > interrupts will cause hvm_intr_en_notif_disarm() to 
> make the end
> >     of interrupt
> >     > callback. hvm_intr_en_notif_arm() and 
> hvm_intr_en_notif_disarm()
> >     are called by
> >     > [vmx,svm]_intr_assist().
> >     >
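The two stage arm described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the names notif_arm() and notif_check() and the struct layout are hypothetical and are not the patch's hvm_intr_en_notif_arm()/hvm_intr_en_notif_disarm() code. It shows why an interrupt-enabled observation made right after injection does not trigger the finished callback early.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states for illustration only. */
enum notif_state { NOTIF_IDLE, NOTIF_INJECTED, NOTIF_ARMED };

struct notif {
    enum notif_state state;
    int finished_calls;   /* counts "interrupt finished" callbacks made */
};

/* Stage one: the vector has just been injected into the guest. */
void notif_arm(struct notif *n)
{
    assert(n != NULL);
    n->state = NOTIF_INJECTED;
}

/*
 * Called with the guest's current interrupt-enable state.
 * Stage two completes only once interrupts are seen disabled; after
 * that, seeing them enabled again counts as the end of the interrupt.
 * Returns true when the finished callback would be made.
 */
bool notif_check(struct notif *n, bool guest_irqs_enabled)
{
    assert(n != NULL);
    switch (n->state) {
    case NOTIF_INJECTED:
        if (!guest_irqs_enabled)
            n->state = NOTIF_ARMED;   /* now fully armed */
        return false;
    case NOTIF_ARMED:
        if (guest_irqs_enabled) {
            n->state = NOTIF_IDLE;
            n->finished_calls++;
            return true;
        }
        return false;
    default:
        return false;
    }
}
```

With only a single stage, the first check after injection (interrupts still enabled because the handler has not started) would be mistaken for the end of the handler.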
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved. 
> This includes
> >     > vcpu round robin delivery used by Linux and broadcast delivery
> >     used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows
> >     2k8-64 and the
> >     > other
> >     > for Linux. The Linux policy takes advantage of the 
> (guest) Linux
> >     missed tick
> >     > and offset
> >     > calculations and does not attempt to deliver the 
> right number of
> >     interrupts.
> >     > The Windows policy delivers the correct number of interrupts,
> >     even if
> >     > sometimes much
> >     > closer to each other than the period. The policies are similar
> >     to those in
> >     > vpt.c, though
> >     > there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param 
> hypercall with index
> >     > HVM_PARAM_TIMER_MODE.
> >     > Two new values are added, 
> HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
> >     two new ones
> >     > are added is that
> >     > in some guests (32bit Linux) a no-missed policy is needed for
> >     clock sources
> >     > other than hpet
> >     > and a missed ticks policy for hpet. It was felt that 
> there would
> >     be less
> >     > confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed
> >     ticks and offset
> >     > using the hpet
> >     > main counter. The algorithm works well when the time since the
> >     last interrupt
> >     > is greater than
> >     > or equal to a period and poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock 
> interrupts are
> >     delivered to
> >     > the guest at
> >     > a time interval less than a period. A time stamp (hpet main
> >     counter value) is
> >     > recorded (by a
> >     > callback registered with hvm_register_intr_en_notif) 
> when Linux
> >     finishes
> >     > handling the clock
> >     > interrupt. Then, ensuing interrupts are delivered to 
> the vioapic
> >     only if the
> >     > current main
> >     > counter value is a period greater than when the last interrupt
> >     was handled.
> >     >
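A minimal sketch of the delivery test described above, assuming the time stamp and the period are both kept in hpet main counter ticks. The struct and function names are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state for the missed ticks policy. */
struct mt_policy {
    uint64_t period;        /* timer period, in main counter ticks */
    uint64_t last_handled;  /* main counter value when the guest
                               finished the previous clock interrupt */
};

/* Record the end-of-interrupt time stamp (the "finished" callback). */
void mt_interrupt_finished(struct mt_policy *p, uint64_t mc_now)
{
    p->last_handled = mc_now;
}

/*
 * Deliver the next interrupt only if at least one full period has
 * passed since the last interrupt was handled. Unsigned subtraction
 * keeps the comparison correct across a counter wrap.
 */
bool mt_may_deliver(const struct mt_policy *p, uint64_t mc_now)
{
    assert(p->period > 0);
    return (mc_now - p->last_handled) >= p->period;
}
```

Interrupts that come due before a full period has elapsed are simply not delivered; the guest's own missed tick calculation is relied on to account for them.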
> >     > Tests showed a significant improvement in clock drift with end
> >     of interrupt
> >     > time stamps
> >     > versus beginning of interrupt[1]. It is believed that 
> the reason
> >     for the
> >     > improvement
> >     > is that the clock interrupt handler goes for a 
> spinlock and can
> >     be therefore
> >     > delayed in its
> >     > processing. Furthermore, the main counter is read by the guest
> >     under the lock.
> >     > The net
> >     > effect is that if we time stamp injection, we can get the
> >     difference in time
> >     > between successive interrupt handler lock acquisitions to be
> >     less than the
> >     > period.
> >     >
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k864 keeps very poor time with the missed 
> ticks policy.
> >     So the
> >     > no-missed ticks policy
> >     > was developed. In the no-missed ticks policy we deliver the
> >     correct number of
> >     > interrupts,
> >     > even if they are spaced less than a period apart 
> (when catching up).
> >     >
> >     > Windows 2k864 uses a broadcast mode in the interrupt routing
> >     such that
> >     > all vcpus get the clock interrupt. The best Windows drift
> >     performance was
> >     > achieved when the
> >     > policy code ensured that all the previous interrupts (on the
> >     various vcpus)
> >     > had been injected
> >     > before injecting the next interrupt to the vioapic.
> >     >
> >     > The policy code works as follows. It uses the
> >     hvm_isa_irq_assert_cb() to
> >     > record
> >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> >     the callback
> >     > registered
> >     > with hvm_register_intr_en_notif() at post=1 time it clears the
> >     current vcpu in
> >     > the pending_mask.
> >     > When the pending_mask is clear it decrements
> >     hpet.intr_pending_nr and if
> >     > intr_pending_nr is still
> >     > non-zero posts another interrupt to the ioapic with
> >     hvm_isa_irq_assert_cb().
> >     > Intr_pending_nr is incremented in
> >     hpet_route_decision_not_missed_ticks().
> >     >
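The bookkeeping described above can be sketched as follows. This is an illustration only: the field names pending_mask and intr_pending_nr follow the write-up, but the functions, the route_mask field, and the asserts counter are assumptions made so the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the no-missed ticks bookkeeping. */
struct nm_policy {
    uint64_t route_mask;       /* vcpus the interrupt is routed to */
    uint64_t pending_mask;     /* vcpus still waiting for injection */
    unsigned intr_pending_nr;  /* ticks owed to the guest */
    unsigned asserts;          /* times the line was asserted (demo only) */
};

/* Models asserting the irq with a callback that records the targets. */
static void nm_assert_irq(struct nm_policy *p)
{
    p->pending_mask = p->route_mask;
    p->asserts++;
}

/* A period has elapsed: one more tick is owed. */
void nm_tick_due(struct nm_policy *p)
{
    assert(p->route_mask != 0);
    p->intr_pending_nr++;
    if (p->pending_mask == 0)      /* nothing in flight: post it now */
        nm_assert_irq(p);
}

/* Injection point (post = 1) for one vcpu. */
void nm_injected(struct nm_policy *p, unsigned vcpu)
{
    assert(vcpu < 64);
    uint64_t bit = UINT64_C(1) << vcpu;
    if (!(p->pending_mask & bit))
        return;                    /* not waiting on this vcpu */
    p->pending_mask &= ~bit;
    if (p->pending_mask != 0)
        return;                    /* other vcpus still outstanding */
    p->intr_pending_nr--;          /* this tick is fully injected */
    if (p->intr_pending_nr > 0)
        nm_assert_irq(p);          /* catch up with the next owed tick */
}
```

The next tick is posted only after every targeted vcpu has had the previous one injected, which is the ordering the write-up reports as giving the best Windows drift.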
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     pending_mask
> >     > method. So even though
> >     > Linux does not broadcast its interrupts, the code could handle
> >     it if it did.
> >     > In this case the end of interrupt time stamp is made when the
> >     pending_mask is
> >     > clear.
> >     >
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the
> >     guest clock with
> >     > respect
> >     > to ntp. This is accomplished by migrating all of the state in
> >     the h->hpet data
> >     > structure
> >     > in the usual way. The hp->mc_offset is recalculated on the
> >     receiving node so
> >     > that the
> >     > guest sees a continuous hpet main counter.
> >     >
> >     > Code has been added to xc_domain_save.c to send a small message
> >     after the
> >     > domain context is sent. The content of the message is the
> >     physical tsc
> >     > timestamp, last_tsc,
> >     > read just before the message is sent. When the 
> last_tsc message
> >     is received in
> >     > xc_domain_restore.c,
> >     > another physical tsc timestamp, cur_tsc, is read. The two
> >     timestamps are
> >     > loaded into the domain
> >     > structure as last_tsc_sender and first_tsc_receiver with
> >     hypercalls. Then
> >     > xc_domain_hvm_setcontext
> >     > is called so that hpet_load has access to these time stamps.
> >     Hpet_load uses
> >     > the timestamps
> >     > to account for the time spent saving and loading the domain
> >     context. With this
> >     > technique,
> >     > the only neglected time is the time spent sending a small
> >     network message.
> >     >
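A rough sketch of how an offset can be recalculated so that the guest sees a continuous main counter. It assumes the guest-visible counter is the host counter plus an offset, and it takes the time spent saving and loading as an input already converted to main counter ticks; how the patch derives that from last_tsc_sender and first_tsc_receiver is not spelled out in the write-up, so that step is left out. All names here are hypothetical.

```c
#include <stdint.h>

/* Guest-visible main counter: host counter plus a per-guest offset.
 * Unsigned arithmetic wraps, so the offset may be "negative". */
uint64_t guest_mc(uint64_t host_mc, uint64_t mc_offset)
{
    return host_mc + mc_offset;
}

/*
 * On the receiving node, choose the offset so that the guest counter
 * resumes at its saved value plus the ticks that elapsed while the
 * domain context was being saved and loaded.
 */
uint64_t recompute_mc_offset(uint64_t guest_mc_at_save,
                             uint64_t elapsed_ticks,
                             uint64_t host_mc_now)
{
    return guest_mc_at_save + elapsed_ticks - host_mc_now;
}
```

For example, a guest saved at counter value 1000500 with 300 ticks of save and load time resumes at 1000800 whatever the receiving host's own counter happens to read.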
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> >     >       Duration: 70 hours.
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift : Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> >     >       Duration: 23 hours.
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift : Linux: .033% Windows: .019%
> >     >
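To put the drift percentages in perspective, a small conversion sketch may help. The .05% ntp limit is the figure quoted in this thread; the function names are made up for illustration.

```c
/* Convert a drift given in percent to parts per million. */
double percent_to_ppm(double pct)
{
    return pct * 10000.0;
}

/* Seconds gained or lost per day at a given drift percentage. */
double drift_seconds_per_day(double pct)
{
    return pct / 100.0 * 86400.0;
}

/* Nonzero if the drift is under the .05% limit cited for ntp. */
int within_ntp_limit(double pct)
{
    return pct < 0.05;
}
```

By this arithmetic .05% is 500 PPM, the .0012% load test result is about one second per day, and a 2% drift is about 29 minutes per day.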
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in
> >     xen-unstable and
> >     > read_64_main_counter()
> >     > in this code. However, read_64_main_counter() is more tuned to
> >     the needs of
> >     > hpet.c. It has no
> >     > "set" operation, only the get. It isolates the mode, 
> physical or
> >     simulated, in
> >     > read_64_main_counter()
> >     > itself. It uses no vcpu or domain state as it is a physical
> >     entity, in either
> >     > mode. And it provides a real
> >     > physical mode for every read for those applications 
> that desire
> >     this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms 
> of accuracy and
> >     > monotonicity.
> >     > Tests performed to date verify this and more testing 
> is under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason
> >     for accuracy
> >     > variations
> >     > on different platforms using the physical hpet device will be
> >     investigated.
> >     > Additional overhead measurements on simulated vs physical hpet
> >     mode will be
> >     > made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end 
> of interrupt
> >     stamping, but
> >     > it was
> >     > significant, perhaps better than two to one improvement. It
> >     would be a very
> >     > simple matter
> >     > to re-measure the improvement as the facility can call back at
> >     injection time
> >     > as well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> >     >
> >     >
> >     > _______________________________________________
> >     > Xen-devel mailing list
> >     > Xen-devel@lists.xensource.com
> >     > http://lists.xensource.com/xen-devel
> >
> >
> >
> 
> 




^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-08 20:32           ` Dave Winchell
@ 2008-06-08 21:10             ` Dan Magenheimer
  2008-06-08 23:26               ` Dave Winchell
  2008-06-08 21:18             ` Dan Magenheimer
  2008-06-09 22:02             ` Dan Magenheimer
  2 siblings, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-08 21:10 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser; +Cc: xen-devel, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 25865 bytes --]

Hi Dave --

Thanks for the additional explanation.

Could you please be very precise, when you say "Linux",
as to what you are (and are not) testing?  Specifically:
1) kernel version number and/or distro info
2) 32 vs 64
3) kernel parameters specified
4) config file parameters
5) relevant CPU info that may be passed through by Xen
   to hvm guests (e.g. whether "tsc is synchronized")
6) relevant xen boot parameters (if any)

As we've seen, different combinations of the above can yield
very different test results.  We'd like to confirm your tests,
but if we can avoid unnecessary additional iterations (due to
mismatches on the above), that would be helpful.

Our testing goal is to ensure that there is at least one
known good combination of parameters for each of RHEL3,
RHEL4, and RHEL5 (both 32 and 64) that works on
both tsc-synchronized and tsc-unsynchronized Intel
and AMD boxes, ideally both with and without
a real physical hpet available.

We don't have a good test environment for Windows time,
but if you can provide the same test configuration detail,
we may be able to do some testing.

Thanks,
Dan

  -----Original Message-----
  From: Dave Winchell [mailto:dwinchell@virtualiron.com]
  Sent: Sunday, June 08, 2008 2:32 PM
  To: dan.magenheimer@oracle.com; Keir Fraser
  Cc: Ben Guthro; xen-devel; Dave Winchell
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


  Hi Dan,

  > While I am fully supportive of offering hardware hpet as an option
  > for hvm guests (let's call it hwhpet=1 for shorthand), I am very
  > surprised by your preliminary results; the most obvious conclusion
  > is that Xen system time is losing time at the rate of 1000 PPM
  > though it's possible there's a bug somewhere else in the "time
  > stack".  Your Windows result is jaw-dropping and inexplicable,
  > though I have to admit ignorance of how Windows manages time.

  I think xen system time is fine. You have to add the interrupt
  delivery policies described in the write-up for the patch to get
  accurate timekeeping in the guest.

  The windows policy is obvious and results in a large improvement
  in accuracy. The Linux policy is more subtle, but is required to go
  from .1% to .03%.

  > I think with my recent patch and hpet=1 (essentially the same as
  > your emulated hpet), hvm guest time should track Xen system time.
  > I wonder if domain0 (which if I understand correctly is directly
  > using Xen system time) is also seeing an error of .1%?  Also
  > I wonder for the skew you are seeing (in both hvm guests and
  > domain0) is time moving too fast or too slow?

  I don't recall the direction. I can look it up in my notes at work
  tomorrow.

  > Although hwhpet=1 is a fine alternative in many cases, it may
  > be unavailable on some systems and may cause significant performance
  > issues on others.  So I think we will still need to track down
  > the poor accuracy when hwhpet=0.

  Our patch is accurate to < .03% using the physical hpet mode or
  the simulated mode.

  > And if for some reason
  > Xen system time can't be made accurate enough (< 0.05%), then
  > I think we should consider building Xen system time itself on
  > top of hardware hpet instead of TSC... at least when Xen discovers
  > a capable hpet.

  In our experience, Xen system time is accurate enough now.

  > One more thought... do you know the accuracy of the TSC crystals
  > on your test systems?  I posted a patch awhile ago that was
  > intended to test that, though I guess it was only testing skew
  > of different TSCs on the same system, not TSCs against an
  > external time source.

  I do not know the tsc accuracy.

  > Or maybe there's a computation error somewhere in the hvm hpet
  > scaling code?  Hmmm...


  Regards,
  Dave


  -----Original Message-----
  From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
  Sent: Fri 6/6/2008 4:29 PM
  To: Dave Winchell; Keir Fraser
  Cc: Ben Guthro; xen-devel
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

  Dave --

  Thanks much for posting the preliminary results!

  While I am fully supportive of offering hardware hpet as an option
  for hvm guests (let's call it hwhpet=1 for shorthand), I am very
  surprised by your preliminary results; the most obvious conclusion
  is that Xen system time is losing time at the rate of 1000 PPM
  though it's possible there's a bug somewhere else in the "time
  stack".  Your Windows result is jaw-dropping and inexplicable,
  though I have to admit ignorance of how Windows manages time.


  I think with my recent patch and hpet=1 (essentially the same as
  your emulated hpet), hvm guest time should track Xen system time.
  I wonder if domain0 (which if I understand correctly is directly
  using Xen system time) is also seeing an error of .1%?  Also
  I wonder for the skew you are seeing (in both hvm guests and
  domain0) is time moving too fast or too slow?

  Although hwhpet=1 is a fine alternative in many cases, it may
  be unavailable on some systems and may cause significant performance
  issues on others.  So I think we will still need to track down
  the poor accuracy when hwhpet=0.  And if for some reason
  Xen system time can't be made accurate enough (< 0.05%), then
  I think we should consider building Xen system time itself on
  top of hardware hpet instead of TSC... at least when Xen discovers
  a capable hpet.

  One more thought... do you know the accuracy of the TSC crystals
  on your test systems?  I posted a patch awhile ago that was
  intended to test that, though I guess it was only testing skew
  of different TSCs on the same system, not TSCs against an
  external time source.

  Or maybe there's a computation error somewhere in the hvm hpet
  scaling code?  Hmmm...

  Thanks,
  Dan

  > -----Original Message-----
  > From: Dave Winchell [mailto:dwinchell@virtualiron.com]
  > Sent: Friday, June 06, 2008 1:33 PM
  > To: dan.magenheimer@oracle.com; Keir Fraser
  > Cc: Ben Guthro; xen-devel; Dave Winchell
  > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  >
  >
  > Dan, Keir:
  >
  > Preliminary tests results indicate an error of .1% for Linux 64 bit
  > guests configured
  > for hpet with xen-unstable as is. As we have discussed many times, the
  > ntp requirement is .05%.
  > Tests on the patch we just submitted for hpet have indicated errors of
  > .0012%
  > on this platform under similar test conditions and .03% on
  > other platforms.
  >
  > Windows vista64 has an error of 11% using hpet with the
  > xen-unstable bits.
  > In an overnight test with our hpet patch, the Windows vista
  > error was .008%.
  >
  > The tests are with two or three guests on a physical node, all under
  > load, and with
  > the ratio of vcpus to phys cpus > 1.
  >
  > I will continue to run tests over the next few days.
  >
  > thanks,
  > Dave
  >
  >
  > Dan Magenheimer wrote:
  >
  > > Hi Dave and Ben --
  > >
  > > When running tests on xen-unstable (without your patch),
  > please ensure
  > > that hpet=1 is set in the hvm config and also I think that when hpet
  > > is the clocksource on RHEL4-32, the clock IS resilient to
  > missed ticks
  > > so timer_mode should be 2 (vs when pit is the clocksource
  > on RHEL4-32,
  > > all clock ticks must be delivered and so timer_mode should be 0).
  > >
  > > Per
  > > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
  > > it's my intent to clean this up, but I won't get to it until next week.
  > >
  > > Thanks,
  > > Dan
  > >
  > >     -----Original Message-----
  > >     *From:* xen-devel-bounces@lists.xensource.com
  > >     [mailto:xen-devel-bounces@lists.xensource.com]*On
  > Behalf Of *Dave
  > >     Winchell
  > >     *Sent:* Friday, June 06, 2008 4:46 AM
  > >     *To:* Keir Fraser; Ben Guthro; xen-devel
  > >     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
  > >     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  > >
  > >     Keir,
  > >
  > >     I think the changes are required. We'll run some tests today so
  > >     that we have some data to talk about.
  > >
  > >     -Dave
  > >
  > >
  > >     -----Original Message-----
  > >     From: xen-devel-bounces@lists.xensource.com on behalf
  > of Keir Fraser
  > >     Sent: Fri 6/6/2008 4:58 AM
  > >     To: Ben Guthro; xen-devel
  > >     Cc: dan.magenheimer@oracle.com
  > >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  > >
  > >     Are these patches needed now the timers are built on Xen system
  > >     time rather
  > >     than host TSC? Dan has reported much better
  > time-keeping with his
  > >     patch
  > >     checked in, and it's for sure a lot less invasive than
  > this patchset.
  > >
  > >
  > >      -- Keir
  > >
  > >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
  > >
  > >     >
  > >     > 1. Introduction
  > >     >
  > >     > This patch improves the hpet based guest clock in
  > terms of drift and
  > >     > monotonicity.
  > >     > Prior to this work the drift with hpet was greater
  > than 2%, far
  > >     above the .05%
  > >     > limit
  > >     > for ntp to synchronize. With this code, the drift ranges from
  > >     .001% to .0033%
  > >     > depending
  > >     > on guest and physical platform.
  > >     >
  > >     > Using hpet allows guest operating systems to provide monotonic
  > >     time to their
  > >     > applications. Time sources other than hpet are not
  > monotonic because
  > >     > of their reliance on tsc, which is not synchronized
  > across physical
  > >     > processors.
  > >     >
  > >     > Windows 2k864 and many Linux guests are supported with two
  > >     policies, one for
  > >     > guests
  > >     > that handle missed clock interrupts and the other for guests
  > >     that require the
  > >     > correct number of interrupts.
  > >     >
  > >     > Guests may use hpet for the timing source even if the physical
  > >     platform has no
  > >     > visible
  > >     > hpet. Migration is supported between physical machines which
  > >     differ in
  > >     > physical
  > >     > hpet visibility.
  > >     >
  > >     > Most of the changes are in hpet.c. Two general facilities are
  > >     added to track
  > >     > interrupt
  > >     > progress. The ideas here and the facilities would be useful in
  > >     vpt.c, for
  > >     > other time
  > >     > sources, though no attempt is made here to improve vpt.c.
  > >     >
  > >     > The following sections discuss hpet dependencies, interrupt
  > >     delivery policies,
  > >     > live migration,
  > >     > test results, and relation to recent work with monotonic time.
  > >     >
  > >     >
  > >     > 2. Virtual Hpet dependencies
  > >     >
  > >     > The virtual hpet depends on the ability to read the
  > physical or
  > >     simulated
  > >     > (see discussion below) hpet.  For timekeeping, the
  > virtual hpet
  > >     also depends
  > >     > on two new interrupt notification facilities to implement its
  > >     policies for
  > >     > interrupt delivery.
  > >     >
  > >     > 2.1. Two modes of low-level hpet main counter reads.
  > >     >
  > >     > In this implementation, the virtual hpet reads with
  > >     read_64_main_counter(),
  > >     > exported by
  > >     > time.c, either the real physical hpet main counter register
  > >     directly or a
  > >     > "simulated"
  > >     > hpet main counter.
  > >     >
  > >     > The simulated mode uses a monotonic version of get_s_time()
  > >     (NOW()), where the
  > >     > last
  > >     > time value is returned whenever the current time value is less
  > >     than the last
  > >     > time
  > >     > value. In simulated mode, since it is layered on s_time, the
  > >     underlying
  > >     > hardware
  > >     > can be hpet or some other device. The frequency of the main
  > >     counter in
  > >     > simulated
  > >     > mode is the same as the standard physical hpet frequency,
  > >     allowing live
  > >     > migration
  > >     > between nodes that are configured differently.
  > >     >
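The simulated mode described above can be sketched as a clamp on the raw time value followed by a conversion to counter ticks. This is only an illustration: the names are hypothetical, the 14.31818 MHz rate is an assumed example of a common hpet frequency rather than a value stated in the write-up, and a real version would need a lock or an atomic update because several processors can read at once.

```c
#include <stdint.h>

/* Assumed counter rate, for illustration only. */
#define SIM_HPET_HZ 14318180ULL

/* Last value handed out (single-threaded sketch; no locking). */
static uint64_t last_ns;

/*
 * Return a time that never goes backwards: if the raw reading is
 * behind the last value returned, return the last value again.
 */
uint64_t monotonic_ns(uint64_t raw_ns)
{
    if (raw_ns < last_ns)
        return last_ns;
    last_ns = raw_ns;
    return raw_ns;
}

/* Convert nanoseconds to counter ticks at the assumed frequency.
 * Seconds and remainder are handled separately to avoid overflow. */
uint64_t ns_to_hpet_ticks(uint64_t ns)
{
    uint64_t sec = ns / 1000000000ULL;
    uint64_t rem = ns % 1000000000ULL;
    return sec * SIM_HPET_HZ + (rem * SIM_HPET_HZ) / 1000000000ULL;
}
```

The clamp is also the reason the simulated counter can return the same value on consecutive reads: a raw reading that lands behind the last one is replaced by the last one.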
  > >     > If the physical platform does not have an hpet
  > device, or if xen
  > >     is configured
  > >     > not
  > >     > to use the device, then the simulated method is used. If there
  > >     is a physical
  > >     > hpet device,
  > >     > and xen has initialized it, then either simulated or physical
  > >     mode can be
  > >     > used.
  > >     > This is governed by a boot time option, hpet-avoid.
  > Setting this
  > >     option to 1
  > >     > gives the
  > >     > simulated mode and 0 the physical mode. The default
  > is physical
  > >     mode.
  > >     >
  > >     > A disadvantage of the physical mode is that it may take longer to
  > >     read the device
  > >     > than in simulated mode. On some platforms the cost is
  > about the
  > >     same (less
  > >     > than 250 nsec) for
  > >     > physical and simulated modes, while on others physical cost is
  > >     much higher
  > >     > than simulated.
  > >     > A disadvantage of the simulated mode is that it can return the
  > >     same value
  > >     > for the counter in consecutive calls.
  > >     >
  > >     > 2.2. Interrupt notification facilities.
  > >     >
  > >     > Two interrupt notification facilities are introduced, one is
  > >     > hvm_isa_irq_assert_cb()
  > >     > and the other hvm_register_intr_en_notif().
  > >     >
  > >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
  > >     the vioapic.
  > >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
  > >     > vioapic_deliver()
  > >     > and this callback is called with a mask of the vcpus
  > which will
  > >     get the
  > >     > interrupt. This callback is made before any vcpus receive an
  > >     interrupt.
  > >     >
  > >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
  > >     for a particular
  > >     > vector that will be called when that vector is injected in
  > >     > [vmx,svm]_intr_assist()
  > >     > and also when the guest finishes handling the interrupt. Here
  > >     finished is
  > >     > defined
  > >     > as the point when the guest re-enables interrupts or
  > lowers the
  > >     tpr value.
  > >     > EOI is not used as the end of interrupt as this is sometimes
  > >     returned before
  > >     > the interrupt handler has done its work. A flag is
  > passed to the
  > >     handler
  > >     > indicating
  > >     > whether this is the injection point (post = 1) or the
  > interrupt
  > >     finished (post
  > >     > = 0) point.
  > >     > The need for the finished point callback is discussed in the
  > >     missed ticks
  > >     > policy section.
  > >     >
  > >     > To prevent a possible early trigger of the finished callback,
  > >     intr_en_notif
  > >     > logic
  > >     > has a two stage arm, the first at injection
  > >     (hvm_intr_en_notif_arm()) and the
  > >     > second when
  > >     > interrupts are seen to be disabled
  > (hvm_intr_en_notif_disarm()).
  > >     Once fully
  > >     > armed, re-enabling
  > >     > interrupts will cause hvm_intr_en_notif_disarm() to
  > make the end
  > >     of interrupt
  > >     > callback. hvm_intr_en_notif_arm() and
  > hvm_intr_en_notif_disarm()
  > >     are called by
  > >     > [vmx,svm]_intr_assist().
  > >     >
  > >     > 3. Interrupt delivery policies
  > >     >
  > >     > The existing hpet interrupt delivery is preserved.
  > This includes
  > >     > vcpu round robin delivery used by Linux and broadcast delivery
  > >     used by
  > >     > Windows.
  > >     >
  > >     > There are two policies for interrupt delivery, one for Windows
  > >     2k8-64 and the
  > >     > other
  > >     > for Linux. The Linux policy takes advantage of the
  > (guest) Linux
  > >     missed tick
  > >     > and offset
  > >     > calculations and does not attempt to deliver the
  > right number of
  > >     interrupts.
  > >     > The Windows policy delivers the correct number of interrupts,
  > >     even if
  > >     > sometimes much
  > >     > closer to each other than the period. The policies are similar
  > >     to those in
  > >     > vpt.c, though
  > >     > there are some important differences.
  > >     >
  > >     > Policies are selected with an HVMOP_set_param
  > hypercall with index
  > >     > HVM_PARAM_TIMER_MODE.
  > >     > Two new values are added,
  > HVM_HPET_guest_computes_missed_ticks and
  > >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
  > >     two new ones
  > >     > are added is that
  > >     > in some guests (32bit Linux) a no-missed policy is needed for
  > >     clock sources
  > >     > other than hpet
  > >     > and a missed ticks policy for hpet. It was felt that
  > there would
  > >     be less
  > >     > confusion by simply
  > >     > introducing the two hpet policies.
  > >     >
  > >     > 3.1. The missed ticks policy
  > >     >
  > >     > The Linux clock interrupt handler for hpet calculates missed
  > >     ticks and offset
  > >     > using the hpet
  > >     > main counter. The algorithm works well when the time since the
  > >     last interrupt
  > >     > is greater than
  > >     > or equal to a period and poorly otherwise.
  > >     >
  > >     > The missed ticks policy ensures that no two clock
  > interrupts are
  > >     delivered to
  > >     > the guest at
  > >     > a time interval less than a period. A time stamp (hpet main
  > >     counter value) is
  > >     > recorded (by a
  > >     > callback registered with hvm_register_intr_en_notif)
  > when Linux
  > >     finishes
  > >     > handling the clock
  > >     > interrupt. Then, ensuing interrupts are delivered to
  > the vioapic
  > >     only if the
  > >     > current main
  > >     > counter value is a period greater than when the last interrupt
  > >     was handled.
  > >     >
  > >     > Tests showed a significant improvement in clock drift with end
  > >     of interrupt
  > >     > time stamps
  > >     > versus beginning of interrupt[1]. It is believed that
  > the reason
  > >     for the
  > >     > improvement
  > >     > is that the clock interrupt handler goes for a
  > spinlock and can
  > >     be therefore
  > >     > delayed in its
  > >     > processing. Furthermore, the main counter is read by the guest
  > >     under the lock.
  > >     > The net
  > >     > effect is that if we time stamp injection, we can get the
  > >     difference in time
  > >     > between successive interrupt handler lock acquisitions to be
  > >     less than the
  > >     > period.
  > >     >
  > >     > 3.2. The no-missed ticks policy
  > >     >
  > >     > Windows 2k864 keeps very poor time with the missed
  > ticks policy.
  > >     So the
  > >     > no-missed ticks policy
  > >     > was developed. In the no-missed ticks policy we deliver the
  > >     correct number of
  > >     > interrupts,
  > >     > even if they are spaced less than a period apart
  > (when catching up).
  > >     >
  > >     > Windows 2k864 uses a broadcast mode in the interrupt routing
  > >     such that
  > >     > all vcpus get the clock interrupt. The best Windows drift
  > >     performance was
  > >     > achieved when the
  > >     > policy code ensured that all the previous interrupts (on the
  > >     various vcpus)
  > >     > had been injected
  > >     > before injecting the next interrupt to the vioapic.
  > >     >
  > >     > The policy code works as follows. It uses the
  > >     hvm_isa_irq_assert_cb() to
  > >     > record
  > >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
  > >     the callback
  > >     > registered
  > >     > with hvm_register_intr_en_notif() at post=1 time it clears the
  > >     current vcpu in
  > >     > the pending_mask.
  > >     > When the pending_mask is clear it decrements
  > >     hpet.intr_pending_nr and if
  > >     > intr_pending_nr is still
  > >     > non-zero posts another interrupt to the ioapic with
  > >     hvm_isa_irq_assert_cb().
  > >     > Intr_pending_nr is incremented in
  > >     hpet_route_decision_not_missed_ticks().
  > >     >
  > >     > The missed ticks policy intr_en_notif callback also uses the
  > >     pending_mask
  > >     > method. So even though
  > >     > Linux does not broadcast its interrupts, the code could handle
  > >     it if it did.
  > >     > In this case the end of interrupt time stamp is made when the
  > >     pending_mask is
  > >     > clear.
  > >     >
  > >     > 4. Live Migration
  > >     >
  > >     > Live migration with hpet preserves the current offset of the
  > >     guest clock with
  > >     > respect
  > >     > to ntp. This is accomplished by migrating all of the state in
  > >     the h->hpet data
  > >     > structure
  > >     > in the usual way. The hp->mc_offset is recalculated on the
  > >     receiving node so
  > >     > that the
  > >     > guest sees a continuous hpet main counter.
  > >     >
  > >     > Code has been added to xc_domain_save.c to send a small message
  > >     after the
  > >     > domain context is sent. The content of the message is the
  > >     physical tsc
  > >     > timestamp, last_tsc,
  > >     > read just before the message is sent. When the
  > last_tsc message
  > >     is received in
  > >     > xc_domain_restore.c,
  > >     > another physical tsc timestamp, cur_tsc, is read. The two
  > >     timestamps are
  > >     > loaded into the domain
  > >     > structure as last_tsc_sender and first_tsc_receiver with
  > >     hypercalls. Then
  > >     > xc_domain_hvm_setcontext
  > >     > is called so that hpet_load has access to these time stamps.
  > >     Hpet_load uses
  > >     > the timestamps
  > >     > to account for the time spent saving and loading the domain
  > >     context. With this
  > >     > technique,
  > >     > the only neglected time is the time spent sending a small
  > >     network message.
  > >     >
  > >     > 5. Test Results
  > >     >
  > >     > Some recent test results are:
  > >     >
  > >     > 5.1 Linux 4u664 and Windows 2k864 load test.
  > >     >       Duration: 70 hours.
  > >     >       Test date: 6/2/08
  > >     >       Loads: usex -b48 on Linux; burn-in on Windows
  > >     >       Guest vcpus: 8 for Linux; 2 for Windows
  > >     >       Hardware: 8 physical cpu AMD
  > >     >       Clock drift : Linux: .0012% Windows: .009%
  > >     >
  > >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
  > >     >       Duration: 23 hours.
  > >     >       Test date: 6/3/08
  > >     >       Loads: none
  > >     >       Guest vcpus: 8 for each Linux; 2 for Windows
  > >     >       Hardware: 4 physical cpu AMD
  > >     >       Clock drift : Linux: .033% Windows: .019%
  > >     >
  > >     > 6. Relation to recent work in xen-unstable
  > >     >
  > >     > There is a similarity between hvm_get_guest_time() in
  > >     xen-unstable and
  > >     > read_64_main_counter()
  > >     > in this code. However, read_64_main_counter() is more tuned to
  > >     the needs of
  > >     > hpet.c. It has no
  > >     > "set" operation, only the get. It isolates the mode,
  > physical or
  > >     simulated, in
  > >     > read_64_main_counter()
  > >     > itself. It uses no vcpu or domain state as it is a physical
  > >     entity, in either
  > >     > mode. And it provides a real
  > >     > physical mode for every read for those applications
  > that desire
  > >     this.
  > >     >
  > >     > 7. Conclusion
  > >     >
  > >     > The virtual hpet is improved by this patch in terms
  > of accuracy and
  > >     > monotonicity.
  > >     > Tests performed to date verify this and more testing
  > is under way.
  > >     >
  > >     > 8. Future Work
  > >     >
  > >     > Testing with Windows Vista will be performed soon. The reason
  > >     for accuracy
  > >     > variations
  > >     > on different platforms using the physical hpet device will be
  > >     investigated.
  > >     > Additional overhead measurements on simulated vs physical hpet
  > >     mode will be
  > >     > made.
  > >     >
  > >     > Footnotes:
  > >     >
  > >     > 1. I don't recall the accuracy improvement with end
  > of interrupt
  > >     stamping, but
  > >     > it was
  > >     > significant, perhaps better than two to one improvement. It
  > >     would be a very
  > >     > simple matter
  > >     > to re-measure the improvement as the facility can call back at
  > >     injection time
  > >     > as well.
  > >     >
  > >     >
  > >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
  > >     > <mailto:dwinchell@virtualiron.com>
  > >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
  > >     > <mailto:bguthro@virtualiron.com>
  > >     >
  > >     >
  > >     > _______________________________________________
  > >     > Xen-devel mailing list
  > >     > Xen-devel@lists.xensource.com
  > >     > http://lists.xensource.com/xen-devel
  > >
  > >
  > >
  >
  >





^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-08 20:32           ` Dave Winchell
  2008-06-08 21:10             ` Dan Magenheimer
@ 2008-06-08 21:18             ` Dan Magenheimer
  2008-06-09 22:02             ` Dan Magenheimer
  2 siblings, 0 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-08 21:18 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser; +Cc: xen-devel, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 25035 bytes --]

> A disadvantage of the simulated mode is that it can return the same value
> for the counter in consecutive calls.

It also occurs to me that if the granularity is good enough, an easy fix
to this problem might be to always increment the returned value
by at least one.  Then time is always at least increasing rather
than stopped.
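
A minimal sketch of that idea in C, for illustration only. The struct and
function names are hypothetical and are not taken from the patch; `raw`
stands in for whatever the simulated main counter would have returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for illustration; not a structure from the patch. */
struct mono_counter {
    uint64_t last;   /* last value handed out */
    int      valid;  /* 0 until the first read */
};

/*
 * Return a strictly increasing counter value. If the raw reading has not
 * advanced past the last value returned (equal or behind), hand out
 * last + 1 instead, so two consecutive reads never see the same value.
 */
uint64_t mono_counter_read(struct mono_counter *c, uint64_t raw)
{
    assert(c != NULL);
    if (c->valid && raw <= c->last)
        raw = c->last + 1;
    c->last = raw;
    c->valid = 1;
    return raw;
}
```

This ignores locking and counter wraparound, both of which a real
implementation shared across cpus would have to handle. It also only makes
sense if reads are rare compared with the counter granularity; otherwise the
forced increments would push the counter ahead of real time.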
  -----Original Message-----
  From: Dave Winchell [mailto:dwinchell@virtualiron.com]
  Sent: Sunday, June 08, 2008 2:32 PM
  To: dan.magenheimer@oracle.com; Keir Fraser
  Cc: Ben Guthro; xen-devel; Dave Winchell
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


  Hi Dan,

  > While I am fully supportive of offering hardware hpet as an option
  > for hvm guests (let's call it hwhpet=1 for shorthand), I am very
  > surprised by your preliminary results; the most obvious conclusion
  > is that Xen system time is losing time at the rate of 1000 PPM
  > though it's possible there's a bug somewhere else in the "time
  > stack".  Your Windows result is jaw-dropping and inexplicable,
  > though I have to admit ignorance of how Windows manages time.

  I think xen system time is fine. You have to add the interrupt
  delivery policies described in the write-up for the patch to get
  accurate timekeeping in the guest.

  The windows policy is obvious and results in a large improvement
  in accuracy. The Linux policy is more subtle, but is required to go
  from .1% to .03%.

  > I think with my recent patch and hpet=1 (essentially the same as
  > your emulated hpet), hvm guest time should track Xen system time.
  > I wonder if domain0 (which if I understand correctly is directly
  > using Xen system time) is also seeing an error of .1%?  Also
  > I wonder for the skew you are seeing (in both hvm guests and
  > domain0) is time moving too fast or too slow?

  I don't recall the direction. I can look it up in my notes at work
  tomorrow.

  > Although hwhpet=1 is a fine alternative in many cases, it may
  > be unavailable on some systems and may cause significant performance
  > issues on others.  So I think we will still need to track down
  > the poor accuracy when hwhpet=0.

  Our patch is accurate to < .03% using the physical hpet mode or
  the simulated mode.
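
For readers comparing the figures in this thread, the percentages convert to
PPM and to seconds per day as sketched below. The function names are made up
for illustration. By this arithmetic .1% is 1000 PPM (86.4 seconds per day),
the .05% ntp limit is 500 PPM (43.2 seconds per day), and .03% is 300 PPM
(about 25.9 seconds per day).

```c
#include <assert.h>

/* Convert a drift given in percent to parts per million (1% = 10,000 PPM). */
double drift_percent_to_ppm(double percent)
{
    return percent * 10000.0;
}

/* Seconds gained or lost over one day (86,400 s) at a given drift in percent. */
double drift_seconds_per_day(double percent)
{
    return 86400.0 * percent / 100.0;
}

/*
 * Drift in percent, computed from the time elapsed on the guest clock and
 * the time elapsed on a reference clock over the same interval.
 * Positive means the guest clock runs fast.
 */
double drift_percent(double guest_elapsed_s, double ref_elapsed_s)
{
    assert(ref_elapsed_s > 0.0);
    return 100.0 * (guest_elapsed_s - ref_elapsed_s) / ref_elapsed_s;
}
```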

  > And if for some reason
  > Xen system time can't be made accurate enough (< 0.05%), then
  > I think we should consider building Xen system time itself on
  > top of hardware hpet instead of TSC... at least when Xen discovers
  > a capable hpet.

  In our experience, Xen system time is accurate enough now.

  > One more thought... do you know the accuracy of the TSC crystals
  > on your test systems?  I posted a patch awhile ago that was
  > intended to test that, though I guess it was only testing skew
  > of different TSCs on the same system, not TSCs against an
  > external time source.

  I do not know the tsc accuracy.

  > Or maybe there's a computation error somewhere in the hvm hpet
  > scaling code?  Hmmm...


  Regards,
  Dave


  -----Original Message-----
  From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
  Sent: Fri 6/6/2008 4:29 PM
  To: Dave Winchell; Keir Fraser
  Cc: Ben Guthro; xen-devel
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

  Dave --

  Thanks much for posting the preliminary results!

  While I am fully supportive of offering hardware hpet as an option
  for hvm guests (let's call it hwhpet=1 for shorthand), I am very
  surprised by your preliminary results; the most obvious conclusion
  is that Xen system time is losing time at the rate of 1000 PPM
  though it's possible there's a bug somewhere else in the "time
  stack".  Your Windows result is jaw-dropping and inexplicable,
  though I have to admit ignorance of how Windows manages time.


  I think with my recent patch and hpet=1 (essentially the same as
  your emulated hpet), hvm guest time should track Xen system time.
  I wonder if domain0 (which if I understand correctly is directly
  using Xen system time) is also seeing an error of .1%?  Also
  I wonder for the skew you are seeing (in both hvm guests and
  domain0) is time moving too fast or too slow?

  Although hwhpet=1 is a fine alternative in many cases, it may
  be unavailable on some systems and may cause significant performance
  issues on others.  So I think we will still need to track down
  the poor accuracy when hwhpet=0.  And if for some reason
  Xen system time can't be made accurate enough (< 0.05%), then
  I think we should consider building Xen system time itself on
  top of hardware hpet instead of TSC... at least when Xen discovers
  a capable hpet.

  One more thought... do you know the accuracy of the TSC crystals
  on your test systems?  I posted a patch awhile ago that was
  intended to test that, though I guess it was only testing skew
  of different TSCs on the same system, not TSCs against an
  external time source.

  Or maybe there's a computation error somewhere in the hvm hpet
  scaling code?  Hmmm...

  Thanks,
  Dan

  > -----Original Message-----
  > From: Dave Winchell [mailto:dwinchell@virtualiron.com]
  > Sent: Friday, June 06, 2008 1:33 PM
  > To: dan.magenheimer@oracle.com; Keir Fraser
  > Cc: Ben Guthro; xen-devel; Dave Winchell
  > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  >
  >
  > Dan, Keir:
  >
  > Preliminary tests results indicate an error of .1% for Linux 64 bit
  > guests configured
  > for hpet with xen-unstable as is. As we have discussed many times, the
  > ntp requirement is .05%.
  > Tests on the patch we just submitted for hpet have indicated errors of
  > .0012%
  > on this platform under similar test conditions and .03% on
  > other platforms.
  >
  > Windows vista64 has an error of 11% using hpet with the
  > xen-unstable bits.
  > In an overnight test with our hpet patch, the Windows vista
  > error was .008%.
  >
  > The tests are with two or three guests on a physical node, all under
  > load, and with
  > the ratio of vcpus to phys cpus > 1.
  >
  > I will continue to run tests over the next few days.
  >
  > thanks,
  > Dave
  >
  >
  > Dan Magenheimer wrote:
  >
  > > Hi Dave and Ben --
  > >
  > > When running tests on xen-unstable (without your patch),
  > please ensure
  > > that hpet=1 is set in the hvm config and also I think that when hpet
  > > is the clocksource on RHEL4-32, the clock IS resilient to
  > missed ticks
  > > so timer_mode should be 2 (vs when pit is the clocksource
  > on RHEL4-32,
  > > all clock ticks must be delivered and so timer_mode should be 0).
  > >
  > > Per
  > > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
  > > it's my intent to clean this up, but I won't get to it until next week.
  > >
  > > Thanks,
  > > Dan
  > >
  > >     -----Original Message-----
  > >     *From:* xen-devel-bounces@lists.xensource.com
  > >     [mailto:xen-devel-bounces@lists.xensource.com]*On
  > Behalf Of *Dave
  > >     Winchell
  > >     *Sent:* Friday, June 06, 2008 4:46 AM
  > >     *To:* Keir Fraser; Ben Guthro; xen-devel
  > >     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
  > >     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  > >
  > >     Keir,
  > >
  > >     I think the changes are required. We'll run some tests today so
  > >     that we have some data to talk about.
  > >
  > >     -Dave
  > >
  > >
  > >     -----Original Message-----
  > >     From: xen-devel-bounces@lists.xensource.com on behalf
  > of Keir Fraser
  > >     Sent: Fri 6/6/2008 4:58 AM
  > >     To: Ben Guthro; xen-devel
  > >     Cc: dan.magenheimer@oracle.com
  > >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  > >
  > >     Are these patches needed now the timers are built on Xen system
  > >     time rather than host TSC? Dan has reported much better
  > >     time-keeping with his patch checked in, and it's for sure a lot
  > >     less invasive than this patchset.
  > >
  > >
  > >      -- Keir
  > >
  > >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
  > >
  > >     >
  > >     > 1. Introduction
  > >     >
  > >     > This patch improves the hpet based guest clock in
  > terms of drift and
  > >     > monotonicity.
  > >     > Prior to this work the drift with hpet was greater
  > than 2%, far
  > >     above the .05%
  > >     > limit
  > >     > for ntp to synchronize. With this code, the drift ranges from
  > >     .001% to .0033%
  > >     > depending
  > >     > on guest and physical platform.
  > >     >
  > >     > Using hpet allows guest operating systems to provide monotonic
  > >     time to their
  > >     > applications. Time sources other than hpet are not
  > monotonic because
  > >     > of their reliance on tsc, which is not synchronized
  > across physical
  > >     > processors.
  > >     >
  > >     > Windows 2k864 and many Linux guests are supported with two
  > >     policies, one for
  > >     > guests
  > >     > that handle missed clock interrupts and the other for guests
  > >     that require the
  > >     > correct number of interrupts.
  > >     >
  > >     > Guests may use hpet for the timing source even if the physical
  > >     platform has no
  > >     > visible
  > >     > hpet. Migration is supported between physical machines which
  > >     differ in
  > >     > physical
  > >     > hpet visibility.
  > >     >
  > >     > Most of the changes are in hpet.c. Two general facilities are
  > >     added to track
  > >     > interrupt
  > >     > progress. The ideas here and the facilities would be useful in
  > >     vpt.c, for
  > >     > other time
  > >     > sources, though no attempt is made here to improve vpt.c.
  > >     >
  > >     > The following sections discuss hpet dependencies, interrupt
  > >     delivery policies,
  > >     > live migration,
  > >     > test results, and relation to recent work with monotonic time.
  > >     >
  > >     >
  > >     > 2. Virtual Hpet dependencies
  > >     >
  > >     > The virtual hpet depends on the ability to read the
  > physical or
  > >     simulated
  > >     > (see discussion below) hpet.  For timekeeping, the
  > virtual hpet
  > >     also depends
  > >     > on two new interrupt notification facilities to implement its
  > >     policies for
  > >     > interrupt delivery.
  > >     >
  > >     > 2.1. Two modes of low-level hpet main counter reads.
  > >     >
  > >     > In this implementation, the virtual hpet reads with
  > >     read_64_main_counter(),
  > >     > exported by
  > >     > time.c, either the real physical hpet main counter register
  > >     directly or a
  > >     > "simulated"
  > >     > hpet main counter.
  > >     >
  > >     > The simulated mode uses a monotonic version of get_s_time()
  > >     (NOW()), where the
  > >     > last
  > >     > time value is returned whenever the current time value is less
  > >     than the last
  > >     > time
  > >     > value. In simulated mode, since it is layered on s_time, the
  > >     underlying
  > >     > hardware
  > >     > can be hpet or some other device. The frequency of the main
  > >     counter in
  > >     > simulated
  > >     > mode is the same as the standard physical hpet frequency,
  > >     allowing live
  > >     > migration
  > >     > between nodes that are configured differently.
  > >     >
  > >     > If the physical platform does not have an hpet
  > device, or if xen
  > >     is configured
  > >     > not
  > >     > to use the device, then the simulated method is used. If there
  > >     is a physical
  > >     > hpet device,
  > >     > and xen has initialized it, then either simulated or physical
  > >     mode can be
  > >     > used.
  > >     > This is governed by a boot time option, hpet-avoid.
  > Setting this
  > >     option to 1
  > >     > gives the
  > >     > simulated mode and 0 the physical mode. The default
  > is physical
  > >     mode.
  > >     >
  > >     > A disadvantage of the physical mode is that it may take longer to
  > >     read the device
  > >     > than in simulated mode. On some platforms the cost is
  > about the
  > >     same (less
  > >     > than 250 nsec) for
  > >     > physical and simulated modes, while on others physical cost is
  > >     much higher
  > >     > than simulated.
  > >     > A disadvantage of the simulated mode is that it can return the
  > >     same value
  > >     > for the counter in consecutive calls.
  > >     >
  > >     > 2.2. Interrupt notification facilities.
  > >     >
  > >     > Two interrupt notification facilities are introduced, one is
  > >     > hvm_isa_irq_assert_cb()
  > >     > and the other hvm_register_intr_en_notif().
  > >     >
  > >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
  > >     the vioapic.
  > >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
  > >     > vioapic_deliver()
  > >     > and this callback is called with a mask of the vcpus
  > which will
  > >     get the
  > >     > interrupt. This callback is made before any vcpus receive an
  > >     interrupt.
  > >     >
  > >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
  > >     for a particular
  > >     > vector that will be called when that vector is injected in
  > >     > [vmx,svm]_intr_assist()
  > >     > and also when the guest finishes handling the interrupt. Here
  > >     finished is
  > >     > defined
  > >     > as the point when the guest re-enables interrupts or
  > lowers the
  > >     tpr value.
  > >     > EOI is not used as the end of interrupt as this is sometimes
  > >     returned before
  > >     > the interrupt handler has done its work. A flag is
  > passed to the
  > >     handler
  > >     > indicating
  > >     > whether this is the injection point (post = 1) or the
  > interrupt
  > >     finished (post
  > >     > = 0) point.
  > >     > The need for the finished point callback is discussed in the
  > >     missed ticks
  > >     > policy section.
  > >     >
  > >     > To prevent a possible early trigger of the finished callback,
  > >     intr_en_notif
  > >     > logic
  > >     > has a two stage arm, the first at injection
  > >     (hvm_intr_en_notif_arm()) and the
  > >     > second when
  > >     > interrupts are seen to be disabled
  > (hvm_intr_en_notif_disarm()).
  > >     Once fully
  > >     > armed, re-enabling
  > >     > interrupts will cause hvm_intr_en_notif_disarm() to
  > make the end
  > >     of interrupt
  > >     > callback. hvm_intr_en_notif_arm() and
  > hvm_intr_en_notif_disarm()
  > >     are called by
  > >     > [vmx,svm]_intr_assist().
  > >     >
  > >     > 3. Interrupt delivery policies
  > >     >
  > >     > The existing hpet interrupt delivery is preserved.
  > This includes
  > >     > vcpu round robin delivery used by Linux and broadcast delivery
  > >     used by
  > >     > Windows.
  > >     >
  > >     > There are two policies for interrupt delivery, one for Windows
  > >     2k8-64 and the
  > >     > other
  > >     > for Linux. The Linux policy takes advantage of the
  > (guest) Linux
  > >     missed tick
  > >     > and offset
  > >     > calculations and does not attempt to deliver the
  > right number of
  > >     interrupts.
  > >     > The Windows policy delivers the correct number of interrupts,
  > >     even if
  > >     > sometimes much
  > >     > closer to each other than the period. The policies are similar
  > >     to those in
  > >     > vpt.c, though
  > >     > there are some important differences.
  > >     >
  > >     > Policies are selected with an HVMOP_set_param
  > hypercall with index
  > >     > HVM_PARAM_TIMER_MODE.
  > >     > Two new values are added,
  > HVM_HPET_guest_computes_missed_ticks and
  > >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
  > >     two new ones
  > >     > are added is that
  > >     > in some guests (32bit Linux) a no-missed policy is needed for
  > >     clock sources
  > >     > other than hpet
  > >     > and a missed ticks policy for hpet. It was felt that
  > there would
  > >     be less
  > >     > confusion by simply
  > >     > introducing the two hpet policies.
  > >     >
  > >     > 3.1. The missed ticks policy
  > >     >
  > >     > The Linux clock interrupt handler for hpet calculates missed
  > >     ticks and offset
  > >     > using the hpet
  > >     > main counter. The algorithm works well when the time since the
  > >     last interrupt
  > >     > is greater than
  > >     > or equal to a period and poorly otherwise.
  > >     >
  > >     > The missed ticks policy ensures that no two clock
  > interrupts are
  > >     delivered to
  > >     > the guest at
  > >     > a time interval less than a period. A time stamp (hpet main
  > >     counter value) is
  > >     > recorded (by a
  > >     > callback registered with hvm_register_intr_en_notif)
  > when Linux
  > >     finishes
  > >     > handling the clock
  > >     > interrupt. Then, ensuing interrupts are delivered to
  > the vioapic
  > >     only if the
  > >     > current main
  > >     > counter value is a period greater than when the last interrupt
  > >     was handled.
  > >     >
  > >     > Tests showed a significant improvement in clock drift with end
  > >     of interrupt
  > >     > time stamps
  > >     > versus beginning of interrupt[1]. It is believed that
  > the reason
  > >     for the
  > >     > improvement
  > >     > is that the clock interrupt handler goes for a
  > spinlock and can
  > >     be therefore
  > >     > delayed in its
  > >     > processing. Furthermore, the main counter is read by the guest
  > >     under the lock.
  > >     > The net
  > >     > effect is that if we time stamp injection, we can get the
  > >     difference in time
  > >     > between successive interrupt handler lock acquisitions to be
  > >     less than the
  > >     > period.
  > >     >
  > >     > 3.2. The no-missed ticks policy
  > >     >
  > >     > Windows 2k864 keeps very poor time with the missed
  > ticks policy.
  > >     So the
  > >     > no-missed ticks policy
  > >     > was developed. In the no-missed ticks policy we deliver the
  > >     correct number of
  > >     > interrupts,
  > >     > even if they are spaced less than a period apart
  > (when catching up).
  > >     >
  > >     > Windows 2k864 uses a broadcast mode in the interrupt routing
  > >     such that
  > >     > all vcpus get the clock interrupt. The best Windows drift
  > >     performance was
  > >     > achieved when the
  > >     > policy code ensured that all the previous interrupts (on the
  > >     various vcpus)
  > >     > had been injected
  > >     > before injecting the next interrupt to the vioapic.
  > >     >
  > >     > The policy code works as follows. It uses the
  > >     hvm_isa_irq_assert_cb() to
  > >     > record
  > >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
  > >     the callback
  > >     > registered
  > >     > with hvm_register_intr_en_notif() at post=1 time it clears the
  > >     current vcpu in
  > >     > the pending_mask.
  > >     > When the pending_mask is clear it decrements
  > >     hpet.intr_pending_nr and if
  > >     > intr_pending_nr is still
  > >     > non-zero posts another interrupt to the ioapic with
  > >     hvm_isa_irq_assert_cb().
  > >     > Intr_pending_nr is incremented in
  > >     hpet_route_decision_not_missed_ticks().
  > >     >
  > >     > The missed ticks policy intr_en_notif callback also uses the
  > >     pending_mask
  > >     > method. So even though
  > >     > Linux does not broadcast its interrupts, the code could handle
  > >     it if it did.
  > >     > In this case the end of interrupt time stamp is made when the
  > >     pending_mask is
  > >     > clear.
  > >     >
  > >     > 4. Live Migration
  > >     >
  > >     > Live migration with hpet preserves the current offset of the
  > >     guest clock with
  > >     > respect
  > >     > to ntp. This is accomplished by migrating all of the state in
  > >     the h->hpet data
  > >     > structure
  > >     > in the usual way. The hp->mc_offset is recalculated on the
  > >     receiving node so
  > >     > that the
  > >     > guest sees a continuous hpet main counter.
  > >     >
  > >     > Code has been added to xc_domain_save.c to send a small message
  > >     after the
  > >     > domain context is sent. The contents of the message is the
  > >     physical tsc
  > >     > timestamp, last_tsc,
  > >     > read just before the message is sent. When the
  > last_tsc message
  > >     is received in
  > >     > xc_domain_restore.c,
  > >     > another physical tsc timestamp, cur_tsc, is read. The two
  > >     timestamps are
  > >     > loaded into the domain
  > >     > structure as last_tsc_sender and first_tsc_receiver with
  > >     hypercalls. Then
  > >     > xc_domain_hvm_setcontext
  > >     > is called so that hpet_load has access to these time stamps.
  > >     Hpet_load uses
  > >     > the timestamps
  > >     > to account for the time spent saving and loading the domain
  > >     context. With this
  > >     > technique,
  > >     > the only neglected time is the time spent sending a small
  > >     network message.
  > >     >
  > >     > 5. Test Results
  > >     >
  > >     > Some recent test results are:
  > >     >
  > >     > 5.1 Linux 4u664 and Windows 2k864 load test.
  > >     >       Duration: 70 hours.
  > >     >       Test date: 6/2/08
  > >     >       Loads: usex -b48 on Linux; burn-in on Windows
  > >     >       Guest vcpus: 8 for Linux; 2 for Windows
  > >     >       Hardware: 8 physical cpu AMD
  > >     >       Clock drift : Linux: .0012% Windows: .009%
  > >     >
  > >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
  > >     >       Duration: 23 hours.
  > >     >       Test date: 6/3/08
  > >     >       Loads: none
  > >     >       Guest vcpus: 8 for each Linux; 2 for Windows
  > >     >       Hardware: 4 physical cpu AMD
  > >     >       Clock drift : Linux: .033% Windows: .019%
  > >     >
  > >     > 6. Relation to recent work in xen-unstable
  > >     >
  > >     > There is a similarity between hvm_get_guest_time() in
  > >     xen-unstable and
  > >     > read_64_main_counter()
  > >     > in this code. However, read_64_main_counter() is more tuned to
  > >     the needs of
  > >     > hpet.c. It has no
  > >     > "set" operation, only the get. It isolates the mode,
  > physical or
  > >     simulated, in
  > >     > read_64_main_counter()
  > >     > itself. It uses no vcpu or domain state as it is a physical
  > >     entity, in either
  > >     > mode. And it provides a real
  > >     > physical mode for every read for those applications
  > that desire
  > >     this.
  > >     >
  > >     > 7. Conclusion
  > >     >
  > >     > The virtual hpet is improved by this patch in terms
  > of accuracy and
  > >     > monotonicity.
  > >     > Tests performed to date verify this and more testing
  > is under way.
  > >     >
  > >     > 8. Future Work
  > >     >
  > >     > Testing with Windows Vista will be performed soon. The reason
  > >     for accuracy
  > >     > variations
  > >     > on different platforms using the physical hpet device will be
  > >     investigated.
  > >     > Additional overhead measurements on simulated vs physical hpet
  > >     mode will be
  > >     > made.
  > >     >
  > >     > Footnotes:
  > >     >
  > >     > 1. I don't recall the accuracy improvement with end
  > of interrupt
  > >     stamping, but
  > >     > it was
  > >     > significant, perhaps better than two to one improvement. It
  > >     would be a very
  > >     > simple matter
  > >     > to re-measure the improvement as the facility can call back at
  > >     injection time
  > >     > as well.
  > >     >
  > >     >
  > >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
  > >     > <mailto:dwinchell@virtualiron.com>
  > >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
  > >     > <mailto:bguthro@virtualiron.com>
  > >     >
  > >     >
  > >     > _______________________________________________
  > >     > Xen-devel mailing list
  > >     > Xen-devel@lists.xensource.com
  > >     > http://lists.xensource.com/xen-devel
  > >
  > >
  > >
  >
  >





^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-08 21:10             ` Dan Magenheimer
@ 2008-06-08 23:26               ` Dave Winchell
  2008-06-09  7:36                 ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-08 23:26 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser; +Cc: Dave Winchell, xen-devel, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 28359 bytes --]

Hi Dan,

> Could you please be very precise, when you say "Linux",
> as to what you are (and are not) testing?  Specifically:
> 1) kernel version number and/or distro info

I personally have been testing with Red Hat Linux 4u4, 4u5
and 4u6 64 bit and Red Hat 4u4 32 bit. I've also
tested Windows 2k8sp0 64 bit and Vista sp1 64 bit.
Our QA group is currently testing a wider set of guests.

> 2) 32 vs 64

both.

> 3) kernel parameters specified

I'm pretty sloppy here. Frequently I have clock=pit because
our build process defaults to that and I know that the guests
I use ignore clock= when hpet is in the acpi table.
However, I don't recommend that others do this.
I check that Linux is using hpet in its boot log and I have
statistics in the patch which tell me that hpet is being used.
You've done a lot of investigation on clock= and clocksource=
so I would like to hear your recommendation.

> 4) config file parameters

Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks
for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks
for Windows 2k8-64 and Vista 64.
8 vcpus for Linux and 2 for Windows.
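
As a rough illustration of what the two timer_mode values mean, here is a
simplified model in C of the delivery decisions as the patch write-up
describes them. It is a sketch under stated assumptions, not the code in
hpet.c: the struct, the function names and the "one interrupt in flight at a
time" bookkeeping are hypothetical, and only pending_mask and intr_pending_nr
echo names from the write-up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; not the structure used in hpet.c. */
struct vhpet_model {
    uint64_t period;          /* timer period in main counter ticks */
    uint64_t last_handled;    /* main counter when the guest finished the last interrupt */
    unsigned intr_pending_nr; /* interrupts still owed to the guest */
    uint32_t pending_mask;    /* vcpus targeted but not yet injected */
};

/* Missed ticks policy: deliver only if a full period has passed since the
 * guest finished handling the previous clock interrupt. */
int missed_ticks_should_deliver(const struct vhpet_model *h, uint64_t now)
{
    assert(h != NULL);
    return now - h->last_handled >= h->period;
}

/* No-missed ticks policy, timer expiry: count the tick; post it now only if
 * no earlier interrupt is still outstanding. Returns 1 if the caller should post. */
int no_missed_on_expiry(struct vhpet_model *h, uint32_t target_vcpus)
{
    assert(h != NULL);
    if (h->intr_pending_nr++ != 0)
        return 0;
    h->pending_mask = target_vcpus;
    return 1;
}

/* No-missed ticks policy, injection callback for one vcpu: once every
 * targeted vcpu has been injected, retire the tick and post the next one
 * if any are still owed. Returns 1 if the caller should post again. */
int no_missed_on_injected(struct vhpet_model *h, unsigned vcpu, uint32_t target_vcpus)
{
    assert(h != NULL && vcpu < 32);
    h->pending_mask &= ~(UINT32_C(1) << vcpu);
    if (h->pending_mask != 0)
        return 0;
    assert(h->intr_pending_nr > 0);
    if (--h->intr_pending_nr == 0)
        return 0;
    h->pending_mask = target_vcpus;
    return 1;
}
```

The first function corresponds to HVM_HPET_guest_computes_missed_ticks
(Linux), the other two to HVM_HPET_guest_does_not_compute_missed_ticks
(Windows), where catch-up interrupts may arrive less than a period apart.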

> 5) relevant CPU info that may be passed through by Xen
>   to hvm guests (e.g. whether "tsc is synchronized")

Whatever xen-unstable does. I have not changed it.

> 6) relevant xen boot parameters (if any)

Nothing relevant.

> As we've seen, different combinations of the above can yield
> very different test results.  We'd like to confirm your tests,
> but if we can avoid unnecessary additional iterations (due to
> mismatches on the above), that would be helpful.

> Our testing goal is to ensure that there is at least one
> known good combination of parameters for each of RHEL3,
> RHEL4, and RHEL5 (both 32 and 64) and that works on
> both tsc-synchronized and tsc-unsynchronized Intel
> and AMD boxes.  And hopefully that works with and without
> a real physical hpet available.

Thanks very much for testing this.

> We don't have a good test environment for Windows time,
> but if you can provide the same test configuration detail,
> we may be able to do some testing.

The configuration information was given above.

Thanks,
Dave



-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Sun 6/8/2008 5:10 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
 
Hi Dave --

Thanks for the additional explanation.

Could you please be very precise, when you say "Linux",
as to what you are (and are not) testing?  Specifically:
1) kernel version number and/or distro info
2) 32 vs 64
3) kernel parameters specified
4) config file parameters
5) relevant CPU info that may be passed through by Xen
   to hvm guests (e.g. whether "tsc is synchronized")
6) relevant xen boot parameters (if any)

As we've seen, different combinations of the above can yield
very different test results.  We'd like to confirm your tests,
but if we can avoid unnecessary additional iterations (due to
mismatches on the above), that would be helpful.

Our testing goal is to ensure that there is at least one
known good combination of parameters for each of RHEL3,
RHEL4, and RHEL5 (both 32 and 64) and that works on
both tsc-synchronized and tsc-unsynchronized Intel
and AMD boxes.  And hopefully that works with and without
a real physical hpet available.

We don't have a good test environment for Windows time,
but if you can provide the same test configuration detail,
we may be able to do some testing.

Thanks,
Dan

  -----Original Message-----
  From: Dave Winchell [mailto:dwinchell@virtualiron.com]
  Sent: Sunday, June 08, 2008 2:32 PM
  To: dan.magenheimer@oracle.com; Keir Fraser
  Cc: Ben Guthro; xen-devel; Dave Winchell
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


  Hi Dan,

  > While I am fully supportive of offering hardware hpet as an option
  > for hvm guests (let's call it hwhpet=1 for shorthand), I am very
  > surprised by your preliminary results; the most obvious conclusion
  > is that Xen system time is losing time at the rate of 1000 PPM
  > though it's possible there's a bug somewhere else in the "time
  > stack".  Your Windows result is jaw-dropping and inexplicable,
  > though I have to admit ignorance of how Windows manages time.

  I think xen system time is fine. You have to add the interrupt
  delivery policies described in the write-up for the patch to get
  accurate timekeeping in the guest.

  The windows policy is obvious and results in a large improvement
  in accuracy. The Linux policy is more subtle, but is required to go
  from .1% to .03%.
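
For scale, the drift figures quoted in this thread can be converted to parts per million and to seconds per day. The following is a minimal C sketch of that arithmetic; the helper names are illustrative assumptions and are not part of the patch.

```c
/* Illustrative helpers; the names are hypothetical, not from the patch. */

/* Convert clock drift from percent to parts per million (1% = 10,000 PPM). */
static double drift_percent_to_ppm(double percent)
{
    return percent * 10000.0;
}

/* Seconds of error accumulated over one day (86,400 s) at a given PPM. */
static double drift_ppm_to_seconds_per_day(double ppm)
{
    return ppm * 86400.0 / 1e6;
}

/* The ntp synchronization limit cited in this thread is .05% (500 PPM). */
static int drift_within_ntp_limit(double percent)
{
    return percent < 0.05;
}
```

With these definitions, .1% is about 1000 PPM or roughly 86 seconds per day, while the .05% ntp limit is about 500 PPM or roughly 43 seconds per day.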

  > I think with my recent patch and hpet=1 (essentially the same as
  > your emulated hpet), hvm guest time should track Xen system time.
  > I wonder if domain0 (which if I understand correctly is directly
  > using Xen system time) is also seeing an error of .1%?  Also
  > I wonder for the skew you are seeing (in both hvm guests and
  > domain0) is time moving too fast or too slow?

  I don't recall the direction. I can look it up in my notes at work
  tomorrow.

  > Although hwhpet=1 is a fine alternative in many cases, it may
  > be unavailable on some systems and may cause significant performance
  > issues on others.  So I think we will still need to track down
  > the poor accuracy when hwhpet=0.

  Our patch is accurate to < .03% using the physical hpet mode or
  the simulated mode.

  > And if for some reason
  > Xen system time can't be made accurate enough (< 0.05%), then
  > I think we should consider building Xen system time itself on
  > top of hardware hpet instead of TSC... at least when Xen discovers
  > a capable hpet.

  In our experience, Xen system time is accurate enough now.

  > One more thought... do you know the accuracy of the TSC crystals
  > on your test systems?  I posted a patch awhile ago that was
  > intended to test that, though I guess it was only testing skew
  > of different TSCs on the same system, not TSCs against an
  > external time source.

  I do not know the tsc accuracy.

  > Or maybe there's a computation error somewhere in the hvm hpet
  > scaling code?  Hmmm...


  Regards,
  Dave


  -----Original Message-----
  From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
  Sent: Fri 6/6/2008 4:29 PM
  To: Dave Winchell; Keir Fraser
  Cc: Ben Guthro; xen-devel
  Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

  Dave --

  Thanks much for posting the preliminary results!

  While I am fully supportive of offering hardware hpet as an option
  for hvm guests (let's call it hwhpet=1 for shorthand), I am very
  surprised by your preliminary results; the most obvious conclusion
  is that Xen system time is losing time at the rate of 1000 PPM
  though it's possible there's a bug somewhere else in the "time
  stack".  Your Windows result is jaw-dropping and inexplicable,
  though I have to admit ignorance of how Windows manages time.


  I think with my recent patch and hpet=1 (essentially the same as
  your emulated hpet), hvm guest time should track Xen system time.
  I wonder if domain0 (which if I understand correctly is directly
  using Xen system time) is also seeing an error of .1%?  Also
  I wonder for the skew you are seeing (in both hvm guests and
  domain0) is time moving too fast or too slow?

  Although hwhpet=1 is a fine alternative in many cases, it may
  be unavailable on some systems and may cause significant performance
  issues on others.  So I think we will still need to track down
  the poor accuracy when hwhpet=0.  And if for some reason
  Xen system time can't be made accurate enough (< 0.05%), then
  I think we should consider building Xen system time itself on
  top of hardware hpet instead of TSC... at least when Xen discovers
  a capable hpet.

  One more thought... do you know the accuracy of the TSC crystals
  on your test systems?  I posted a patch awhile ago that was
  intended to test that, though I guess it was only testing skew
  of different TSCs on the same system, not TSCs against an
  external time source.

  Or maybe there's a computation error somewhere in the hvm hpet
  scaling code?  Hmmm...

  Thanks,
  Dan

  > -----Original Message-----
  > From: Dave Winchell [mailto:dwinchell@virtualiron.com]
  > Sent: Friday, June 06, 2008 1:33 PM
  > To: dan.magenheimer@oracle.com; Keir Fraser
  > Cc: Ben Guthro; xen-devel; Dave Winchell
  > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  >
  >
  > Dan, Keir:
  >
  > Preliminary tests results indicate an error of .1% for Linux 64 bit
  > guests configured
  > for hpet with xen-unstable as is. As we have discussed many times, the
  > ntp requirement is .05%.
  > Tests on the patch we just submitted for hpet have indicated errors of
  > .0012%
  > on this platform under similar test conditions and .03% on
  > other platforms.
  >
  > Windows vista64 has an error of 11% using hpet with the
  > xen-unstable bits.
  > In an overnight test with our hpet patch, the Windows vista
  > error was .008%.
  >
  > The tests are with two or three guests on a physical node, all under
  > load, and with
  > the ratio of vcpus to phys cpus > 1.
  >
  > I will continue to run tests over the next few days.
  >
  > thanks,
  > Dave
  >
  >
  > Dan Magenheimer wrote:
  >
  > > Hi Dave and Ben --
  > >
  > > When running tests on xen-unstable (without your patch),
  > please ensure
  > > that hpet=1 is set in the hvm config and also I think that when hpet
  > > is the clocksource on RHEL4-32, the clock IS resilient to
  > missed ticks
  > > so timer_mode should be 2 (vs when pit is the clocksource
  > on RHEL4-32,
  > > all clock ticks must be delivered and so timer_mode should be 0).
  > >
  > > Per
  > >
  > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg
  > 00098.html it's
  > > my intent to clean this up, but I won't get to it until next week.
  > >
  > > Thanks,
  > > Dan
  > >
  > >     -----Original Message-----
  > >     *From:* xen-devel-bounces@lists.xensource.com
  > >     [mailto:xen-devel-bounces@lists.xensource.com]*On
  > Behalf Of *Dave
  > >     Winchell
  > >     *Sent:* Friday, June 06, 2008 4:46 AM
  > >     *To:* Keir Fraser; Ben Guthro; xen-devel
  > >     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
  > >     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  > >
  > >     Keir,
  > >
  > >     I think the changes are required. We'll run some tests today so
  > >     that we have some data to talk about.
  > >
  > >     -Dave
  > >
  > >
  > >     -----Original Message-----
  > >     From: xen-devel-bounces@lists.xensource.com on behalf
  > of Keir Fraser
  > >     Sent: Fri 6/6/2008 4:58 AM
  > >     To: Ben Guthro; xen-devel
  > >     Cc: dan.magenheimer@oracle.com
  > >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
  > >
  > >     Are these patches needed now the timers are built on Xen system
  > >     time rather
  > >     than host TSC? Dan has reported much better time-keeping with his
  > >     patch checked in, and it's for sure a lot less invasive than this
  > >     patchset.
  > >
  > >
  > >      -- Keir
  > >
  > >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
  > >
  > >     >
  > >     > 1. Introduction
  > >     >
  > >     > This patch improves the hpet based guest clock in
  > terms of drift and
  > >     > monotonicity.
  > >     > Prior to this work the drift with hpet was greater
  > than 2%, far
  > >     above the .05%
  > >     > limit
  > >     > for ntp to synchronize. With this code, the drift ranges from
  > >     .001% to .0033%
  > >     > depending
  > >     > on guest and physical platform.
  > >     >
  > >     > Using hpet allows guest operating systems to provide monotonic
  > >     time to their
  > >     > applications. Time sources other than hpet are not
  > monotonic because
  > >     > of their reliance on tsc, which is not synchronized
  > across physical
  > >     > processors.
  > >     >
  > >     > Windows 2k864 and many Linux guests are supported with two
  > >     policies, one for
  > >     > guests
  > >     > that handle missed clock interrupts and the other for guests
  > >     that require the
  > >     > correct number of interrupts.
  > >     >
  > >     > Guests may use hpet for the timing source even if the physical
  > >     platform has no
  > >     > visible
  > >     > hpet. Migration is supported between physical machines which
  > >     differ in
  > >     > physical
  > >     > hpet visibility.
  > >     >
  > >     > Most of the changes are in hpet.c. Two general facilities are
  > >     added to track
  > >     > interrupt
  > >     > progress. The ideas here and the facilities would be useful in
  > >     vpt.c, for
  > >     > other time
  > >     > sources, though no attempt is made here to improve vpt.c.
  > >     >
  > >     > The following sections discuss hpet dependencies, interrupt
  > >     delivery policies,
  > >     > live migration,
  > >     > test results, and relation to recent work with monotonic time.
  > >     >
  > >     >
  > >     > 2. Virtual Hpet dependencies
  > >     >
  > >     > The virtual hpet depends on the ability to read the
  > physical or
  > >     simulated
  > >     > (see discussion below) hpet.  For timekeeping, the
  > virtual hpet
  > >     also depends
  > >     > on two new interrupt notification facilities to implement its
  > >     policies for
  > >     > interrupt delivery.
  > >     >
  > >     > 2.1. Two modes of low-level hpet main counter reads.
  > >     >
  > >     > In this implementation, the virtual hpet reads with
  > >     read_64_main_counter(),
  > >     > exported by
  > >     > time.c, either the real physical hpet main counter register
  > >     directly or a
  > >     > "simulated"
  > >     > hpet main counter.
  > >     >
  > >     > The simulated mode uses a monotonic version of get_s_time()
  > >     (NOW()), where the
  > >     > last
  > >     > time value is returned whenever the current time value is less
  > >     than the last
  > >     > time
  > >     > value. In simulated mode, since it is layered on s_time, the
  > >     underlying
  > >     > hardware
  > >     > can be hpet or some other device. The frequency of the main
  > >     counter in
  > >     > simulated
  > >     > mode is the same as the standard physical hpet frequency,
  > >     allowing live
  > >     > migration
  > >     > between nodes that are configured differently.
  > >     >
  > >     > If the physical platform does not have an hpet
  > device, or if xen
  > >     is configured
  > >     > not
  > >     > to use the device, then the simulated method is used. If there
  > >     is a physical
  > >     > hpet device,
  > >     > and xen has initialized it, then either simulated or physical
  > >     mode can be
  > >     > used.
  > >     > This is governed by a boot time option, hpet-avoid.
  > Setting this
  > >     option to 1
  > >     > gives the
  > >     > simulated mode and 0 the physical mode. The default
  > is physical
  > >     mode.
  > >     >
  > >     > A disadvantage of the physical mode is that it may take longer
  > >     > to read the device than in simulated mode. On some platforms
  > >     > the cost is about the same (less than 250 nsec) for physical
  > >     > and simulated modes, while on others the physical cost is much
  > >     > higher than simulated. A disadvantage of the simulated mode is
  > >     > that it can return the same value for the counter in
  > >     > consecutive calls.
  > >     >
  > >     > 2.2. Interrupt notification facilities.
  > >     >
  > >     > Two interrupt notification facilities are introduced, one is
  > >     > hvm_isa_irq_assert_cb()
  > >     > and the other hvm_register_intr_en_notif().
  > >     >
  > >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
  > >     the vioapic.
  > >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
  > >     > vioapic_deliver()
  > >     > and this callback is called with a mask of the vcpus
  > which will
  > >     get the
  > >     > interrupt. This callback is made before any vcpus receive an
  > >     interrupt.
  > >     >
  > >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
  > >     for a particular
  > >     > vector that will be called when that vector is injected in
  > >     > [vmx,svm]_intr_assist()
  > >     > and also when the guest finishes handling the interrupt. Here
  > >     finished is
  > >     > defined
  > >     > as the point when the guest re-enables interrupts or
  > lowers the
  > >     tpr value.
  > >     > EOI is not used as the end of interrupt as this is sometimes
  > >     returned before
  > >     > the interrupt handler has done its work. A flag is
  > passed to the
  > >     handler
  > >     > indicating
  > >     > whether this is the injection point (post = 1) or the
  > interrupt
  > >     finished (post
  > >     > = 0) point.
  > >     > The need for the finished point callback is discussed in the
  > >     missed ticks
  > >     > policy section.
  > >     >
  > >     > To prevent a possible early trigger of the finished callback,
  > >     intr_en_notif
  > >     > logic
  > >     > has a two stage arm, the first at injection
  > >     (hvm_intr_en_notif_arm()) and the
  > >     > second when
  > >     > interrupts are seen to be disabled
  > (hvm_intr_en_notif_disarm()).
  > >     Once fully
  > >     > armed, re-enabling
  > >     > interrupts will cause hvm_intr_en_notif_disarm() to
  > make the end
  > >     of interrupt
  > >     > callback. hvm_intr_en_notif_arm() and
  > hvm_intr_en_notif_disarm()
  > >     are called by
  > >     > [vmx,svm]_intr_assist().
  > >     >
  > >     > 3. Interrupt delivery policies
  > >     >
  > >     > The existing hpet interrupt delivery is preserved.
  > This includes
  > >     > vcpu round robin delivery used by Linux and broadcast delivery
  > >     used by
  > >     > Windows.
  > >     >
  > >     > There are two policies for interrupt delivery, one for Windows
  > >     2k8-64 and the
  > >     > other
  > >     > for Linux. The Linux policy takes advantage of the
  > (guest) Linux
  > >     missed tick
  > >     > and offset
  > >     > calculations and does not attempt to deliver the
  > right number of
  > >     interrupts.
  > >     > The Windows policy delivers the correct number of interrupts,
  > >     even if
  > >     > sometimes much
  > >     > closer to each other than the period. The policies are similar
  > >     to those in
  > >     > vpt.c, though
  > >     > there are some important differences.
  > >     >
  > >     > Policies are selected with an HVMOP_set_param
  > hypercall with index
  > >     > HVM_PARAM_TIMER_MODE.
  > >     > Two new values are added,
  > HVM_HPET_guest_computes_missed_ticks and
  > >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
  > >     two new ones
  > >     > are added is that
  > >     > in some guests (32bit Linux) a no-missed policy is needed for
  > >     clock sources
  > >     > other than hpet
  > >     > and a missed ticks policy for hpet. It was felt that
  > there would
  > >     be less
  > >     > confusion by simply
  > >     > introducing the two hpet policies.
  > >     >
  > >     > 3.1. The missed ticks policy
  > >     >
  > >     > The Linux clock interrupt handler for hpet calculates missed
  > >     ticks and offset
  > >     > using the hpet
  > >     > main counter. The algorithm works well when the time since the
  > >     last interrupt
  > >     > is greater than
  > >     > or equal to a period and poorly otherwise.
  > >     >
  > >     > The missed ticks policy ensures that no two clock
  > interrupts are
  > >     delivered to
  > >     > the guest at
  > >     > a time interval less than a period. A time stamp (hpet main
  > >     counter value) is
  > >     > recorded (by a
  > >     > callback registered with hvm_register_intr_en_notif)
  > when Linux
  > >     finishes
  > >     > handling the clock
  > >     > interrupt. Then, ensuing interrupts are delivered to
  > the vioapic
  > >     only if the
  > >     > current main
  > >     > counter value is a period greater than when the last interrupt
  > >     was handled.
  > >     >
  > >     > Tests showed a significant improvement in clock drift with end
  > >     of interrupt
  > >     > time stamps
  > >     > versus beginning of interrupt[1]. It is believed that
  > the reason
  > >     for the
  > >     > improvement
  > >     > is that the clock interrupt handler goes for a
  > spinlock and can
  > >     be therefore
  > >     > delayed in its
  > >     > processing. Furthermore, the main counter is read by the guest
  > >     under the lock.
  > >     > The net
  > >     > effect is that if we time stamp injection, we can get the
  > >     difference in time
  > >     > between successive interrupt handler lock acquisitions to be
  > >     less than the
  > >     > period.
  > >     >
  > >     > 3.2. The no-missed ticks policy
  > >     >
  > >     > Windows 2k864 keeps very poor time with the missed
  > ticks policy.
  > >     So the
  > >     > no-missed ticks policy
  > >     > was developed. In the no-missed ticks policy we deliver the
  > >     correct number of
  > >     > interrupts,
  > >     > even if they are spaced less than a period apart
  > (when catching up).
  > >     >
  > >     > Windows 2k864 uses a broadcast mode in the interrupt routing
  > >     > such that all vcpus get the clock interrupt. The best Windows
  > >     > drift performance was achieved when the policy code ensured
  > >     > that all the previous interrupts (on the various vcpus) had
  > >     > been injected before injecting the next interrupt to the
  > >     > vioapic.
  > >     >
  > >     > The policy code works as follows. It uses the
  > >     hvm_isa_irq_assert_cb() to
  > >     > record
  > >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
  > >     the callback
  > >     > registered
  > >     > with hvm_register_intr_en_notif() at post=1 time it clears the
  > >     current vcpu in
  > >     > the pending_mask.
  > >     > When the pending_mask is clear it decrements
  > >     hpet.intr_pending_nr and if
  > >     > intr_pending_nr is still
  > >     > non-zero posts another interrupt to the ioapic with
  > >     hvm_isa_irq_assert_cb().
  > >     > Intr_pending_nr is incremented in
  > >     hpet_route_decision_not_missed_ticks().
  > >     >
  > >     > The missed ticks policy intr_en_notif callback also uses the
  > >     pending_mask
  > >     > method. So even though
  > >     > Linux does not broadcast its interrupts, the code could handle
  > >     it if it did.
  > >     > In this case the end of interrupt time stamp is made when the
  > >     pending_mask is
  > >     > clear.
  > >     >
  > >     > 4. Live Migration
  > >     >
  > >     > Live migration with hpet preserves the current offset of the
  > >     guest clock with
  > >     > respect
  > >     > to ntp. This is accomplished by migrating all of the state in
  > >     the h->hpet data
  > >     > structure
  > >     > in the usual way. The hp->mc_offset is recalculated on the
  > >     receiving node so
  > >     > that the
  > >     > guest sees a continuous hpet main counter.
  > >     >
  > >     > Code has been added to xc_domain_save.c to send a small message
  > >     > after the domain context is sent. The content of the message is
  > >     > the physical tsc timestamp, last_tsc, read just before the
  > >     > message is sent. When the last_tsc message is received in
  > >     > xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is
  > >     > read. The two timestamps are loaded into the domain structure as
  > >     > last_tsc_sender and first_tsc_receiver with hypercalls. Then
  > >     > xc_domain_hvm_setcontext is called so that hpet_load has access
  > >     > to these time stamps. Hpet_load uses the timestamps to account
  > >     > for the time spent saving and loading the domain context. With
  > >     > this technique, the only neglected time is the time spent
  > >     > sending a small network message.
  > >     >
  > >     > 5. Test Results
  > >     >
  > >     > Some recent test results are:
  > >     >
  > >     > 5.1 Linux 4u664 and Windows 2k864 load test.
  > >     >       Duration: 70 hours.
  > >     >       Test date: 6/2/08
  > >     >       Loads: usex -b48 on Linux; burn-in on Windows
  > >     >       Guest vcpus: 8 for Linux; 2 for Windows
  > >     >       Hardware: 8 physical cpu AMD
  > >     >       Clock drift : Linux: .0012% Windows: .009%
  > >     >
  > >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
  > >     >       Duration: 23 hours.
  > >     >       Test date: 6/3/08
  > >     >       Loads: none
  > >     >       Guest vcpus: 8 for each Linux; 2 for Windows
  > >     >       Hardware: 4 physical cpu AMD
  > >     >       Clock drift : Linux: .033% Windows: .019%
  > >     >
  > >     > 6. Relation to recent work in xen-unstable
  > >     >
  > >     > There is a similarity between hvm_get_guest_time() in
  > >     xen-unstable and
  > >     > read_64_main_counter()
  > >     > in this code. However, read_64_main_counter() is more tuned to
  > >     the needs of
  > >     > hpet.c. It has no
  > >     > "set" operation, only the get. It isolates the mode,
  > physical or
  > >     simulated, in
  > >     > read_64_main_counter()
  > >     > itself. It uses no vcpu or domain state as it is a physical
  > >     entity, in either
  > >     > mode. And it provides a real
  > >     > physical mode for every read for those applications
  > that desire
  > >     this.
  > >     >
  > >     > 7. Conclusion
  > >     >
  > >     > The virtual hpet is improved by this patch in terms
  > of accuracy and
  > >     > monotonicity.
  > >     > Tests performed to date verify this and more testing
  > is under way.
  > >     >
  > >     > 8. Future Work
  > >     >
  > >     > Testing with Windows Vista will be performed soon. The reason
  > >     for accuracy
  > >     > variations
  > >     > on different platforms using the physical hpet device will be
  > >     investigated.
  > >     > Additional overhead measurements on simulated vs physical hpet
  > >     mode will be
  > >     > made.
  > >     >
  > >     > Footnotes:
  > >     >
  > >     > 1. I don't recall the accuracy improvement with end
  > of interrupt
  > >     stamping, but
  > >     > it was
  > >     > significant, perhaps better than two to one improvement. It
  > >     would be a very
  > >     > simple matter
  > >     > to re-measure the improvement as the facility can call back at
  > >     injection time
  > >     > as well.
  > >     >
  > >     >
  > >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
  > >     > <mailto:dwinchell@virtualiron.com>
  > >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
  > >     > <mailto:bguthro@virtualiron.com>
  > >     >
  > >     >
  > >     > _______________________________________________
  > >     > Xen-devel mailing list
  > >     > Xen-devel@lists.xensource.com
  > >     > http://lists.xensource.com/xen-devel
  > >
  > >
  > >
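
The simulated-mode read described in section 2.1 of the quoted description (return the last value whenever the current time is less than the last) might look like the minimal C sketch below. The state variable and function name are hypothetical, the raw time is passed in rather than read from get_s_time(), and no locking is shown, although a real hypervisor implementation would need it across physical cpus.

```c
#include <stdint.h>

/* Hypothetical state: the last value handed out. Not safe across cpus. */
static uint64_t last_returned_ns;

/*
 * Return a time value that never goes backwards: if the raw reading is
 * behind the last value returned, return the last value again.
 */
static uint64_t monotonic_time_ns(uint64_t raw_now_ns)
{
    if (raw_now_ns < last_returned_ns)
        return last_returned_ns;
    last_returned_ns = raw_now_ns;
    return raw_now_ns;
}
```

A backwards step in the raw time yields the same value twice in a row, which is the disadvantage of simulated mode noted in the description.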
  >
  >
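
The missed ticks policy in section 3.1 of the quoted description (time stamp the end of interrupt handling, then deliver the next interrupt only once a full period has passed) can be sketched roughly as follows. This is a simplified illustration, not the code in the patch: the struct and function names are assumptions, and the counter values are passed in rather than read from the hpet main counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state for the missed ticks policy. */
struct tick_gate {
    uint64_t period;        /* timer period, in main counter ticks */
    uint64_t last_handled;  /* counter value when the guest finished the last tick */
};

/* Called at the "interrupt finished" point: record the time stamp. */
static void tick_gate_interrupt_finished(struct tick_gate *g, uint64_t counter)
{
    g->last_handled = counter;
}

/* Deliver only if at least one period has elapsed since the last finish. */
static bool tick_gate_may_deliver(const struct tick_gate *g, uint64_t counter)
{
    /* Unsigned subtraction stays correct if the counter wraps. */
    return counter - g->last_handled >= g->period;
}
```

Because the stamp is taken when the guest finishes its handler rather than at injection, two handler runs cannot observe main counter values less than a period apart.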





[-- Attachment #1.2: Type: text/html, Size: 48614 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-08 23:26               ` Dave Winchell
@ 2008-06-09  7:36                 ` Keir Fraser
  2008-06-09 11:13                   ` Dave Winchell
  2008-06-09 20:48                   ` Dan Magenheimer
  0 siblings, 2 replies; 51+ messages in thread
From: Keir Fraser @ 2008-06-09  7:36 UTC (permalink / raw)
  To: Dave Winchell, dan.magenheimer; +Cc: xen-devel, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 1111 bytes --]

On 9/6/08 00:26, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

>> > 4) config file parameters
> 
> Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks
> for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks
> for Windows 2k8-64 and Vista 64.
> 8 vcpus for Linux and 2 for Windows.

These new HVM_HPET options seem a weird design choice. It appears that you
can only set these or one of the old options, so there's not actually any
independence between the mode used by vpt.c and the mode used by hpet.c. At
guest install time you ought to be able to tell whether the guest will use
hpet or not based on its version (RHELx, SLESy, Winz etc etc) and decide
whether missed-ticks accounting is required or not.
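
As a rough illustration of making this decision at install time, the C sketch below maps a guest type to whether the guest itself accounts for missed ticks. The enum, its members and the function are hypothetical; the classification only restates what has been reported in this thread (Linux on hpet computes missed ticks; RHEL4-32 on pit, Windows 2k8-64 and Vista 64 need every tick delivered).

```c
#include <stdbool.h>

/* Hypothetical guest classification, for illustration only. */
enum guest_kind {
    GUEST_LINUX_HPET,       /* Linux using hpet as its clocksource */
    GUEST_RHEL4_32_PIT,     /* RHEL4-32 using pit as its clocksource */
    GUEST_WINDOWS_2K8_64,
    GUEST_WINDOWS_VISTA_64
};

/*
 * Returns true when the guest compensates for missed ticks on its own,
 * false when the correct number of interrupts must be delivered.
 */
static bool guest_computes_missed_ticks(enum guest_kind kind)
{
    switch (kind) {
    case GUEST_LINUX_HPET:
        return true;
    case GUEST_RHEL4_32_PIT:
    case GUEST_WINDOWS_2K8_64:
    case GUEST_WINDOWS_VISTA_64:
        return false;
    }
    return false; /* unknown guests: assume every tick must be delivered */
}
```

A toolstack could consult a table like this once, when the guest is created, and set the existing timer_mode accordingly.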

I'd be more agreeable to a patch that stripped out the physical hpet
accesses (since you say they are not the reason for the improvement in
accuracy), built hpet on top of vpt, and added the necessary extra
mechanisms to deal with interrupt broadcasts into vpt.c. And which was split
into more separate pieces of mechanism.

 -- Keir

[-- Attachment #1.2: Type: text/html, Size: 1584 bytes --]


^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-09  7:36                 ` Keir Fraser
@ 2008-06-09 11:13                   ` Dave Winchell
  2008-06-09 12:03                     ` Keir Fraser
  2008-06-09 20:48                   ` Dan Magenheimer
  1 sibling, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-09 11:13 UTC (permalink / raw)
  To: Keir Fraser, dan.magenheimer; +Cc: Dave Winchell, xen-devel, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 2304 bytes --]

> These new HVM_HPET options seem a weird design choice. It appears that you
> can only set these or one of the old options, so there's not actually any
> independence between the mode used by vpt.c and the mode used by hpet.c. At
> guest install time you ought to be able to tell whether the guest will use
> hpet or not based on its version (RHELx, SLESy, Winz etc etc) and decide
> whether missed-ticks accounting is required or not.

I'll use the existing options instead.

> I'd be more agreeable to a patch that stripped out the physical hpet
> accesses (since you say they are not the reason for the improvement in
> accuracy), built hpet on top of vpt, and added the necessary extra
> mechanisms to deal with interrupt broadcasts into vpt.c. And which was split
> into more separate pieces of mechanism.

ok, I'll work on this. How much time do I have to make the release you are
working on?

thanks,
Dave

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Mon 6/9/2008 3:36 AM
To: Dave Winchell; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 9/6/08 00:26, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

>> > 4) config file parameters
> 
> Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks
> for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks
> for Windows 2k8-64 and Vista 64.
> 8 vcpus for Linux and 2 for Windows.

These new HVM_HPET options seem a weird design choice. It appears that you
can only set these or one of the old options, so there's not actually any
independence between the mode used by vpt.c and the mode used by hpet.c. At
guest install time you ought to be able to tell whether the guest will use
hpet or not based on its version (RHELx, SLESy, Winz etc etc) and decide
whether missed-ticks accounting is required or not.

I'd be more agreeable to a patch that stripped out the physical hpet
accesses (since you say they are not the reason for the improvement in
accuracy), built hpet on top of vpt, and added the necessary extra
mechanisms to deal with interrupt broadcasts into vpt.c. And which was split
into more separate pieces of mechanism.

 -- Keir

[-- Attachment #1.2: Type: text/html, Size: 3043 bytes --]


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 11:13                   ` Dave Winchell
@ 2008-06-09 12:03                     ` Keir Fraser
  2008-06-09 12:10                       ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-09 12:03 UTC (permalink / raw)
  To: Dave Winchell, dan.magenheimer; +Cc: xen-devel, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 597 bytes --]

On 9/6/08 12:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

>> > I'd be more agreeable to a patch that stripped out the physical hpet
>> > accesses (since you say they are not the reason for the improvement in
>> > accuracy), built hpet on top of vpt, and added the necessary extra
>> > mechanisms to deal with interrupt broadcasts into vpt.c. And which was
>> split
>> > into more separate pieces of mechanism.
> 
> ok, I'll work on this. How much time do I have to make the release you are
> working on?

I'm thinking feature freeze at the end of the month.

 -- Keir


[-- Attachment #1.2: Type: text/html, Size: 1072 bytes --]


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 12:03                     ` Keir Fraser
@ 2008-06-09 12:10                       ` Keir Fraser
  2008-06-09 13:08                         ` Dave Winchell
  0 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-09 12:10 UTC (permalink / raw)
  To: Dave Winchell, dan.magenheimer; +Cc: xen-devel, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 1220 bytes --]

On 9/6/08 13:03, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> ok, I'll work on this. How much time do I have to make the release you are
>> working on?
> 
> I'm thinking feature freeze at the end of the month.

Oh, while I'm commenting on the current version of the patch, I will point
out that changes to the save/restore format need careful thought. At the
very least we'd like to be backward compatible (old images restore on new
Xen) even if we don't achieve compatibility the other way round. I didn't
really look too closely at this aspect of the patches so perhaps these
patches are fine in this regard. Either way I think the core re-architecting
of the hpet device model to handle missed ticks and so on is independent of
interface changes anyway (apart from reasonable extensions to the timer_mode
option). Changes to save/restore format, addition of extra debugging code,
and peripheral things like that it'd be nice to have in separate patches. It
makes the core stuff easier to review and more likely to get accepted
without quibble (because there's less of it, and it is all dedicated to a
single purpose which I think we all agree is where we want to be).

 Thanks,
 Keir


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 12:10                       ` Keir Fraser
@ 2008-06-09 13:08                         ` Dave Winchell
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-09 13:08 UTC (permalink / raw)
  To: Keir Fraser; +Cc: dan.magenheimer, xen-devel, Dave Winchell, Ben Guthro

Keir Fraser wrote:

> On 9/6/08 13:03, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>
>         ok, I'll work on this. How much time do I have to make the
>         release you are
>         working on?
>
>
>     I’m thinking feature freeze at the end of the month.
>
>
> Oh, while I’m commenting on the current version of the patch, I will 
> point out that changes to the save/restore format need careful 
> thought. At the very least we’d like to be backward compatible (old 
> images restore on new Xen) even if we don’t achieve compatibility the 
> other way round. I didn’t really look too closely at this aspect of 
> the patches so perhaps these patches are fine in this regard.

I'll look into the live migrate from old to new.

> Either way I think the core re-architecting of the hpet device model 
> to handle missed ticks and so on is independent of interface changes 
> anyway (apart from reasonable extensions to the timer_mode option). 
> Changes to save/restore format, addition of extra debugging code, and 
> peripheral things like that it’d be nice to have in separate patches. 
> It makes the core stuff easier to review and more likely to get 
> accepted without quibble (because there’s less of it, and it is all 
> dedicated to a single purpose which I think we all agree is where we 
> want to be).

This is fine. I'll break up the patch along the lines you suggest.

>
> Thanks,
> Keir

Thanks,
Dave

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-09  7:36                 ` Keir Fraser
  2008-06-09 11:13                   ` Dave Winchell
@ 2008-06-09 20:48                   ` Dan Magenheimer
  2008-06-09 21:18                     ` Keir Fraser
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-09 20:48 UTC (permalink / raw)
  To: Keir Fraser, Dave Winchell; +Cc: xen-devel, Ben Guthro

> At guest install time you ought to be able to tell whether the guest
> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> and decide whether missed-ticks accounting is required or not.

Unfortunately this is not true on Linux, at least without gathering
(and hardcoding) more information about the system.  Whether hpet is
used or not is dependent not only on the OS/version and hvm config
parameters, but also on kernel command line parameters and even
the underlying CPU.  For example, on RHEL5u1, if the tsc is synchronized
and the CPU is Intel, and no kernel parameters are chosen, tsc will be
chosen as the default clocksource even if hpet is present.  Ugly.
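
One way to see which clocksource a running Linux guest actually picked is
to read it back from sysfs. This is only a sketch: it assumes the guest
kernel exposes the generic clocksource sysfs interface (older kernels may
not), and the helper name read_clocksource() is made up for illustration.

```python
from pathlib import Path

# Generic clocksource sysfs directory; assumed to exist on the guest kernel.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_clocksource()
        print(f"current={current} available={available}")
    except OSError as err:
        # Kernel without the generic clocksource interface, or not Linux.
        print(f"clocksource sysfs not readable: {err}")
```

If the guest reports tsc here even though hpet is offered in the hvm
config, the hpet timer_mode setting is not what governs its timekeeping.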

That said, if Dave's patch achieves the stated accuracy on most
versions of Linux (e.g. at least RHEL4+5, 32+64, smp+1p) for SOME
set of parameters (which might be different on each Linux version),
it would still be better than what we have now.

The ideal solution, I think, would be for the default hvm settings
to "do the right thing" for both Windows and Linux at least for the
vast majority of configuration choices.  I'm not sure this is possible,
but it sure would be nice.

Dan

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Monday, June 09, 2008 1:36 AM
To: Dave Winchell; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 9/6/08 00:26, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> 4) config file parameters

Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks
for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks
for Windows 2k8-64 and Vista 64.
8 vcpus for Linux and 2 for Windows.

These new HVM_HPET options seem a weird design choice. It appears that you can only set these or one of the old options, so there's not actually any independence between the mode used by vpt.c and the mode used by hpet.c. At guest install time you ought to be able to tell whether the guest will use hpet or not based on its version (RHELx, SLESy, Winz etc etc) and decide whether missed-ticks accounting is required or not.

I'd be more agreeable to a patch that stripped out the physical hpet accesses (since you say they are not the reason for the improvement in accuracy), built hpet on top of vpt, and added the necessary extra mechanisms to deal with interrupt broadcasts into vpt.c. And which was split into more separate pieces of mechanism.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-07 21:20             ` Dave Winchell
@ 2008-06-09 21:07               ` Dan Magenheimer
  2008-06-09 21:44                 ` Dave Winchell
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-09 21:07 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser; +Cc: xen-devel, Ben Guthro

I'm wondering what is "magic" about 0.03% in all the non-hw-hpet
measurements.  Is that just the accuracy of the underlying tsc
on your test system, e.g. the skew of tsc relative to an external
(ntp) source?  Or is Xen (tsc-based) system time skewing that much
on an overcommitted system (and skewing much less than 0.03% on an
unloaded system)?

Running the following on dom0 both on an unloaded and overcommitted
system (with ntpd off in dom0 and all guests) might be interesting:

# ntpdate $NTPSERVER; sleep 3600; ntpdate -q $NTPSERVER
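
To turn the result into a drift figure, divide the offset reported by the
second ntpdate by the length of the sleep. A minimal sketch, assuming the
first ntpdate leaves the clock at roughly zero offset and the offset is
read off in seconds; the function names and the sample reading are
illustrative, not measurements:

```python
NTP_LIMIT_PERCENT = 0.05  # ntp synchronization limit quoted in this thread


def drift_percent(offset_seconds, elapsed_seconds):
    """Clock drift as a percentage of the elapsed interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * offset_seconds / elapsed_seconds


def drift_ppm(offset_seconds, elapsed_seconds):
    """Same drift expressed in parts per million."""
    return 1e4 * drift_percent(offset_seconds, elapsed_seconds)


# Hypothetical reading: 1.08 s of offset accumulated over the 3600 s sleep.
sample = drift_percent(1.08, 3600)
print(f"{sample:.4f}% drift, under ntp limit: {abs(sample) < NTP_LIMIT_PERCENT}")
```

Running it once unloaded and once overcommitted gives two numbers that can
be compared directly with the guest figures.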

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Saturday, June 07, 2008 3:21 PM
To: Keir Fraser; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> Possibly there are bugs in the hpet device model which are fixed by Dave's
> patch. If this is actually the case, it would be nice to break those out as
> separate patches, as I think an 11% drift must largely be due to
> device-model bugs rather than relatively insignificant differences between
> hvm_get_guest_time() and physical HPET main counter.

Hi Keir,

I tried an experiment on Friday where I short circuited the missed ticks policy
code in the hpet.c patch, but used the physical hpet each access. The result for Linux
was a drift of .1%, same as the xen-unstable bits.

Conversely I get very good drift numbers, i.e., under .03%, when using the missed ticks
policy code and  running in simulated mode (layered on stime) when stime uses hpet.

So clearly, the improvement from .1% to .03% is due to the policy code.
I haven't run the short circuit test with the windows policy but I can do that
on Monday.

Note: For Windows and Linux I get < .03% drift using the policy code and running
simulated mode whether stime is using hpet or some other device.

regards,
Dave

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Fri 6/6/2008 6:34 PM
To: dan.magenheimer@oracle.com; Dave Winchell
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.  And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time()
call to the platform time read function, and bypass TSC altogether. This
would be cleaner than having only the vHPET code punch through to the
physical HPET: instead we have the boot-time chosen platform timesource used
by all virtual timers.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...

Possibly there are bugs in the hpet device model which are fixed by Dave's
patch. If this is actually the case, it would be nice to break those out as
separate patches, as I think an 11% drift must largely be due to
device-model bugs rather than relatively insignificant differences between
hvm_get_guest_time() and physical HPET main counter.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 20:48                   ` Dan Magenheimer
@ 2008-06-09 21:18                     ` Keir Fraser
  2008-06-09 21:46                       ` Dan Magenheimer
  0 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-09 21:18 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Dave Winchell; +Cc: xen-devel, Ben Guthro

On 9/6/08 21:48, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> At guest install time you ought to be able to tell whether the guest
>> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
>> and decide whether missed-ticks accounting is required or not.
> 
> Unfortunately this is not true on Linux, at least without gathering
> (and hardcoding) more information about the system.  Whether hpet is
> used or not is dependent not only on the OS/version and hvm config
> parameters, but also on kernel command line parameters and even
> the underlying CPU.  For example, on RHEL5u1, if the tsc is synchronized
> and the CPU is Intel, and no kernel parameters are chosen, tsc will be
> chosen as the default clocksource even if hpet is present.  Ugly.

It's not immediately obvious that adding further independent configuration
knobs to twiddle would make our lives that much easier. However it certainly
increases the test matrix.

In your example above, by synchronised TSC do you mean constant-rate TSC?
That can at least be hidden in CPUID now.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 21:07               ` Dan Magenheimer
@ 2008-06-09 21:44                 ` Dave Winchell
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-09 21:44 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com
  Cc: Dave Winchell, xen-devel, Keir Fraser, Ben Guthro

Dan Magenheimer wrote:

>I'm wondering what is "magic" about 0.03% in all the non-hw-hpet
>measurements.
>
.03% is simply the maximum error we've seen with hpet.
The maximum value (.03%) is the same whether it's simulated or physical.
The best value physical is .001% and I don't remember the best value
simulated, but I believe it is under .01%, perhaps well under. I'll have
to repeat that measurement. I would think that simulated and physical
would give roughly the same drift values, but perhaps at very low drifts
that doesn't hold.

I think the .03% is mostly due to the stability of the physical hpet
device on a platform. I've noticed on some platforms, the simulated hpet
time actually improves if you disable the hpet in the bios so that
stime() is layered on the pm timer or whatever.

I would like to get to the bottom of this hpet stability variance
from platform to platform.

Regards,
Dave

>  Is that just the accuracy of the underlying tsc
>on your test system, e.g. the skew of tsc relative to an external
>(ntp) source?  Or is Xen (tsc-based) system time skewing that much
>on an overcommitted system (and skewing much less than 0.03% on an
>unloaded system)?
>
>Running the following on dom0 both on an unloaded and overcommitted
>system (with ntpd off in dom0 and all guests) might be interesting:
>
># ntpdate $NTPSERVER; sleep 3600; ntpdate -q $NTPSERVER
>
>-----Original Message-----
>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
>Sent: Saturday, June 07, 2008 3:21 PM
>To: Keir Fraser; dan.magenheimer@oracle.com
>Cc: Ben Guthro; xen-devel; Dave Winchell
>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>
>  
>
>>Possibly there are bugs in the hpet device model which are fixed by Dave's
>>patch. If this is actually the case, it would be nice to break those out as
>>separate patches, as I think an 11% drift must largely be due to
>>device-model bugs rather than relatively insignificant differences between
>>hvm_get_guest_time() and physical HPET main counter.
>>    
>>
>
>Hi Keir,
>
>I tried an experiment on Friday where I short circuited the missed ticks policy
>code in the hpet.c patch, but used the physical hpet each access. The result for Linux
>was a drift of .1%, same as the xen-unstable bits.
>
>Conversely I get very good drift numbers, i.e., under .03%, when using the missed ticks
>policy code and  running in simulated mode (layered on stime) when stime uses hpet.
>
>So clearly, the improvement from .1% to .03% is due to the policy code.
>I haven't run the short circuit test with the windows policy but I can do that
>on Monday.
>
>Note: For Windows and Linux I get < .03% drift using the policy code and running
>simulated mode whether stime is using hpet or some other device.
>
>
>regards,
>Dave
>
>
>
>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Fri 6/6/2008 6:34 PM
>To: dan.magenheimer@oracle.com; Dave Winchell
>Cc: Ben Guthro; xen-devel
>Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
>  
>
>>Although hwhpet=1 is a fine alternative in many cases, it may
>>be unavailable on some systems and may cause significant performance
>>issues on others.  So I think we will still need to track down
>>the poor accuracy when hwhpet=0.  And if for some reason
>>Xen system time can't be made accurate enough (< 0.05%), then
>>I think we should consider building Xen system time itself on
>>top of hardware hpet instead of TSC... at least when Xen discovers
>>a capable hpet.
>>    
>>
>
>Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time()
>call to the platform time read function, and bypass TSC altogether. This
>would be cleaner than having only the vHPET code punch through to the
>physical HPET: instead we have the boot-time chosen platform timesource used
>by all virtual timers.
>
>  
>
>>Or maybe there's a computation error somewhere in the hvm hpet
>>scaling code?  Hmmm...
>>    
>>
>
>Possibly there are bugs in the hpet device model which are fixed by Dave's
>patch. If this is actually the case, it would be nice to break those out as
>separate patches, as I think an 11% drift must largely be due to
>device-model bugs rather than relatively insignificant differences between
>hvm_get_guest_time() and physical HPET main counter.
>
> -- Keir
>
>  
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 21:18                     ` Keir Fraser
@ 2008-06-09 21:46                       ` Dan Magenheimer
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-09 21:46 UTC (permalink / raw)
  To: Keir Fraser, Dave Winchell; +Cc: xen-devel, Ben Guthro

> >> At guest install time you ought to be able to tell whether the guest
> >> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> >> and decide whether missed-ticks accounting is required or not.
> >
> > Unfortunately this is not true on Linux, at least without gathering
> > (and hardcoding) more information about the system.  Whether hpet is
> > used or not is dependent not only on the OS/version and hvm config
> > parameters, but also on kernel command line parameters and even
> > the underlying CPU.  For example, on RHEL5u1, if the tsc is synchronized
> > and the CPU is Intel, and no kernel parameters are chosen, tsc will be
> > chosen as the default clocksource even if hpet is present.  Ugly.
>
> It's not immediately obvious that adding further independent configuration
> knobs to twiddle would make our lives that much easier. However it certainly
> increases the test matrix.

I fully agree.  That's why I think the default parameters in Xen
should "do the right thing".  The default will get the most
testing and if users say "time hurts when I change the parameters"
we can say "then don't change the parameters" ;-)

> In your example above, by synchronised TSC do you mean constant-rate TSC?
> That can at least be hidden in CPUID now.

I meant synchronized as defined in 2.6.18/arch/x86_64/kernel/time.c
in the function unsynchronized_tsc() and as used in the same file
in time_init_gtod().  To make this more complicated, these routines
have had not-insignificant bug fixes in RHEL5u1/2.

But yes, it might be a good idea if X86_FEATURE_CONSTANT_TSC
always returns 0 (or at least is configurable and defaults off),
since the test may only be made in the guest at boot time and
the guest may migrate to a machine without the feature.

More ugliness, I know.  My head hurts.

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-08 20:32           ` Dave Winchell
  2008-06-08 21:10             ` Dan Magenheimer
  2008-06-08 21:18             ` Dan Magenheimer
@ 2008-06-09 22:02             ` Dan Magenheimer
  2008-06-09 23:34               ` Dave Winchell
  2 siblings, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-09 22:02 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser; +Cc: xen-devel, Ben Guthro

> The Linux policy is more subtle, but is required to go
> from .1% to .03%.

Thanks for the good documentation which I hadn't thoroughly
read until now.  I now understand that the essence of your
hpet missed ticks policy is to ensure that ticks are never
delivered too close together.  But I'm trying to understand
WHY your patch works, in other words, what problem it is
countering.  I care about this for more reasons than just
because it is interesting: (1) I'd like to feel confident that
it is fixing a bug rather than just a symptom of a bug;
and (2) I wonder how universally it is applicable.

I see from code examination in mark_offset_hpet() in
RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
the correction for lost ticks is just plain wrong in
a virtual environment. (Suppose for example that a virtual
tick was delivered every 1.999*hpet_tick... I think
the clock would be off by 50%!)  Is this the bug that
is being "countered" by your policy?
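
To make the 50% figure concrete, here is a toy model of the arithmetic.
It is not the RHEL4 code: it simply assumes the guest credits the whole
number of tick periods elapsed at each interrupt and drops the fraction,
which is the kind of correction I am worried about. The function name is
made up for the example.

```python
def guest_clock_rate(delivery_interval_ticks, interrupts=1000):
    """Fraction of real time a guest accounts for when each interrupt
    credits floor(elapsed / tick) ticks (simplified, assumed model)."""
    if delivery_interval_ticks < 1:
        raise ValueError("model assumes at least one tick period per interrupt")
    # Whole tick periods credited per interrupt; the remainder is lost,
    # not carried over to the next interrupt.
    credited_per_interrupt = int(delivery_interval_ticks)
    credited = credited_per_interrupt * interrupts
    elapsed = delivery_interval_ticks * interrupts
    return credited / elapsed


for interval in (1.0, 1.5, 1.999, 2.0):
    rate = guest_clock_rate(interval)
    print(f"tick every {interval} periods -> guest sees {rate:.1%} of real time")
```

With delivery every 1.999 periods the guest credits one tick per
interrupt, so it accounts for about half of real time; at exactly 2.0
periods the same rule is correct again.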

However, the lost tick handling in RHEL5u1/kernel/timer.c
(which I think is used also for hpet) is much better
so I am eager to find out if your policy works there
too.

If the hpet missed tick policy works for both, though,
I should be happy, though I wonder about upstream kernels
(e.g. the trend toward tickless).  That said, I'd rather
see this get into Xen 3.3 and worry about upstream kernels
later :-)

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack".  Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%?  Also
> I wonder for the skew you are seeing (in both hvm guests and
> domain0) is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems?  I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...


Regards,
Dave


-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack".  Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.


I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%?  Also
I wonder for the skew you are seeing (in both hvm guests and
domain0) is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others.  So I think we will still need to track down
the poor accuracy when hwhpet=0.  And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.

One more thought... do you know the accuracy of the TSC crystals
on your test systems?  I posted a patch awhile ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code?  Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have discussed
> many times, the ntp requirement is .05%. Tests on the patch we just
> submitted for hpet have indicated errors of .0012% on this platform
> under similar test conditions and .03% on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the xen-unstable
> bits. In an overnight test with our hpet patch, the Windows vista error
> was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please ensure
> > that hpet=1 is set in the hvm config and also I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
> > so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> > all clock ticks must be delivered and so timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com
> >     [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
> >     Sent: Friday, June 06, 2008 4:46 AM
> >     To: Keir Fraser; Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com; Dave Winchell
> >     Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather than host TSC? Dan has reported much better
> >     time-keeping with his patch checked in, and it's for sure a lot
> >     less invasive than this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in terms of drift
> >     > and monotonicity. Prior to this work the drift with hpet was
> >     > greater than 2%, far above the .05% limit for ntp to synchronize.
> >     > With this code, the drift ranges from .001% to .0033% depending
> >     > on guest and physical platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic
> >     > time to their applications. Time sources other than hpet are not
> >     > monotonic because of their reliance on tsc, which is not
> >     > synchronized across physical processors.
> >     >
> >     > Windows 2k8-64 and many Linux guests are supported with two
> >     > policies, one for guests that handle missed clock interrupts and
> >     > the other for guests that require the correct number of
> >     > interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical
> >     > platform has no visible hpet. Migration is supported between
> >     > physical machines which differ in physical hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are
> >     > added to track interrupt progress. The ideas here and the
> >     > facilities would be useful in vpt.c, for other time sources,
> >     > though no attempt is made here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt
> >     > delivery policies, live migration, test results, and relation to
> >     > recent work with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the physical or
> >     > simulated (see discussion below) hpet.  For timekeeping, the
> >     > virtual hpet also depends on two new interrupt notification
> >     > facilities to implement its policies for interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet reads with
> >     > read_64_main_counter(), exported by time.c, either the real
> >     > physical hpet main counter register directly or a "simulated"
> >     > hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time()
> >     > (NOW()), where the last time value is returned whenever the
> >     > current time value is less than the last time value. In simulated
> >     > mode, since it is layered on s_time, the underlying hardware can
> >     > be hpet or some other device. The frequency of the main counter
> >     > in simulated mode is the same as the standard physical hpet
> >     > frequency, allowing live migration between nodes that are
> >     > configured differently.
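
The simulated read described in the paragraph above can be modelled in a
few lines. This is a sketch of the idea only, not the code in the patch:
the class name, the injected read_ns callable standing in for
get_s_time(), and the 14.31818 MHz figure used as the "standard" hpet
frequency are all assumptions made for the example.

```python
class SimulatedMainCounter:
    """Toy model of a monotonic main counter layered on a system-time read."""

    def __init__(self, read_ns, freq_hz=14_318_180):
        self._read_ns = read_ns      # stand-in for get_s_time(), in nanoseconds
        self._freq_hz = freq_hz      # assumed hpet frequency, not from the patch
        self._last_ns = 0

    def read(self):
        now_ns = self._read_ns()
        # Monotonic clamp: never report a time earlier than the last one seen.
        if now_ns < self._last_ns:
            now_ns = self._last_ns
        self._last_ns = now_ns
        # Scale nanoseconds to counter ticks at the fixed frequency.
        return now_ns * self._freq_hz // 1_000_000_000


# System time that briefly steps backwards between the first two reads.
samples = iter([1000, 900, 2000])
counter = SimulatedMainCounter(lambda: next(samples))
print([counter.read() for _ in range(3)])
```

The second read repeats the first value rather than going backwards, and
because the tick rate is fixed rather than taken from the host, the guest
sees the same counter frequency on either side of a migration.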
> >     >
> >     > If the physical platform does not have an hpet device, or if xen
> >     > is configured not to use the device, then the simulated method is
> >     > used. If there is a physical hpet device, and xen has initialized
> >     > it, then either simulated or physical mode can be used. This is
> >     > governed by a boot time option, hpet-avoid. Setting this option
> >     > to 1 gives the simulated mode and 0 the physical mode. The
> >     > default is physical mode.
> >     >
> >     > A disadvantage of the physical mode is that it may take longer to
> >     > read the device than in simulated mode. On some platforms the
> >     > cost is about the same (less than 250 nsec) for physical and
> >     > simulated modes, while on others physical cost is much higher
> >     > than simulated. A disadvantage of the simulated mode is that it
> >     > can return the same value for the counter in consecutive calls.
> >     >
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced, one is
> >     > hvm_isa_irq_assert_cb()
> >     > and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >     the vioapic.
> >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
> >     > vioapic_deliver()
> >     > and this callback is called with a mask of the vcpus
> which will
> >     get the
> >     > interrupt. This callback is made before any vcpus receive an
> >     interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >     for a particular
> >     > vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist()
> >     > and also when the guest finishes handling the interrupt. Here
> >     finished is
> >     > defined
> >     > as the point when the guest re-enables interrupts or
> lowers the
> >     tpr value.
> >     > EOI is not used as the end of interrupt as this is sometimes
> >     returned before
> >     > the interrupt handler has done its work. A flag is
> passed to the
> >     handler
> >     > indicating
> >     > whether this is the injection point (post = 1) or the
> interrupt
> >     finished (post
> >     > = 0) point.
> >     > The need for the finished point callback is discussed in the
> >     missed ticks
> >     > policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     intr_en_notif
> >     > logic
> >     > has a two stage arm, the first at injection
> >     (hvm_intr_en_notif_arm()) and the
> >     > second when
> >     > interrupts are seen to be disabled
> (hvm_intr_en_notif_disarm()).
> >     Once fully
> >     > armed, re-enabling
> >     > interrupts will cause hvm_intr_en_notif_disarm() to
> make the end
> >     of interrupt
> >     > callback. hvm_intr_en_notif_arm() and
> hvm_intr_en_notif_disarm()
> >     are called by
> >     > [vmx,svm]_intr_assist().
> >     >
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved.
> This includes
> >     > vcpu round robin delivery used by Linux and broadcast delivery
> >     used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows
> >     2k8-64 and the
> >     > other
> >     > for Linux. The Linux policy takes advantage of the
> (guest) Linux
> >     missed tick
> >     > and offset
> >     > calculations and does not attempt to deliver the
> right number of
> >     interrupts.
> >     > The Windows policy delivers the correct number of interrupts,
> >     even if
> >     > sometimes much
> >     > closer to each other than the period. The policies are similar
> >     to those in
> >     > vpt.c, though
> >     > there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param
> hypercall with index
> >     > HVM_PARAM_TIMER_MODE.
> >     > Two new values are added,
> HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
> >     two new ones
> >     > are added is that
> >     > in some guests (32bit Linux) a no-missed policy is needed for
> >     clock sources
> >     > other than hpet
> >     > and a missed ticks policy for hpet. It was felt that
> there would
> >     be less
> >     > confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed
> >     ticks and offset
> >     > using the hpet
> >     > main counter. The algorithm works well when the time since the
> >     last interrupt
> >     > is greater than
> >     > or equal to a period and poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock
> interrupts are
> >     delivered to
> >     > the guest at
> >     > a time interval less than a period. A time stamp (hpet main
> >     counter value) is
> >     > recorded (by a
> >     > callback registered with hvm_register_intr_en_notif)
> when Linux
> >     finishes
> >     > handling the clock
> >     > interrupt. Then, ensuing interrupts are delivered to
> the vioapic
> >     only if the
> >     > current main
> >     > counter value is a period greater than when the last interrupt
> >     was handled.
> >     >
> >     > Tests showed a significant improvement in clock drift with end
> >     of interrupt
> >     > time stamps
> >     > versus beginning of interrupt[1]. It is believed that
> the reason
> >     for the
> >     > improvement
> >     > is that the clock interrupt handler goes for a
> spinlock and can
> >     be therefore
> >     > delayed in its
> >     > processing. Furthermore, the main counter is read by the guest
> >     under the lock.
> >     > The net
> >     > effect is that if we time stamp injection, we can get the
> >     difference in time
> >     > between successive interrupt handler lock acquisitions to be
> >     less than the
> >     > period.
> >     >
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k864 keeps very poor time with the missed
> ticks policy.
> >     So the
> >     > no-missed ticks policy
> >     > was developed. In the no-missed ticks policy we deliver the
> >     correct number of
> >     > interrupts,
> >     > even if they are spaced less than a period apart
> (when catching up).
> >     >
> >     > Windows 2k864 uses a broadcast mode in the interrupt routing
> >     such that
> >     > all vcpus get the clock interrupt. The best Windows drift
> >     performance was
> >     > achieved when the
> >     > policy code ensured that all the previous interrupts (on the
> >     various vcpus)
> >     > had been injected
> >     > before injecting the next interrupt to the vioapic..
> >     >
> >     > The policy code works as follows. It uses the
> >     hvm_isa_irq_assert_cb() to
> >     > record
> >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> >     the callback
> >     > registered
> >     > with hvm_register_intr_en_notif() at post=1 time it clears the
> >     current vcpu in
> >     > the pending_mask.
> >     > When the pending_mask is clear it decrements
> >     hpet.intr_pending_nr and if
> >     > intr_pending_nr is still
> >     > non-zero posts another interrupt to the ioapic with
> >     hvm_isa_irq_assert_cb().
> >     > Intr_pending_nr is incremented in
> >     hpet_route_decision_not_missed_ticks().
> >     >
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     pending_mask
> >     > method. So even though
> >     > Linux does not broadcast its interrupts, the code could handle
> >     it if it did.
> >     > In this case the end of interrupt time stamp is made when the
> >     pending_mask is
> >     > clear.
> >     >
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the
> >     guest clock with
> >     > respect
> >     > to ntp. This is accomplished by migrating all of the state in
> >     the h->hpet data
> >     > structure
> >     > in the usual way. The hp->mc_offset is recalculated on the
> >     receiving node so
> >     > that the
> >     > guest sees a continuous hpet main counter.
> >     >
> >     > Code as been added to xc_domain_save.c to send a small message
> >     after the
> >     > domain context is sent. The contents of the message is the
> >     physical tsc
> >     > timestamp, last_tsc,
> >     > read just before the message is sent. When the
> last_tsc message
> >     is received in
> >     > xc_domain_restore.c,
> >     > another physical tsc timestamp, cur_tsc, is read. The two
> >     timestamps are
> >     > loaded into the domain
> >     > structure as last_tsc_sender and first_tsc_receiver with
> >     hypercalls. Then
> >     > xc_domain_hvm_setcontext
> >     > is called so that hpet_load has access to these time stamps.
> >     Hpet_load uses
> >     > the timestamps
> >     > to account for the time spent saving and loading the domain
> >     context. With this
> >     > technique,
> >     > the only neglected time is the time spent sending a small
> >     network message.
> >     >
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> >     >       Duration: 70 hours.
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift : Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> >     >       Duration: 23 hours.
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift : Linux: .033% Windows: .019%
> >     >
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in
> >     xen-unstable and
> >     > read_64_main_counter()
> >     > in this code. However, read_64_main_counter() is more tuned to
> >     the needs of
> >     > hpet.c. It has no
> >     > "set" operation, only the get. It isolates the mode,
> physical or
> >     simulated, in
> >     > read_64_main_counter()
> >     > itself. It uses no vcpu or domain state as it is a physical
> >     entity, in either
> >     > mode. And it provides a real
> >     > physical mode for every read for those applications
> that desire
> >     this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms
> of accuracy and
> >     > monotonicity.
> >     > Tests performed to date verify this and more testing
> is under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason
> >     for accuracy
> >     > variations
> >     > on different platforms using the physical hpet device will be
> >     investigated.
> >     > Additional overhead measurements on simulated vs physical hpet
> >     mode will be
> >     > made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end
> of interrupt
> >     stamping, but
> >     > it was
> >     > significant, perhaps better than two to one improvement. It
> >     would be a very
> >     > simple matter
> >     > to re-measure the improvement as the facility can call back at
> >     injection time
> >     > as well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> >     > <mailto:dwinchell@virtualiron.com>
> >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> >     > <mailto:bguthro@virtualiron.com>
> >     >
> >     >
> >
> >
> >
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 22:02             ` Dan Magenheimer
@ 2008-06-09 23:34               ` Dave Winchell
  2008-06-10  3:21                 ` Dan Magenheimer
  2008-06-10  7:52                 ` Keir Fraser
  0 siblings, 2 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-09 23:34 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser; +Cc: Dave Winchell, xen-devel, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 27888 bytes --]

> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.

> Thanks for the good documentation which I hadn't thoroughly
> read until now.
> I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together.  But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.

I'll tell you what I recall about this. Tomorrow I'll check the
guest code to verify. I think that Linux declares a full tick,
even if the interrupt is early. That's the problem.
On the other hand, if the interrupt is late it in effect declares
a tick plus fraction. If it just declared the fraction in the first place,
we could deliver the interrupts whenever we wanted.

It's really not that different from the missed ticks policy in vpt.c,
except that the period in vpt.c is based on start of interrupt,
and I have improved on that with end-of-interrupt time stamping as
described in the patch note.
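
As a rough illustration of the gating described in the patch note, here is a
minimal sketch in C. It is not the code from the patch: the struct and both
function names are hypothetical, and it only shows the idea that a time stamp
is taken when the guest finishes handling the interrupt and that the next
interrupt is held back until a full period has passed since that stamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for illustration; the patch keeps its own in hpet.c. */
struct tick_gate {
    uint64_t period;     /* timer period, in main counter ticks */
    uint64_t last_end;   /* main counter value at end of last interrupt */
    bool     have_stamp; /* false until the first interrupt has finished */
};

/* Would be called from the "interrupt finished" notification (post = 0). */
static void tick_gate_end_of_interrupt(struct tick_gate *g, uint64_t now)
{
    g->last_end = now;
    g->have_stamp = true;
}

/*
 * Returns true if a new interrupt may be delivered at counter value now.
 * Unsigned subtraction keeps the comparison valid across counter wrap.
 */
static bool tick_gate_may_deliver(const struct tick_gate *g, uint64_t now)
{
    if (!g->have_stamp)
        return true;
    return (now - g->last_end) >= g->period;
}
```

Stamping at injection instead would only require calling the hypothetical
tick_gate_end_of_interrupt() at the post = 1 point, which is why the two
variants are easy to compare.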

I don't recall what prompted me to try end-of-interrupt,
but I saw a significant improvement. I may have been running
a monotonicity test at the same time to explain the lock
contention mentioned in the write-up.

>  I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.

It's worked well with my small set of guests. You and our
QA are going to tell us about the wider set. It doesn't
matter if guest A handles interrupts closely spaced or not,
just whether it handles them far apart. So it should be pretty
universal with guests that really handle missed ticks.
I think it's interesting that some 32bit Linux guests handle
missed ticks for hpet.

> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!)  Is this the bug that
> is being "countered" by your policy?

I haven't looked at that code, so perhaps.
I'll check it tomorrow.

> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better
> so I am eager to find out if your policy works there
> too.
> If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless). 

I wasn't aware of this trend. If it's robust, however, it should
handle late interrupts ...

> That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)

Regards,
Dave



-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Mon 6/9/2008 6:02 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
 
> The Linux policy is more subtle, but is required to go
> from .1% to .03%.

Thanks for the good documentation which I hadn't thoroughly
read until now.  I now understand that the essence of your
hpet missed ticks policy is to ensure that ticks are never
delivered too close together.  But I'm trying to understand
WHY your patch works, in other words, what problem it is
countering.  I care about this for more reasons than just
because it is interesting: (1) I'd like to feel confident that
it is fixing a bug rather than just a symptom of a bug;
and (2) I wonder how universally it is applicable.

I see from code examination in mark_offset_hpet() in
RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
the correction for lost ticks is just plain wrong in
a virtual environment. (Suppose for example that a virtual
tick was delivered every 1.999*hpet_tick... I think
the clock would be off by 50%!)  Is this the bug that
is being "countered" by your policy?
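
The 50% figure can be checked with a small model. The sketch below is a
simplified, hypothetical handler, not the RHEL4 mark_offset_hpet() code: it
assumes the guest credits floor(elapsed / period) ticks on each interrupt and
throws the remainder away. Both function names are made up for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical model: on each interrupt the guest credits
 * interval / period whole ticks (integer division) and discards the
 * fractional part. period must be non-zero.
 */
static uint64_t guest_ticks_credited(uint64_t period, uint64_t interval,
                                     unsigned int interrupts)
{
    uint64_t credited = 0;
    for (unsigned int i = 0; i < interrupts; i++)
        credited += interval / period;   /* remainder is lost each time */
    return credited;
}

/* Guest time as a percentage of real time after the given interrupts. */
static double guest_rate_percent(uint64_t period, uint64_t interval,
                                 unsigned int interrupts)
{
    uint64_t real  = interval * interrupts;
    uint64_t guest = guest_ticks_credited(period, interval, interrupts) * period;
    return 100.0 * (double)guest / (double)real;
}
```

With period = 1000 and interval = 1999 counter ticks, every interrupt credits
a single tick, so under this model the guest clock advances at roughly half
the real rate.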

However, the lost tick handling in RHEL5u1/kernel/timer.c
(which I think is used also for hpet) is much better
so I am eager to find out if your policy works there
too.

If the hpet missed tick policy works for both, though,
I should be happy, though I wonder about upstream kernels
(e.g. the trend toward tickless).  That said, I'd rather
see this get into Xen 3.3 and worry about upstream kernels
later :-)

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack".  Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The Windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%?  Also
> I wonder for the skew you are seeing (in both hvm guests and
> domain0) is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems?  I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...


Regards,
Dave


-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack".  Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.


I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%?  Also
I wonder for the skew you are seeing (in both hvm guests and
domain0) is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others.  So I think we will still need to track down
the poor accuracy when hwhpet=0.  And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.

One more thought... do you know the accuracy of the TSC crystals
on your test systems?  I posted a patch awhile ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code?  Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>
> Dan, Keir:
>
> Preliminary tests results indicate an error of .1% for Linux 64 bit
> guests configured
> for hpet with xen-unstable as is. As we have discussed many times, the
> ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors of
> .0012%
> on this platform under similar test conditions and .03% on
> other platforms.
>
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with
> the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch),
> please ensure
> > that hpet=1 is set in the hvm config and also I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to
> missed ticks
> > so timer_mode should be 2 (vs when pit is the clocksource
> on RHEL4-32,
> > all clock ticks must be delivered and so timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     *From:* xen-devel-bounces@lists.xensource.com
> >     [mailto:xen-devel-bounces@lists.xensource.com]*On
> Behalf Of *Dave
> >     Winchell
> >     *Sent:* Friday, June 06, 2008 4:46 AM
> >     *To:* Keir Fraser; Ben Guthro; xen-devel
> >     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
> >     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com on behalf
> of Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather
> >     than host TSC? Dan has reported much better
> time-keeping with his
> >     patch
> >     checked in, and it's for sure a lot less invasive than this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in
> terms of drift and
> >     > monotonicity.
> >     > Prior to this work the drift with hpet was greater
> than 2%, far
> >     above the .05%
> >     > limit
> >     > for ntp to synchronize. With this code, the drift ranges from
> >     .001% to .0033%
> >     > depending
> >     > on guest and physical platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic
> >     time to their
> >     > applications. Time sources other than hpet are not
> monotonic because
> >     > of their reliance on tsc, which is not synchronized
> across physical
> >     > processors.
> >     >
> >     > Windows 2k864 and many Linux guests are supported with two
> >     policies, one for
> >     > guests
> >     > that handle missed clock interrupts and the other for guests
> >     that require the
> >     > correct number of interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical
> >     platform has no
> >     > visible
> >     > hpet. Migration is supported between physical machines which
> >     differ in
> >     > physical
> >     > hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are
> >     added to track
> >     > interrupt
> >     > progress. The ideas here and the facilities would be useful in
> >     vpt.c, for
> >     > other time
> >     > sources, though no attempt is made here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt
> >     delivery policies,
> >     > live migration,
> >     > test results, and relation to recent work with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the
> physical or
> >     simulated
> >     > (see discussion below) hpet.  For timekeeping, the
> virtual hpet
> >     also depends
> >     > on two new interrupt notification facilities to implement its
> >     policies for
> >     > interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet reads with
> >     read_64_main_counter(),
> >     > exported by
> >     > time.c, either the real physical hpet main counter register
> >     directly or a
> >     > "simulated"
> >     > hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time()
> >     (NOW()), where the
> >     > last
> >     > time value is returned whenever the current time value is less
> >     than the last
> >     > time
> >     > value. In simulated mode, since it is layered on s_time, the
> >     underlying
> >     > hardware
> >     > can be hpet or some other device. The frequency of the main
> >     counter in
> >     > simulated
> >     > mode is the same as the standard physical hpet frequency,
> >     allowing live
> >     > migration
> >     > between nodes that are configured differently.
> >     >
> >     > If the physical platform does not have an hpet
> device, or if xen
> >     is configured
> >     > not
> >     > to use the device, then the simulated method is used. If there
> >     is a physical
> >     > hpet device,
> >     > and xen has initialized it, then either simulated or physical
> >     mode can be
> >     > used.
> >     > This is governed by a boot time option, hpet-avoid.
> Setting this
> >     option to 1
> >     > gives the
> >     > simulated mode and 0 the physical mode. The default
> is physical
> >     mode.
> >     >
> >     > A disadvantage of the physical mode is that may take longer to
> >     read the device
> >     > than in simulated mode. On some platforms the cost is
> about the
> >     same (less
> >     > than 250 nsec) for
> >     > physical and simulated modes, while on others physical cost is
> >     much higher
> >     > than simulated.
> >     > A disadvantage of the simulated mode is that it can return the
> >     same value
> >     > for the counter in consecutive calls.
> >     >
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced, one is
> >     > hvm_isa_irq_assert_cb()
> >     > and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >     the vioapic.
> >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
> >     > vioapic_deliver()
> >     > and this callback is called with a mask of the vcpus
> which will
> >     get the
> >     > interrupt. This callback is made before any vcpus receive an
> >     interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >     for a particular
> >     > vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist()
> >     > and also when the guest finishes handling the interrupt. Here
> >     finished is
> >     > defined
> >     > as the point when the guest re-enables interrupts or
> lowers the
> >     tpr value.
> >     > EOI is not used as the end of interrupt as this is sometimes
> >     returned before
> >     > the interrupt handler has done its work. A flag is
> passed to the
> >     handler
> >     > indicating
> >     > whether this is the injection point (post = 1) or the
> interrupt
> >     finished (post
> >     > = 0) point.
> >     > The need for the finished point callback is discussed in the
> >     missed ticks
> >     > policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     intr_en_notif
> >     > logic
> >     > has a two stage arm, the first at injection
> >     (hvm_intr_en_notif_arm()) and the
> >     > second when
> >     > interrupts are seen to be disabled
> (hvm_intr_en_notif_disarm()).
> >     Once fully
> >     > armed, re-enabling
> >     > interrupts will cause hvm_intr_en_notif_disarm() to
> make the end
> >     of interrupt
> >     > callback. hvm_intr_en_notif_arm() and
> hvm_intr_en_notif_disarm()
> >     are called by
> >     > [vmx,svm]_intr_assist().
> >     >
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved.
> This includes
> >     > vcpu round robin delivery used by Linux and broadcast delivery
> >     used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows
> >     2k8-64 and the
> >     > other
> >     > for Linux. The Linux policy takes advantage of the
> (guest) Linux
> >     missed tick
> >     > and offset
> >     > calculations and does not attempt to deliver the
> right number of
> >     interrupts.
> >     > The Windows policy delivers the correct number of interrupts,
> >     even if
> >     > sometimes much
> >     > closer to each other than the period. The policies are similar
> >     to those in
> >     > vpt.c, though
> >     > there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param
> hypercall with index
> >     > HVM_PARAM_TIMER_MODE.
> >     > Two new values are added,
> HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
> >     two new ones
> >     > are added is that
> >     > in some guests (32bit Linux) a no-missed policy is needed for
> >     clock sources
> >     > other than hpet
> >     > and a missed ticks policy for hpet. It was felt that
> there would
> >     be less
> >     > confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed
> >     ticks and offset
> >     > using the hpet
> >     > main counter. The algorithm works well when the time since the
> >     last interrupt
> >     > is greater than
> >     > or equal to a period and poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock
> interrupts are
> >     delivered to
> >     > the guest at
> >     > a time interval less than a period. A time stamp (hpet main
> >     counter value) is
> >     > recorded (by a
> >     > callback registered with hvm_register_intr_en_notif)
> when Linux
> >     finishes
> >     > handling the clock
> >     > interrupt. Then, ensuing interrupts are delivered to
> the vioapic
> >     only if the
> >     > current main
> >     > counter value is a period greater than when the last interrupt
> >     was handled.
> >     >
> >     > Tests showed a significant improvement in clock drift with end
> >     of interrupt
> >     > time stamps
> >     > versus beginning of interrupt[1]. It is believed that
> the reason
> >     for the
> >     > improvement
> >     > is that the clock interrupt handler goes for a
> spinlock and can
> >     be therefore
> >     > delayed in its
> >     > processing. Furthermore, the main counter is read by the guest
> >     under the lock.
> >     > The net
> >     > effect is that if we time stamp injection, we can get the
> >     difference in time
> >     > between successive interrupt handler lock acquisitions to be
> >     less than the
> >     > period.
> >     >
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k864 keeps very poor time with the missed
> ticks policy.
> >     So the
> >     > no-missed ticks policy
> >     > was developed. In the no-missed ticks policy we deliver the
> >     correct number of
> >     > interrupts,
> >     > even if they are spaced less than a period apart
> (when catching up).
> >     >
> >     > Windows 2k864 uses a broadcast mode in the interrupt routing
> >     such that
> >     > all vcpus get the clock interrupt. The best Windows drift
> >     performance was
> >     > achieved when the
> >     > policy code ensured that all the previous interrupts (on the
> >     various vcpus)
> >     > had been injected
> >     > before injecting the next interrupt to the vioapic.
> >     >
> >     > The policy code works as follows. It uses the
> >     hvm_isa_irq_assert_cb() to
> >     > record
> >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> >     the callback
> >     > registered
> >     > with hvm_register_intr_en_notif() at post=1 time it clears the
> >     current vcpu in
> >     > the pending_mask.
> >     > When the pending_mask is clear it decrements
> >     hpet.intr_pending_nr and if
> >     > intr_pending_nr is still
> >     > non-zero posts another interrupt to the ioapic with
> >     hvm_isa_irq_assert_cb().
> >     > Intr_pending_nr is incremented in
> >     hpet_route_decision_not_missed_ticks().
> >     >
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     pending_mask
> >     > method. So even though
> >     > Linux does not broadcast its interrupts, the code could handle
> >     it if it did.
> >     > In this case the end of interrupt time stamp is made when the
> >     pending_mask is
> >     > clear.
> >     >
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the
> >     guest clock with
> >     > respect
> >     > to ntp. This is accomplished by migrating all of the state in
> >     the h->hpet data
> >     > structure
> >     > in the usual way. The hp->mc_offset is recalculated on the
> >     receiving node so
> >     > that the
> >     > guest sees a continuous hpet main counter.
> >     >
> >     > Code has been added to xc_domain_save.c to send a small message
> >     after the
> >     > domain context is sent. The contents of the message is the
> >     physical tsc
> >     > timestamp, last_tsc,
> >     > read just before the message is sent. When the
> last_tsc message
> >     is received in
> >     > xc_domain_restore.c,
> >     > another physical tsc timestamp, cur_tsc, is read. The two
> >     timestamps are
> >     > loaded into the domain
> >     > structure as last_tsc_sender and first_tsc_receiver with
> >     hypercalls. Then
> >     > xc_domain_hvm_setcontext
> >     > is called so that hpet_load has access to these time stamps.
> >     Hpet_load uses
> >     > the timestamps
> >     > to account for the time spent saving and loading the domain
> >     context. With this
> >     > technique,
> >     > the only neglected time is the time spent sending a small
> >     network message.
> >     >
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> >     >       Duration: 70 hours.
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift : Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> >     >       Duration: 23 hours.
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift : Linux: .033% Windows: .019%
> >     >
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in
> >     xen-unstable and
> >     > read_64_main_counter()
> >     > in this code. However, read_64_main_counter() is more tuned to
> >     the needs of
> >     > hpet.c. It has no
> >     > "set" operation, only the get. It isolates the mode,
> physical or
> >     simulated, in
> >     > read_64_main_counter()
> >     > itself. It uses no vcpu or domain state as it is a physical
> >     entity, in either
> >     > mode. And it provides a real
> >     > physical mode for every read for those applications
> that desire
> >     this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms
> of accuracy and
> >     > monotonicity.
> >     > Tests performed to date verify this and more testing
> is under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason
> >     for accuracy
> >     > variations
> >     > on different platforms using the physical hpet device will be
> >     investigated.
> >     > Additional overhead measurements on simulated vs physical hpet
> >     mode will be
> >     > made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end
> of interrupt
> >     stamping, but
> >     > it was
> >     > significant, perhaps better than two to one improvement. It
> >     would be a very
> >     > simple matter
> >     > to re-measure the improvement as the facility can call back at
> >     injection time
> >     > as well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> >     > <mailto:dwinchell@virtualiron.com>
> >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> >     > <mailto:bguthro@virtualiron.com>
> >     >
> >     >
> >     > _______________________________________________
> >     > Xen-devel mailing list
> >     > Xen-devel@lists.xensource.com
> >     > http://lists.xensource.com/xen-devel
> >
> >
> >
>
>



[-- Attachment #1.2: Type: text/html, Size: 45026 bytes --]


^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 23:34               ` Dave Winchell
@ 2008-06-10  3:21                 ` Dan Magenheimer
  2008-06-11  1:44                   ` Dan Magenheimer
  2008-06-10  7:52                 ` Keir Fraser
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-10  3:21 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser; +Cc: xen-devel, Ben Guthro

> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.

Yes, that makes sense and concurs with what I remember from
the EL4u5-32 code.  If this is true, one would expect the
default "no missed tick" policy to see time moving faster
than an external source -- the first missed tick delivered
after a long sleep would "catch up" and then the remainder
would each add another tick.

> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the first place,
> we could deliver the interrupts whenever we wanted.

My read of the EL4u5-32 code is that the fraction is discarded
and a new tick period commences at "now", so the fractions
eventually accumulate as lost time.

In EL5u1-32 however it looks like the fractions are accounted
for.  Indeed the EL5u1-32 "lost tick handling" code resembles
the Linux/ia64 code which is what I've always assumed was
the "missed tick" model.  In this case, I think no policy
is necessary and the measured skew should be identical to
any physical hpet skew.  I'll have to test this hypothesis though.
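
The two accounting styles discussed above can be made concrete with a
minimal sketch in C. It is a simplified model, not the EL4u5-32 or
EL5u1-32 kernel code; the function names, the rule "credit at least one
tick per interrupt", and the counter units are assumptions made purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model 1 (assumed): every interrupt credits at least one full tick and
 * the next period starts at "now", so any fraction is thrown away.
 */
uint64_t ticks_discard_fraction(const uint64_t *irq_time, size_t n,
                                uint64_t tick)
{
    uint64_t last = 0, credited = 0;

    assert(tick > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t whole = (irq_time[i] - last) / tick;

        credited += whole ? whole : 1;  /* full tick even if early */
        last = irq_time[i];             /* fraction is discarded */
    }
    return credited;
}

/*
 * Model 2 (assumed): only whole elapsed ticks are credited and the
 * remainder is carried forward, so nothing is lost or double counted.
 */
uint64_t ticks_carry_fraction(const uint64_t *irq_time, size_t n,
                              uint64_t tick)
{
    uint64_t last = 0, credited = 0;

    assert(tick > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t whole = (irq_time[i] - last) / tick;

        credited += whole;
        last += whole * tick;           /* keep the fraction for later */
    }
    return credited;
}
```

With a tick of 1000 counts and interrupts at 1500, 3000, 4500 and 6000,
model 1 credits 4 ticks for 6 ticks of real time (time is lost), while
model 2 credits 6. With early interrupts at 500, 1000, 1500 and 2000,
model 1 credits 4 ticks for 2 ticks of real time (time runs fast),
while model 2 credits 2.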

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of Dave Winchell
Sent: Monday, June 09, 2008 5:35 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Dave Winchell; xen-devel; Ben Guthro
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.

> Thanks for the good documentation which I hadn't thoroughly
> read until now.
> I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together.  But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.

I'll tell you what I recall about this. Tomorrow I'll check the
guest code to verify. I think that Linux declares a full tick,
even if the interrupt is early. That's the problem.
On the other hand, if the interrupt is late it in effect declares
a tick plus fraction. If it just declared the fraction in the first place,
we could deliver the interrupts whenever we wanted.

It's really not that different from the missed ticks policy in vpt.c,
except that the period in vpt.c is measured from the start of the
interrupt, and I have improved on that by measuring from the end of
the interrupt, as described in the patch note.
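
A minimal sketch of that gating rule follows. The names
(struct vhpet_gate, gate_on_intr_done, gate_may_deliver) are
hypothetical and are not the identifiers used in the patch; the sketch
only shows the comparison against an end-of-interrupt time stamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state; not the structure used in hpet.c. */
struct vhpet_gate {
    uint64_t period;        /* timer period in main counter units */
    uint64_t last_done_mc;  /* main counter when the guest finished
                               handling the previous interrupt */
};

/* Called from the "interrupt finished" notification. */
void gate_on_intr_done(struct vhpet_gate *g, uint64_t now_mc)
{
    g->last_done_mc = now_mc;
}

/*
 * Deliver the next tick only if at least one full period has elapsed
 * since the previous interrupt was handled. Unsigned subtraction keeps
 * the comparison correct across a main counter wrap.
 */
bool gate_may_deliver(const struct vhpet_gate *g, uint64_t now_mc)
{
    assert(g->period > 0);
    return now_mc - g->last_done_mc >= g->period;
}
```

Stamping at the start of the interrupt instead would only change where
gate_on_intr_done() is called from; the comparison stays the same.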

I don't recall what prompted me to try end-of-interrupt,
but I saw a significant improvement. I may have been running
a monotonicity test at the same time, which would explain the lock
contention mentioned in the write-up.

>  I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.

It's worked well with my small set of guests. You and our
QA are going to tell us about the wider set. It doesn't
matter whether a guest handles closely spaced interrupts or not,
just whether it handles interrupts that are far apart. So it should be
pretty universal with guests that really handle missed ticks.
I think it's interesting that some 32bit Linux guests handle
missed ticks for hpet.

> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!)  Is this the bug that
> is being "countered" by your policy?

Perhaps. I haven't looked at that code;
I'll check it tomorrow.

> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better
> so I am eager to find out if your policy works there
> too.
> If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless).

I wasn't aware of this trend. If it's robust, however, it should
handle late interrupts ...

> That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)

Regards,
Dave



-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Mon 6/9/2008 6:02 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> The Linux policy is more subtle, but is required to go
> from .1% to .03%.

Thanks for the good documentation which I hadn't thoroughly
read until now.  I now understand that the essence of your
hpet missed ticks policy is to ensure that ticks are never
delivered too close together.  But I'm trying to understand
WHY your patch works, in other words, what problem it is
countering.  I care about this for more reasons than just
because it is interesting: (1) I'd like to feel confident that
it is fixing a bug rather than just a symptom of a bug;
and (2) I wonder how universally it is applicable.

I see from code examination in mark_offset_hpet() in
RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
the correction for lost ticks is just plain wrong in
a virtual environment. (Suppose for example that a virtual
tick was delivered every 1.999*hpet_tick... I think
the clock would be off by 50%!)  Is this the bug that
is being "countered" by your policy?
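
The 50% figure can be checked with a little arithmetic. The sketch
below assumes a handler that credits one tick per interrupt and adds a
lost-tick correction only once two or more whole periods have elapsed.
That rule is an illustrative paraphrase, not the actual
mark_offset_hpet() code, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks credited for one interrupt arriving `delta` counts after the
 * previous one (assumed rule: correct only when delta >= 2 * tick). */
uint64_t credited_ticks(uint64_t delta, uint64_t tick)
{
    assert(tick > 0);
    if (delta >= 2 * tick)
        return delta / tick;    /* 1 tick + (delta / tick - 1) lost ticks */
    return 1;
}

/* Guest clock shortfall in parts per thousand when every interrupt
 * arrives exactly `interval` counts after the previous one. */
uint64_t shortfall_permille(uint64_t interval, uint64_t tick)
{
    uint64_t guest = credited_ticks(interval, tick) * tick;

    assert(interval >= guest);  /* model covers late delivery only */
    return (interval - guest) * 1000 / interval;
}
```

With tick = 1000 and interval = 1999 the guest credits 1000 counts for
every 1999 that elapse, a shortfall of 999/1999, or just under 50%.
At interval = 2000 the assumed correction kicks in and the shortfall
drops to zero.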

However, the lost tick handling in RHEL5u1/kernel/timer.c
(which I think is used also for hpet) is much better
so I am eager to find out if your policy works there
too.

If the hpet missed tick policy works for both, though,
I should be happy, though I wonder about upstream kernels
(e.g. the trend toward tickless).  That said, I'd rather
see this get into Xen 3.3 and worry about upstream kernels
later :-)

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack".  Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think Xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The Windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%?  Also
> I wonder, for the skew you are seeing (in both hvm guests and
> domain0), is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems?  I posted a patch a while ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...


Regards,
Dave


-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack".  Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.
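
For reference, the percentages and PPM figures used in this thread
convert as in the sketch below. The helper names are hypothetical; the
500 PPM threshold is simply the .05% ntp limit quoted earlier in the
thread.

```c
#include <assert.h>
#include <stdint.h>

/* Drift of a guest clock against a reference, in parts per million.
 * Positive means the guest is slow. Both arguments use the same unit. */
int64_t drift_ppm(int64_t ref_elapsed, int64_t guest_elapsed)
{
    assert(ref_elapsed > 0);
    return (ref_elapsed - guest_elapsed) * 1000000 / ref_elapsed;
}

/* Milliseconds gained or lost per day at a given drift. */
int64_t drift_ms_per_day(int64_t ppm)
{
    return ppm * 86400 / 1000;  /* 86400 s/day * ppm / 1e6, in ms */
}

/* The thread quotes .05% (500 PPM) as the limit for ntp to synchronize. */
int within_ntp_limit(int64_t ppm)
{
    return ppm <= 500 && ppm >= -500;
}
```

An error of .1% is 1000 PPM, or 86.4 seconds per day, which is outside
the limit; the .0012% reported with the patch is 12 PPM, roughly one
second per day.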


I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%?  Also
I wonder, for the skew you are seeing (in both hvm guests and
domain0), is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others.  So I think we will still need to track down
the poor accuracy when hwhpet=0.  And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.

One more thought... do you know the accuracy of the TSC crystals
on your test systems?  I posted a patch a while ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code?  Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors of
> .0012% on this platform under similar test conditions and .03% on
> other platforms.
>
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please ensure
> > that hpet=1 is set in the hvm config and also I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
> > so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> > all clock ticks must be delivered and so timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com
> >     [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> >     Dave Winchell
> >     Sent: Friday, June 06, 2008 4:46 AM
> >     To: Keir Fraser; Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com; Dave Winchell
> >     Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@lists.xensource.com on behalf of
> >     Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@oracle.com
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather than host TSC? Dan has reported much better
> >     time-keeping with his patch checked in, and it's for sure a lot
> >     less invasive than this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in
> terms of drift and
> >     > monotonicity.
> >     > Prior to this work the drift with hpet was greater
> than 2%, far
> >     above the .05%
> >     > limit
> >     > for ntp to synchronize. With this code, the drift ranges from
> >     .001% to .0033%
> >     > depending
> >     > on guest and physical platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic
> >     time to their
> >     > applications. Time sources other than hpet are not
> monotonic because
> >     > of their reliance on tsc, which is not synchronized
> across physical
> >     > processors.
> >     >
> >     > Windows 2k864 and many Linux guests are supported with two
> >     policies, one for
> >     > guests
> >     > that handle missed clock interrupts and the other for guests
> >     that require the
> >     > correct number of interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical
> >     platform has no
> >     > visible
> >     > hpet. Migration is supported between physical machines which
> >     differ in
> >     > physical
> >     > hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are
> >     added to track
> >     > interrupt
> >     > progress. The ideas here and the facilities would be useful in
> >     vpt.c, for
> >     > other time
> >     > sources, though no attempt is made here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt
> >     delivery policies,
> >     > live migration,
> >     > test results, and relation to recent work with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the
> physical or
> >     simulated
> >     > (see discussion below) hpet.  For timekeeping, the
> virtual hpet
> >     also depends
> >     > on two new interrupt notification facilities to implement its
> >     policies for
> >     > interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet reads with
> >     read_64_main_counter(),
> >     > exported by
> >     > time.c, either the real physical hpet main counter register
> >     directly or a
> >     > "simulated"
> >     > hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time()
> >     (NOW()), where the
> >     > last
> >     > time value is returned whenever the current time value is less
> >     than the last
> >     > time
> >     > value. In simulated mode, since it is layered on s_time, the
> >     underlying
> >     > hardware
> >     > can be hpet or some other device. The frequency of the main
> >     counter in
> >     > simulated
> >     > mode is the same as the standard physical hpet frequency,
> >     allowing live
> >     > migration
> >     > between nodes that are configured differently.
> >     >
> >     > If the physical platform does not have an hpet
> device, or if xen
> >     is configured
> >     > not
> >     > to use the device, then the simulated method is used. If there
> >     is a physical
> >     > hpet device,
> >     > and xen has initialized it, then either simulated or physical
> >     mode can be
> >     > used.
> >     > This is governed by a boot time option, hpet-avoid.
> Setting this
> >     option to 1
> >     > gives the
> >     > simulated mode and 0 the physical mode. The default
> is physical
> >     mode.
> >     >
> >     > A disadvantage of the physical mode is that it may take longer to
> >     read the device
> >     > than in simulated mode. On some platforms the cost is
> about the
> >     same (less
> >     > than 250 nsec) for
> >     > physical and simulated modes, while on others physical cost is
> >     much higher
> >     > than simulated.
> >     > A disadvantage of the simulated mode is that it can return the
> >     same value
> >     > for the counter in consecutive calls.
> >     >
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced, one is
> >     > hvm_isa_irq_assert_cb()
> >     > and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >     the vioapic.
> >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
> >     > vioapic_deliver()
> >     > and this callback is called with a mask of the vcpus
> which will
> >     get the
> >     > interrupt. This callback is made before any vcpus receive an
> >     interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >     for a particular
> >     > vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist()
> >     > and also when the guest finishes handling the interrupt. Here
> >     finished is
> >     > defined
> >     > as the point when the guest re-enables interrupts or
> lowers the
> >     tpr value.
> >     > EOI is not used as the end of interrupt as this is sometimes
> >     returned before
> >     > the interrupt handler has done its work. A flag is
> passed to the
> >     handler
> >     > indicating
> >     > whether this is the injection point (post = 1) or the
> interrupt
> >     finished (post
> >     > = 0) point.
> >     > The need for the finished point callback is discussed in the
> >     missed ticks
> >     > policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     intr_en_notif
> >     > logic
> >     > has a two stage arm, the first at injection
> >     (hvm_intr_en_notif_arm()) and the
> >     > second when
> >     > interrupts are seen to be disabled
> (hvm_intr_en_notif_disarm()).
> >     Once fully
> >     > armed, re-enabling
> >     > interrupts will cause hvm_intr_en_notif_disarm() to
> make the end
> >     of interrupt
> >     > callback. hvm_intr_en_notif_arm() and
> hvm_intr_en_notif_disarm()
> >     are called by
> >     > [vmx,svm]_intr_assist().
> >     >
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved.
> This includes
> >     > vcpu round robin delivery used by Linux and broadcast delivery
> >     used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows
> >     2k8-64 and the
> >     > other
> >     > for Linux. The Linux policy takes advantage of the
> (guest) Linux
> >     missed tick
> >     > and offset
> >     > calculations and does not attempt to deliver the
> right number of
> >     interrupts.
> >     > The Windows policy delivers the correct number of interrupts,
> >     even if
> >     > sometimes much
> >     > closer to each other than the period. The policies are similar
> >     to those in
> >     > vpt.c, though
> >     > there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param
> hypercall with index
> >     > HVM_PARAM_TIMER_MODE.
> >     > Two new values are added,
> HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
> >     two new ones
> >     > are added is that
> >     > in some guests (32bit Linux) a no-missed policy is needed for
> >     clock sources
> >     > other than hpet
> >     > and a missed ticks policy for hpet. It was felt that
> there would
> >     be less
> >     > confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed
> >     ticks and offset
> >     > using the hpet
> >     > main counter. The algorithm works well when the time since the
> >     last interrupt
> >     > is greater than
> >     > or equal to a period and poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock
> interrupts are
> >     delivered to
> >     > the guest at
> >     > a time interval less than a period. A time stamp (hpet main
> >     counter value) is
> >     > recorded (by a
> >     > callback registered with hvm_register_intr_en_notif)
> when Linux
> >     finishes
> >     > handling the clock
> >     > interrupt. Then, ensuing interrupts are delivered to
> the vioapic
> >     only if the
> >     > current main
> >     > counter value is a period greater than when the last interrupt
> >     was handled.
> >     >
> >     > Tests showed a significant improvement in clock drift with end
> >     of interrupt
> >     > time stamps
> >     > versus beginning of interrupt[1]. It is believed that
> the reason
> >     for the
> >     > improvement
> >     > is that the clock interrupt handler goes for a
> spinlock and can
> >     be therefore
> >     > delayed in its
> >     > processing. Furthermore, the main counter is read by the guest
> >     under the lock.
> >     > The net
> >     > effect is that if we time stamp injection, we can get the
> >     difference in time
> >     > between successive interrupt handler lock acquisitions to be
> >     less than the
> >     > period.
> >     >
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k864 keeps very poor time with the missed
> ticks policy.
> >     So the
> >     > no-missed ticks policy
> >     > was developed. In the no-missed ticks policy we deliver the
> >     correct number of
> >     > interrupts,
> >     > even if they are spaced less than a period apart
> (when catching up).
> >     >
> >     > Windows 2k864 uses a broadcast mode in the interrupt routing
> >     such that
> >     > all vcpus get the clock interrupt. The best Windows drift
> >     performance was
> >     > achieved when the
> >     > policy code ensured that all the previous interrupts (on the
> >     various vcpus)
> >     > had been injected
> >     > before injecting the next interrupt to the vioapic.
> >     >
> >     > The policy code works as follows. It uses the
> >     hvm_isa_irq_assert_cb() to
> >     > record
> >     > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> >     the callback
> >     > registered
> >     > with hvm_register_intr_en_notif() at post=1 time it clears the
> >     current vcpu in
> >     > the pending_mask.
> >     > When the pending_mask is clear it decrements
> >     hpet.intr_pending_nr and if
> >     > intr_pending_nr is still
> >     > non-zero posts another interrupt to the ioapic with
> >     hvm_isa_irq_assert_cb().
> >     > Intr_pending_nr is incremented in
> >     hpet_route_decision_not_missed_ticks().
> >     >
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     > pending_mask method. So even though Linux does not broadcast its
> >     > interrupts, the code could handle it if it did. In this case the
> >     > end of interrupt time stamp is made when the pending_mask is
> >     > clear.
> >     >
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the
> >     > guest clock with respect to ntp. This is accomplished by
> >     > migrating all of the state in the h->hpet data structure in the
> >     > usual way. The hp->mc_offset is recalculated on the receiving
> >     > node so that the guest sees a continuous hpet main counter.
> >     >
> >     > Code has been added to xc_domain_save.c to send a small message
> >     > after the domain context is sent. The content of the message is
> >     > the physical tsc timestamp, last_tsc, read just before the
> >     > message is sent. When the last_tsc message is received in
> >     > xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is
> >     > read. The two timestamps are loaded into the domain structure as
> >     > last_tsc_sender and first_tsc_receiver with hypercalls. Then
> >     > xc_domain_hvm_setcontext is called so that hpet_load has access
> >     > to these time stamps. Hpet_load uses the timestamps to account
> >     > for the time spent saving and loading the domain context. With
> >     > this technique, the only neglected time is the time spent sending
> >     > a small network message.
> >     >
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> >     >       Duration: 70 hours.
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift : Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> >     >       Duration: 23 hours.
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift : Linux: .033% Windows: .019%
> >     >
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in
> >     > xen-unstable and read_64_main_counter() in this code. However,
> >     > read_64_main_counter() is more tuned to the needs of hpet.c. It
> >     > has no "set" operation, only the get. It isolates the mode,
> >     > physical or simulated, in read_64_main_counter() itself. It uses
> >     > no vcpu or domain state as it is a physical entity, in either
> >     > mode. And it provides a real physical mode for every read for
> >     > those applications that desire this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms of accuracy
> >     > and monotonicity. Tests performed to date verify this and more
> >     > testing is under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason for
> >     > accuracy variations on different platforms using the physical
> >     > hpet device will be investigated. Additional overhead
> >     > measurements on simulated vs physical hpet mode will be made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end of interrupt
> >     > stamping, but it was significant, perhaps better than two to one
> >     > improvement. It would be a very simple matter to re-measure the
> >     > improvement as the facility can call back at injection time as
> >     > well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> >     >
> >     >
> >     > _______________________________________________
> >     > Xen-devel mailing list
> >     > Xen-devel@lists.xensource.com
> >     > http://lists.xensource.com/xen-devel
> >
> >
> >
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-09 23:34               ` Dave Winchell
  2008-06-10  3:21                 ` Dan Magenheimer
@ 2008-06-10  7:52                 ` Keir Fraser
  2008-06-10 11:49                   ` Dave Winchell
  1 sibling, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-10  7:52 UTC (permalink / raw)
  To: Dave Winchell, dan.magenheimer; +Cc: xen-devel, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 1333 bytes --]

On 10/6/08 00:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.

Doesn't this policy guarantee that you actually deliver interrupts at
consistently too low a rate? Since the delivery period is now timer period +
latency of interrupt handling? I suppose it works for this guest type
because it doesn't actually care about getting interrupts at the correct
rate, so long as the ticks are always a bit late?

For those that do need missed ticks to be delivered, do you track missed
ticks at the absolute correct rate?

This is perhaps a fine tradeoff for all platform timers -- those guests that
can handle missed ticks obviously do not care about getting their timer
interrupts at absolutely the correct rate, and delivering a little late is
what they are geared to handle (getting delivered consistently early is just
weird!). Whereas guests that need all ticks also want them (at least over
the long run) at exactly the correct rate.

I think there's good empirical analysis in the work you've done. We just
need the patches cleaned up and generalised for vpt.c now.

 -- Keir

[-- Attachment #1.2: Type: text/html, Size: 1837 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-10  7:52                 ` Keir Fraser
@ 2008-06-10 11:49                   ` Dave Winchell
  2008-06-10 12:34                     ` Dave Winchell
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-10 11:49 UTC (permalink / raw)
  To: Keir Fraser, dan.magenheimer; +Cc: Dave Winchell, xen-devel, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 3250 bytes --]

> > I don't recall what prompted me to try end-of-interrupt,
> > but I saw a significant improvement. I may have been running
> > a monotonicity test at the same time to explain the lock
> > contention mentioned in the write-up.

> Doesn't this policy guarantee that you actually deliver interrupts at
> consistently too low a rate? Since the delivery period is now timer period +
> latency of interrupt handling?

Yes, the rate ends up being about half the normal rate because
I toss the interrupt if it doesn't meet the requirement. If, in testing, we find
a guest that has a problem with half the normal rate, we can fine tune
the policy. For example, instead of discarding set a small timer.
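
A minimal sketch of why discarding gives about half the normal rate. It is a toy model, not the hpet.c code: it assumes a fixed handler latency, and the name delivered_with_discard and its parameters are made up for illustration. An expiry is delivered only if a full period has passed since the previous end-of-interrupt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not taken from hpet.c).
 * The timer expires every `period` ns. An expiry is delivered only if at
 * least one full period has passed since the previous end-of-interrupt;
 * otherwise it is discarded. `latency` is the assumed time from injection
 * to end-of-interrupt. Returns the number of delivered interrupts.
 */
unsigned int delivered_with_discard(uint64_t period, uint64_t latency,
                                    unsigned int expiries)
{
    unsigned int delivered = 0;
    int have_eoi = 0;
    uint64_t last_eoi = 0;

    assert(period > 0);
    for (unsigned int i = 0; i < expiries; i++) {
        uint64_t now = (uint64_t)i * period;

        /* Too soon after the last end-of-interrupt: toss this expiry. */
        if (have_eoi && now < last_eoi + period)
            continue;

        delivered++;
        last_eoi = now + latency;   /* handler finishes `latency` later */
        have_eoi = 1;
    }
    return delivered;
}
```

With any non-zero latency shorter than the period, every second expiry falls inside the window and is dropped; with zero latency nothing is dropped.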

> I suppose it works for this guest type
> because it doesn't actually care about getting interrupts at the correct
> rate, so long as the ticks are always a bit late?

True.

> For those that do need missed ticks to be delivered, do you track missed
> ticks at the absolute correct rate?

Yes.

> 
> This is perhaps a fine tradeoff for all platform timers -- those guests that
> can handle missed ticks obviously do not care about getting their timer
> interrupts at absolutely the correct rate, and delivering a little late is
> what they are geared to handle (getting delivered consistently early is just
> weird!). Whereas guests that need all ticks also want them (at least over
> the long run) at exactly the correct rate.

> I think there's good empirical analysis in the work you've done. We just
> need the patches cleaned up and generalised for vpt.c now.

Thanks. I'll get to work on the generalization, etc.

-Dave

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Tue 6/10/2008 3:52 AM
To: Dave Winchell; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 10/6/08 00:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.

Doesn't this policy guarantee that you actually deliver interrupts at
consistently too low a rate? Since the delivery period is now timer period +
latency of interrupt handling? I suppose it works for this guest type
because it doesn't actually care about getting interrupts at the correct
rate, so long as the ticks are always a bit late?

For those that do need missed ticks to be delivered, do you track missed
ticks at the absolute correct rate?

This is perhaps a fine tradeoff for all platform timers -- those guests that
can handle missed ticks obviously do not care about getting their timer
interrupts at absolutely the correct rate, and delivering a little late is
what they are geared to handle (getting delivered consistently early is just
weird!). Whereas guests that need all ticks also want them (at least over
the long run) at exactly the correct rate.

I think there's good empirical analysis in the work you've done. We just
need the patches cleaned up and generalised for vpt.c now.

 -- Keir

[-- Attachment #1.2: Type: text/html, Size: 4119 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-10 11:49                   ` Dave Winchell
@ 2008-06-10 12:34                     ` Dave Winchell
  2008-06-10 12:42                       ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-10 12:34 UTC (permalink / raw)
  To: Keir Fraser, dan.magenheimer; +Cc: Dave Winchell, xen-devel, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 4228 bytes --]

> > Doesn't this policy guarantee that you actually deliver interrupts at
> > consistently too low a rate? Since the delivery period is now timer period +
> > latency of interrupt handling?
> 
> Yes, the rate ends up being about half the normal rate because
> I toss the interrupt if it doesn't meet the requirement. If, in testing, we find
> a guest that has a problem with half the normal rate, we can fine tune
> the policy. For example, instead of discarding set a small timer.

A better solution is to set the new period timer at end-of-interrupt time
(in the callback) instead of delivery time (timer expiration time).
This way the rate will be very close to what the guest expects.
I think I'll make this change.
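
A minimal sketch of the re-arm-at-end-of-interrupt idea, again as a toy model rather than the hpet.c code. It assumes a fixed latency between injection and the end-of-interrupt callback; delivered_with_eoi_rearm is a hypothetical name. Each interval becomes period + latency, so the rate is only slightly below nominal instead of being halved.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not taken from hpet.c).
 * Each interrupt is injected when the timer expires, and the next expiry
 * is armed one `period` after the end-of-interrupt callback, which is
 * assumed to run `latency` ns after injection.
 * Returns how many interrupts are injected at times t < duration.
 */
uint64_t delivered_with_eoi_rearm(uint64_t period, uint64_t latency,
                                  uint64_t duration)
{
    uint64_t delivered = 0;
    uint64_t expiry = 0;

    assert(period > 0);
    while (expiry < duration) {
        delivered++;
        /* End-of-interrupt time plus one full period. */
        expiry = expiry + latency + period;
    }
    return delivered;
}
```

For a 1 ms period and 10 us latency over one second, the model injects 991 interrupts, about 1% below the nominal 1000.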

-Dave

-----Original Message-----
From: Dave Winchell
Sent: Tue 6/10/2008 7:49 AM
To: Keir Fraser; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> > I don't recall what prompted me to try end-of-interrupt,
> > but I saw a significant improvement. I may have been running
> > a monotonicity test at the same time to explain the lock
> > contention mentioned in the write-up.

> Doesn't this policy guarantee that you actually deliver interrupts at
> consistently too low a rate? Since the delivery period is now timer period +
> latency of interrupt handling?

Yes, the rate ends up being about half the normal rate because
I toss the interrupt if it doesn't meet the requirement. If, in testing, we find
a guest that has a problem with half the normal rate, we can fine tune
the policy. For example, instead of discarding set a small timer.

> I suppose it works for this guest type
> because it doesn't actually care about getting interrupts at the correct
> rate, so long as the ticks are always a bit late?

True.

> For those that do need missed ticks to be delivered, do you track missed
> ticks at the absolute correct rate?

Yes.

> 
> This is perhaps a fine tradeoff for all platform timers -- those guests that
> can handle missed ticks obviously do not care about getting their timer
> interrupts at absolutely the correct rate, and delivering a little late is
> what they are geared to handle (getting delivered consistently early is just
> weird!). Whereas guests that need all ticks also want them (at least over
> the long run) at exactly the correct rate.

> I think there's good empirical analysis in the work you've done. We just
> need the patches cleaned up and generalised for vpt.c now.

Thanks. I'll get to work on the generalization, etc.

-Dave

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Tue 6/10/2008 3:52 AM
To: Dave Winchell; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 10/6/08 00:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.

Doesn't this policy guarantee that you actually deliver interrupts at
consistently too low a rate? Since the delivery period is now timer period +
latency of interrupt handling? I suppose it works for this guest type
because it doesn't actually care about getting interrupts at the correct
rate, so long as the ticks are always a bit late?

For those that do need missed ticks to be delivered, do you track missed
ticks at the absolute correct rate?

This is perhaps a fine tradeoff for all platform timers -- those guests that
can handle missed ticks obviously do not care about getting their timer
interrupts at absolutely the correct rate, and delivering a little late is
what they are geared to handle (getting delivered consistently early is just
weird!). Whereas guests that need all ticks also want them (at least over
the long run) at exactly the correct rate.

I think there's good empirical analysis in the work you've done. We just
need the patches cleaned up and generalised for vpt.c now.

 -- Keir

[-- Attachment #1.2: Type: text/html, Size: 5232 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-10 12:34                     ` Dave Winchell
@ 2008-06-10 12:42                       ` Keir Fraser
  2008-06-10 17:13                         ` Dave Winchell
  0 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-10 12:42 UTC (permalink / raw)
  To: Dave Winchell, dan.magenheimer; +Cc: xen-devel, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 719 bytes --]




On 10/6/08 13:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

>> > Yes, the rate ends up being about half the normal rate because
>> > I toss the interrupt if it doesn't meet the requirement. If, in testing,
>> > we find a guest that has a problem with half the normal rate, we can
>> > fine tune the policy. For example, instead of discarding set a small
>> > timer.
> 
> A better solution is to set the new period timer at end-of-interrupt time
> (in the callback) instead of delivery time (timer expiration time).
> This way the rate will be very close to what the guest expects.
> I think I'll make this change.

Yes, that's what I thought you'd done. It sounds nicer to me.

 -- Keir


[-- Attachment #1.2: Type: text/html, Size: 1206 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-10 12:42                       ` Keir Fraser
@ 2008-06-10 17:13                         ` Dave Winchell
  2008-06-11  8:30                           ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-10 17:13 UTC (permalink / raw)
  To: Keir Fraser; +Cc: dan.magenheimer, xen-devel, Ben Guthro

[-- Attachment #1: Type: text/plain, Size: 141 bytes --]

Keir, Dan:

Although I plan to break up the patch, etc., I'm posting
this fix to the patch for anyone who might be interested.

thanks,
Dave

[-- Attachment #2: p.6.10.time.c --]
[-- Type: text/x-csrc, Size: 1209 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com 
#   vi-patch: xen-hpet
#   
#   Bug Id: 6057 
#   
#   Reviewed by: Robert
#   
#   SUMMARY: Fix wrap issue in monotonic s_time().
# 
# xen/arch/x86/time.c
#   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com +3 -2
#   Fix wrap issue in monotonic s_time().
# 
diff -Nru a/xen/arch/x86/time.c b/xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	2008-06-10 13:08:39 -04:00
+++ b/xen/arch/x86/time.c	2008-06-10 13:08:39 -04:00
@@ -534,7 +534,7 @@
     u64 count;
     unsigned long flags;
     struct cpu_time *t = &this_cpu(cpu_time);
-    u64 tsc, delta;
+    u64 tsc, delta, diff;
     s_time_t now;
 
     if(hpet_main_counter_phys_avoid_hdw || !hpet_physical_inited) {
@@ -542,7 +542,8 @@
         rdtscll(tsc);
         delta = tsc - t->local_tsc_stamp;
         now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
-        if(now > get_s_time_mon.last_ret)
+        diff = (u64)now - (u64)get_s_time_mon.last_ret;
+        if((s64)diff > (s64)0)
             get_s_time_mon.last_ret = now;
         else
             now = get_s_time_mon.last_ret;
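
A minimal standalone sketch of the wrap-aware comparison idiom the diff above uses. This is not the Xen function: monotonic_clamp is a hypothetical helper, and plain int64_t and uint64_t stand in for s_time_t and u64.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the Xen code. Returns `now` if it is ahead of
 * *last_ret in modular (wrap-aware) terms, otherwise returns the value
 * returned previously, so the result never steps backwards.
 */
int64_t monotonic_clamp(int64_t now, int64_t *last_ret)
{
    /* Unsigned subtraction is well defined modulo 2^64. */
    uint64_t diff = (uint64_t)now - (uint64_t)*last_ret;

    /*
     * Reading the difference as signed treats any step smaller than 2^63
     * as forward progress, even if `now` has wrapped past INT64_MAX.
     * The unsigned-to-signed cast is implementation-defined before C23;
     * GCC and Clang reduce the value modulo 2^64.
     */
    if ((int64_t)diff > 0)
        *last_ret = now;
    return *last_ret;
}
```

A plain `now > last_ret` test would reject a value that has just wrapped, which would freeze the returned time at the pre-wrap value.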

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-10  3:21                 ` Dan Magenheimer
@ 2008-06-11  1:44                   ` Dan Magenheimer
  2008-06-11 13:58                     ` Dave Winchell
  2008-06-12 22:51                     ` Dan Magenheimer
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-11  1:44 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Dave Winchell, Keir Fraser
  Cc: xen-devel, Ben Guthro

> In EL5u1-32 however it looks like the fractions are accounted
> for.  Indeed the EL5u1-32 "lost tick handling" code resembles
> the Linux/ia64 code which is what I've always assumed was
> the "missed tick" model.  In this case, I think no policy
> is necessary and the measured skew should be identical to
> any physical hpet skew.  I'll have to test this hypothesis though.

I've tested this hypothesis and it seems to hold true.
This means the existing (unpatched) hpet code works fine
on EL5-32bit (vcpus=1) when hpet is the clocksource,
even when the machine is overcommitted.  A second hypothesis
still needs to be tested that Dave's patch will not make this worse.

(Note that per previous discussion, my EL5u1-32bit guest
running on an Intel dual-core physical box chose tsc as
the best clocksource and I had to override it with
clock=hpet in the kernel command line.)

> Yes, that makes sense and concurs with what I remember from
> the EL4u5-32 code.  If this is true, one would expect the
> default "no missed tick" policy to see time moving faster
> than an external source -- the first missed tick delivered
> after a long sleep would "catch up" and then the remainder
> would each add another tick.

Indeed with the existing (unpatched) hpet code, time is
running faster on EL4u5-32 (vcpus=1, when overcommitted).
So Dave's patch is definitely needed here.
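
A toy model of the effect hypothesized above. It assumes a guest tick handler that credits the whole ticks elapsed since the previous interrupt but never less than one per interrupt. This is an illustration only, not the EL4u5 code, and guest_ticks_after_backlog is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model, not the RHEL handler. The vcpu is descheduled for
 * `asleep_ticks` tick periods, then the "no missed tick" policy delivers
 * every owed interrupt back to back. Returns the ticks the guest counts.
 */
uint64_t guest_ticks_after_backlog(uint64_t tick_ns, uint64_t asleep_ticks)
{
    uint64_t guest_ticks = 0;
    uint64_t last_irq = 0;
    uint64_t now = asleep_ticks * tick_ns;   /* vcpu runs again here */

    assert(tick_ns > 0);
    for (uint64_t i = 0; i < asleep_ticks; i++) {
        /* Whole ticks since the previous interrupt, as the guest sees it. */
        uint64_t elapsed = (now - last_irq) / tick_ns;

        /* The first interrupt catches up; each later one still adds one. */
        guest_ticks += (elapsed > 0) ? elapsed : 1;
        last_irq = now;
    }
    return guest_ticks;
}
```

Under these assumptions a sleep of N ticks is counted as 2N - 1 ticks, which would make guest time run fast.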

Will try 64-bit next.

Dan

> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Monday, June 09, 2008 9:21 PM
> To: 'Dave Winchell'; 'Keir Fraser'
> Cc: 'xen-devel'; 'Ben Guthro'
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> > I'll tell  you what I recall about this. Tomorrow I'll check the
> > guest code to verify. I think that Linux declares a full tick,
> > even if the interrupt is early. That's the problem.
> 
> Yes, that makes sense and concurs with what I remember from
> the EL4u5-32 code.  If this is true, one would expect the
> default "no missed tick" policy to see time moving faster
> than an external source -- the first missed tick delivered
> after a long sleep would "catch up" and then the remainder
> would each add another tick.
> 
> > On the other hand, if the interrupt is late it in effect declares
> > a tick plus fraction. If it just declared the fraction in the first
> > place, we could deliver the interrupts whenever we wanted.
> 
> My read of the EL4u5-32 code is that the fraction is discarded
> and a new tick period commences at "now", so the fractions
> eventually accumulate as lost time.
> 
> In EL5u1-32 however it looks like the fractions are accounted
> for.  Indeed the EL5u1-32 "lost tick handling" code resembles
> the Linux/ia64 code which is what I've always assumed was
> the "missed tick" model.  In this case, I think no policy
> is necessary and the measured skew should be identical to
> any physical hpet skew.  I'll have to test this hypothesis though.
> 
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of 
> Dave Winchell
> Sent: Monday, June 09, 2008 5:35 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Dave Winchell; xen-devel; Ben Guthro
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> > > The Linux policy is more subtle, but is required to go
> > > from .1% to .03%.
> 
> > Thanks for the good documentation which I hadn't thoroughly
> > read until now.
> > I now understand that the essence of your
> > hpet missed ticks policy is to ensure that ticks are never
> > delivered too close together.  But I'm trying to understand
> > WHY your patch works, in other words, what problem it is
> > countering.
> 
> I'll tell  you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.
> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the first
> place, we could deliver the interrupts whenever we wanted.
> 
> It's really not that different from the missed ticks policy in vpt.c,
> except that there the period is based on start of interrupt,
> and I have improved that with end-of-interrupt as described
> in the patch note.
> 
> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.
> 
> >  I care about this for more reasons than just
> > because it is interesting: (1) I'd like to feel confident that
> > it is fixing a bug rather than just a symptom of a bug;
> > and (2) I wonder how universally it is applicable.
> 
> It's worked well with my small set of guests. You and our
> QA are going to tell us about the wider set. It doesn't
> matter if guest A handles interrupts closely spaced or not,
> just whether it handles them far apart. So it should be pretty
> universal with guests that really handle missed ticks.
> I think it's interesting that some 32bit Linux guests handle
> missed ticks for hpet.
> 
> > I see from code examination in mark_offset_hpet() in
> > RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> > the correction for lost ticks is just plain wrong in
> > a virtual environment. (Suppose for example that a virtual
> > tick was delivered every 1.999*hpet_tick... I think
> > the clock would be off by 50%!)  Is this the bug that
> > is being "countered" by your policy?
> 
> I haven't looked at that code, perhaps.
> I'll check it tomorrow.
> 
> > However, the lost tick handling in RHEL5u1/kernel/timer.c
> > (which I think is used also for hpet) is much better
> > so I am eager to find out if your policy works there
> > too.
> > If the hpet missed tick policy works for both, though,
> > I should be happy, though I wonder about upstream kernels
> > (e.g. the trend toward tickless).
> 
> I wasn't aware of this trend. If it's robust, however, it should
> handle late interrupts ...
> 
> > That said, I'd rather
> > see this get into Xen 3.3 and worry about upstream kernels
> > later :-)
> 
> Regards,
> Dave
> 
> 
> 
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Mon 6/9/2008 6:02 PM
> To: Dave Winchell; Keir Fraser
> Cc: Ben Guthro; xen-devel
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.
> 
> Thanks for the good documentation which I hadn't thoroughly
> read until now.  I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together.  But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.  I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.
> 
> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!)  Is this the bug that
> is being "countered" by your policy?
> 
> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better
> so I am eager to find out if your policy works there
> too.
> 
> If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless).  That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)
> 
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Sunday, June 08, 2008 2:32 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> Hi Dan,
> 
> > While I am fully supportive of offering hardware hpet as an option
> > for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> > surprised by your preliminary results; the most obvious conclusion
> > is that Xen system time is losing time at the rate of 1000 PPM
> > though it's possible there's a bug somewhere else in the "time
> > stack".  Your Windows result is jaw-dropping and inexplicable,
> > though I have to admit ignorance of how Windows manages time.
> 
> I think xen system time is fine. You have to add the interrupt
> delivery policies described in the write-up for the patch to get
> accurate timekeeping in the guest.
> 
> The windows policy is obvious and results in a large improvement
> in accuracy. The Linux policy is more subtle, but is required to go
> from .1% to .03%.
> 
> > I think with my recent patch and hpet=1 (essentially the same as
> > your emulated hpet), hvm guest time should track Xen system time.
> > I wonder if domain0 (which if I understand correctly is directly
> > using Xen system time) is also seeing an error of .1%?  Also
> > I wonder for the skew you are seeing (in both hvm guests and
> > domain0) is time moving too fast or too slow?
> 
> I don't recall the direction. I can look it up in my notes at work
> tomorrow.
> 
> > Although hwhpet=1 is a fine alternative in many cases, it may
> > be unavailable on some systems and may cause significant performance
> > issues on others.  So I think we will still need to track down
> > the poor accuracy when hwhpet=0.
> 
> Our patch is accurate to < .03% using the physical hpet mode or
> the simulated mode.
> 
> > And if for some reason
> > Xen system time can't be made accurate enough (< 0.05%), then
> > I think we should consider building Xen system time itself on
> > top of hardware hpet instead of TSC... at least when Xen discovers
> > a capable hpet.
> 
> In our experience, Xen system time is accurate enough now.
> 
> > One more thought... do you know the accuracy of the TSC crystals
> > on your test systems?  I posted a patch awhile ago that was
> > intended to test that, though I guess it was only testing skew
> > of different TSCs on the same system, not TSCs against an
> > external time source.
> 
> I do not know the tsc accuracy.
> 
> > Or maybe there's a computation error somewhere in the hvm hpet
> > scaling code?  Hmmm...
> 
> 
> Regards,
> Dave
> 
> 
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Fri 6/6/2008 4:29 PM
> To: Dave Winchell; Keir Fraser
> Cc: Ben Guthro; xen-devel
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> Dave --
> 
> Thanks much for posting the preliminary results!
> 
> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack".  Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.
> 
> 
> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%?  Also
> I wonder for the skew you are seeing (in both hvm guests and
> domain0) is time moving too fast or too slow?
> 
> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others.  So I think we will still need to track down
> the poor accuracy when hwhpet=0.  And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.
> 
> One more thought... do you know the accuracy of the TSC crystals
> on your test systems?  I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.
> 
> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?  Hmmm...
> 
> Thanks,
> Dan
> 
> > -----Original Message-----
> > From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> > Sent: Friday, June 06, 2008 1:33 PM
> > To: dan.magenheimer@oracle.com; Keir Fraser
> > Cc: Ben Guthro; xen-devel; Dave Winchell
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >
> > Dan, Keir:
> >
> > Preliminary test results indicate an error of .1% for Linux 64 bit
> > guests configured for hpet with xen-unstable as is. As we have
> > discussed many times, the ntp requirement is .05%.
> > Tests on the patch we just submitted for hpet have indicated errors
> > of .0012% on this platform under similar test conditions and .03% on
> > other platforms.
> >
> > Windows vista64 has an error of 11% using hpet with the
> > xen-unstable bits.
> > In an overnight test with our hpet patch, the Windows vista
> > error was .008%.
> >
> > The tests are with two or three guests on a physical node, all under
> > load, and with
> > the ratio of vcpus to phys cpus > 1.
> >
> > I will continue to run tests over the next few days.
> >
> > thanks,
> > Dave
> >
> >
> > Dan Magenheimer wrote:
> >
> > > Hi Dave and Ben --
> > >
> > > > When running tests on xen-unstable (without your patch), please
> > > > ensure that hpet=1 is set in the hvm config and also I think that
> > > > when hpet is the clocksource on RHEL4-32, the clock IS resilient to
> > > > missed ticks so timer_mode should be 2 (vs when pit is the
> > > > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > > > timer_mode should be 0).
> > >
> > > > Per
> > > > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > > > it's my intent to clean this up, but I won't get to it until next week.
> > >
> > > Thanks,
> > > Dan
> > >
> > >     -----Original Message-----
> > >     *From:* xen-devel-bounces@lists.xensource.com
> > >     [mailto:xen-devel-bounces@lists.xensource.com]*On
> > Behalf Of *Dave
> > >     Winchell
> > >     *Sent:* Friday, June 06, 2008 4:46 AM
> > >     *To:* Keir Fraser; Ben Guthro; xen-devel
> > >     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
> > >     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> > >
> > >     Keir,
> > >
> > >     I think the changes are required. We'll run some tests today so
> > >     that we have some data to talk about.
> > >
> > >     -Dave
> > >
> > >
> > >     -----Original Message-----
> > >     From: xen-devel-bounces@lists.xensource.com on behalf
> > of Keir Fraser
> > >     Sent: Fri 6/6/2008 4:58 AM
> > >     To: Ben Guthro; xen-devel
> > >     Cc: dan.magenheimer@oracle.com
> > >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> > >
> > >     Are these patches needed now the timers are built on Xen system
> > >     time rather than host TSC? Dan has reported much better
> > >     time-keeping with his patch checked in, and it's for sure a lot
> > >     less invasive than this patchset.
> > >
> > >
> > >      -- Keir
> > >
> > >     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> > >
> > >     >
> > >     > 1. Introduction
> > >     >
> > >     > This patch improves the hpet based guest clock in
> > terms of drift and
> > >     > monotonicity.
> > >     > Prior to this work the drift with hpet was greater
> > than 2%, far
> > >     above the .05%
> > >     > limit
> > >     > for ntp to synchronize. With this code, the drift 
> ranges from
> > >     .001% to .0033%
> > >     > depending
> > >     > on guest and physical platform.
> > >     >
> > >     > Using hpet allows guest operating systems to 
> provide monotonic
> > >     time to their
> > >     > applications. Time sources other than hpet are not
> > monotonic because
> > >     > of their reliance on tsc, which is not synchronized
> > across physical
> > >     > processors.
> > >     >
> > >     > Windows 2k864 and many Linux guests are supported with two
> > >     policies, one for
> > >     > guests
> > >     > that handle missed clock interrupts and the other for guests
> > >     that require the
> > >     > correct number of interrupts.
> > >     >
> > >     > Guests may use hpet for the timing source even if 
> the physical
> > >     platform has no
> > >     > visible
> > >     > hpet. Migration is supported between physical machines which
> > >     differ in
> > >     > physical
> > >     > hpet visibility.
> > >     >
> > >     > Most of the changes are in hpet.c. Two general 
> facilities are
> > >     added to track
> > >     > interrupt
> > >     > progress. The ideas here and the facilities would 
> be useful in
> > >     vpt.c, for
> > >     > other time
> > >     > sources, though no attempt is made here to improve vpt.c.
> > >     >
> > >     > The following sections discuss hpet dependencies, interrupt
> > >     delivery policies,
> > >     > live migration,
> > >     > test results, and relation to recent work with 
> monotonic time.
> > >     >
> > >     >
> > >     > 2. Virtual Hpet dependencies
> > >     >
> > >     > The virtual hpet depends on the ability to read the
> > physical or
> > >     simulated
> > >     > (see discussion below) hpet.  For timekeeping, the
> > virtual hpet
> > >     also depends
> > >     > on two new interrupt notification facilities to 
> implement its
> > >     policies for
> > >     > interrupt delivery.
> > >     >
> > >     > 2.1. Two modes of low-level hpet main counter reads.
> > >     >
> > >     > In this implementation, the virtual hpet reads with
> > >     read_64_main_counter(),
> > >     > exported by
> > >     > time.c, either the real physical hpet main counter register
> > >     directly or a
> > >     > "simulated"
> > >     > hpet main counter.
> > >     >
> > >     > The simulated mode uses a monotonic version of get_s_time()
> > >     (NOW()), where the
> > >     > last
> > >     > time value is returned whenever the current time 
> value is less
> > >     than the last
> > >     > time
> > >     > value. In simulated mode, since it is layered on s_time, the
> > >     underlying
> > >     > hardware
> > >     > can be hpet or some other device. The frequency of the main
> > >     counter in
> > >     > simulated
> > >     > mode is the same as the standard physical hpet frequency,
> > >     allowing live
> > >     > migration
> > >     > between nodes that are configured differently.
> > >     >
> > >     > If the physical platform does not have an hpet
> > device, or if xen
> > >     is configured
> > >     > not
> > >     > to use the device, then the simulated method is 
> used. If there
> > >     is a physical
> > >     > hpet device,
> > >     > and xen has initialized it, then either simulated 
> or physical
> > >     mode can be
> > >     > used.
> > >     > This is governed by a boot time option, hpet-avoid.
> > Setting this
> > >     option to 1
> > >     > gives the
> > >     > simulated mode and 0 the physical mode. The default
> > is physical
> > >     mode.
> > >     >
> > >     > A disadvantage of the physical mode is that it may take longer
> > >     > to read the device than in simulated mode. On some platforms the
> > >     > cost is about the same (less than 250 nsec) for physical and
> > >     > simulated modes, while on others the physical cost is much higher
> > >     > than simulated.
> > >     > A disadvantage of the simulated mode is that it can 
> return the
> > >     same value
> > >     > for the counter in consecutive calls.
> > >     >
> > >     > 2.2. Interrupt notification facilities.
> > >     >
> > >     > Two interrupt notification facilities are introduced, one is
> > >     > hvm_isa_irq_assert_cb()
> > >     > and the other hvm_register_intr_en_notif().
> > >     >
> > >     > The vhpet uses hvm_isa_irq_assert_cb to deliver 
> interrupts to
> > >     the vioapic.
> > >     > hvm_isa_irq_assert_cb allows a callback to be 
> passed along to
> > >     > vioapic_deliver()
> > >     > and this callback is called with a mask of the vcpus
> > which will
> > >     get the
> > >     > interrupt. This callback is made before any vcpus receive an
> > >     interrupt.
> > >     >
> > >     > Vhpet uses hvm_register_intr_en_notif() to register 
> a handler
> > >     for a particular
> > >     > vector that will be called when that vector is injected in
> > >     > [vmx,svm]_intr_assist()
> > >     > and also when the guest finishes handling the 
> interrupt. Here
> > >     finished is
> > >     > defined
> > >     > as the point when the guest re-enables interrupts or
> > lowers the
> > >     tpr value.
> > >     > EOI is not used as the end of interrupt as this is sometimes
> > >     returned before
> > >     > the interrupt handler has done its work. A flag is
> > passed to the
> > >     handler
> > >     > indicating
> > >     > whether this is the injection point (post = 1) or the
> > interrupt
> > >     finished (post
> > >     > = 0) point.
> > >     > The need for the finished point callback is discussed in the
> > >     missed ticks
> > >     > policy section.
> > >     >
> > >     > To prevent a possible early trigger of the finished 
> callback,
> > >     intr_en_notif
> > >     > logic
> > >     > has a two stage arm, the first at injection
> > >     (hvm_intr_en_notif_arm()) and the
> > >     > second when
> > >     > interrupts are seen to be disabled
> > (hvm_intr_en_notif_disarm()).
> > >     Once fully
> > >     > armed, re-enabling
> > >     > interrupts will cause hvm_intr_en_notif_disarm() to
> > make the end
> > >     of interrupt
> > >     > callback. hvm_intr_en_notif_arm() and
> > hvm_intr_en_notif_disarm()
> > >     are called by
> > >     > [vmx,svm]_intr_assist().
> > >     >
> > >     > 3. Interrupt delivery policies
> > >     >
> > >     > The existing hpet interrupt delivery is preserved.
> > This includes
> > >     > vcpu round robin delivery used by Linux and 
> broadcast delivery
> > >     used by
> > >     > Windows.
> > >     >
> > >     > There are two policies for interrupt delivery, one 
> for Windows
> > >     2k8-64 and the
> > >     > other
> > >     > for Linux. The Linux policy takes advantage of the
> > (guest) Linux
> > >     missed tick
> > >     > and offset
> > >     > calculations and does not attempt to deliver the
> > right number of
> > >     interrupts.
> > >     > The Windows policy delivers the correct number of 
> interrupts,
> > >     even if
> > >     > sometimes much
> > >     > closer to each other than the period. The policies 
> are similar
> > >     to those in
> > >     > vpt.c, though
> > >     > there are some important differences.
> > >     >
> > >     > Policies are selected with an HVMOP_set_param
> > hypercall with index
> > >     > HVM_PARAM_TIMER_MODE.
> > >     > Two new values are added,
> > HVM_HPET_guest_computes_missed_ticks and
> > >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The 
> reason that
> > >     two new ones
> > >     > are added is that
> > >     > in some guests (32bit Linux) a no-missed policy is 
> needed for
> > >     clock sources
> > >     > other than hpet
> > >     > and a missed ticks policy for hpet. It was felt that
> > there would
> > >     be less
> > >     > confusion by simply
> > >     > introducing the two hpet policies.
> > >     >
> > >     > 3.1. The missed ticks policy
> > >     >
> > >     > The Linux clock interrupt handler for hpet calculates missed
> > >     ticks and offset
> > >     > using the hpet
> > >     > main counter. The algorithm works well when the 
> time since the
> > >     last interrupt
> > >     > is greater than
> > >     > or equal to a period and poorly otherwise.
> > >     >
> > >     > The missed ticks policy ensures that no two clock
> > interrupts are
> > >     delivered to
> > >     > the guest at
> > >     > a time interval less than a period. A time stamp (hpet main
> > >     counter value) is
> > >     > recorded (by a
> > >     > callback registered with hvm_register_intr_en_notif)
> > when Linux
> > >     finishes
> > >     > handling the clock
> > >     > interrupt. Then, ensuing interrupts are delivered to
> > the vioapic
> > >     only if the
> > >     > current main
> > >     > counter value is a period greater than when the 
> last interrupt
> > >     was handled.
> > >     >
> > >     > Tests showed a significant improvement in clock 
> drift with end
> > >     of interrupt
> > >     > time stamps
> > >     > versus beginning of interrupt[1]. It is believed that
> > the reason
> > >     for the
> > >     > improvement
> > >     > is that the clock interrupt handler goes for a
> > spinlock and can
> > >     be therefore
> > >     > delayed in its
> > >     > processing. Furthermore, the main counter is read 
> by the guest
> > >     under the lock.
> > >     > The net
> > >     > effect is that if we time stamp injection, we can get the
> > >     difference in time
> > >     > between successive interrupt handler lock acquisitions to be
> > >     less than the
> > >     > period.
> > >     >
> > >     > 3.2. The no-missed ticks policy
> > >     >
> > >     > Windows 2k864 keeps very poor time with the missed
> > ticks policy.
> > >     So the
> > >     > no-missed ticks policy
> > >     > was developed. In the no-missed ticks policy we deliver the
> > >     correct number of
> > >     > interrupts,
> > >     > even if they are spaced less than a period apart
> > (when catching up).
> > >     >
> > >     > Windows 2k864 uses a broadcast mode in the interrupt routing
> > >     such that
> > >     > all vcpus get the clock interrupt. The best Windows drift
> > >     performance was
> > >     > achieved when the
> > >     > policy code ensured that all the previous interrupts (on the
> > >     various vcpus)
> > >     > had been injected
> > >     > before injecting the next interrupt to the vioapic.
> > >     >
> > >     > The policy code works as follows. It uses the
> > >     hvm_isa_irq_assert_cb() to
> > >     > record
> > >     > the vcpus to be interrupted in 
> h->hpet.pending_mask. Then, in
> > >     the callback
> > >     > registered
> > >     > with hvm_register_intr_en_notif() at post=1 time it 
> clears the
> > >     current vcpu in
> > >     > the pending_mask.
> > >     > When the pending_mask is clear it decrements
> > >     hpet.intr_pending_nr and if
> > >     > intr_pending_nr is still
> > >     > non-zero posts another interrupt to the ioapic with
> > >     hvm_isa_irq_assert_cb().
> > >     > Intr_pending_nr is incremented in
> > >     hpet_route_decision_not_missed_ticks().
> > >     >
> > >     > The missed ticks policy intr_en_notif callback also uses the
> > >     pending_mask
> > >     > method. So even though
> > >     > Linux does not broadcast its interrupts, the code 
> could handle
> > >     it if it did.
> > >     > In this case the end of interrupt time stamp is 
> made when the
> > >     pending_mask is
> > >     > clear.
> > >     >
> > >     > 4. Live Migration
> > >     >
> > >     > Live migration with hpet preserves the current offset of the
> > >     guest clock with
> > >     > respect
> > >     > to ntp. This is accomplished by migrating all of 
> the state in
> > >     the h->hpet data
> > >     > structure
> > >     > in the usual way. The hp->mc_offset is recalculated on the
> > >     receiving node so
> > >     > that the
> > >     > guest sees a continuous hpet main counter.
> > >     >
> > >     > Code has been added to xc_domain_save.c to send a small message
> > >     > after the domain context is sent. The content of the message is
> > >     > the physical tsc timestamp, last_tsc,
> > >     > read just before the message is sent. When the
> > last_tsc message
> > >     is received in
> > >     > xc_domain_restore.c,
> > >     > another physical tsc timestamp, cur_tsc, is read. The two
> > >     timestamps are
> > >     > loaded into the domain
> > >     > structure as last_tsc_sender and first_tsc_receiver with
> > >     hypercalls. Then
> > >     > xc_domain_hvm_setcontext
> > >     > is called so that hpet_load has access to these time stamps.
> > >     Hpet_load uses
> > >     > the timestamps
> > >     > to account for the time spent saving and loading the domain
> > >     context. With this
> > >     > technique,
> > >     > the only neglected time is the time spent sending a small
> > >     network message.
> > >     >
> > >     > 5. Test Results
> > >     >
> > >     > Some recent test results are:
> > >     >
> > >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> > >     >       Duration: 70 hours.
> > >     >       Test date: 6/2/08
> > >     >       Loads: usex -b48 on Linux; burn-in on Windows
> > >     >       Guest vcpus: 8 for Linux; 2 for Windows
> > >     >       Hardware: 8 physical cpu AMD
> > >     >       Clock drift : Linux: .0012% Windows: .009%
> > >     >
> > >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 
> no-load test
> > >     >       Duration: 23 hours.
> > >     >       Test date: 6/3/08
> > >     >       Loads: none
> > >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> > >     >       Hardware: 4 physical cpu AMD
> > >     >       Clock drift : Linux: .033% Windows: .019%
> > >     >
> > >     > 6. Relation to recent work in xen-unstable
> > >     >
> > >     > There is a similarity between hvm_get_guest_time() in
> > >     xen-unstable and
> > >     > read_64_main_counter()
> > >     > in this code. However, read_64_main_counter() is 
> more tuned to
> > >     the needs of
> > >     > hpet.c. It has no
> > >     > "set" operation, only the get. It isolates the mode,
> > physical or
> > >     simulated, in
> > >     > read_64_main_counter()
> > >     > itself. It uses no vcpu or domain state as it is a physical
> > >     entity, in either
> > >     > mode. And it provides a real
> > >     > physical mode for every read for those applications
> > that desire
> > >     this.
> > >     >
> > >     > 7. Conclusion
> > >     >
> > >     > The virtual hpet is improved by this patch in terms
> > of accuracy and
> > >     > monotonicity.
> > >     > Tests performed to date verify this and more testing
> > is under way.
> > >     >
> > >     > 8. Future Work
> > >     >
> > >     > Testing with Windows Vista will be performed soon. 
> The reason
> > >     for accuracy
> > >     > variations
> > >     > on different platforms using the physical hpet 
> device will be
> > >     investigated.
> > >     > Additional overhead measurements on simulated vs 
> physical hpet
> > >     mode will be
> > >     > made.
> > >     >
> > >     > Footnotes:
> > >     >
> > >     > 1. I don't recall the accuracy improvement with end
> > of interrupt
> > >     stamping, but
> > >     > it was
> > >     > significant, perhaps better than two to one improvement. It
> > >     would be a very
> > >     > simple matter
> > >     > to re-measure the improvement as the facility can 
> call back at
> > >     injection time
> > >     > as well.
> > >     >
> > >     >
> > >     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> > >     > <mailto:dwinchell@virtualiron.com>
> > >     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> > >     > <mailto:bguthro@virtualiron.com>
> > >     >
> > >     >
> > >     > _______________________________________________
> > >     > Xen-devel mailing list
> > >     > Xen-devel@lists.xensource.com
> > >     > http://lists.xensource.com/xen-devel
> > >
> > >
> > >
> >
> >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-10 17:13                         ` Dave Winchell
@ 2008-06-11  8:30                           ` Keir Fraser
  2008-06-11 11:38                             ` Dave Winchell
  0 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-11  8:30 UTC (permalink / raw)
  To: Dave Winchell; +Cc: dan.magenheimer, xen-devel, Ben Guthro

I implemented the monotonicity guarantee within hvm_get_guest_time(). We
don't need or want get_s_time_mono().

 -- Keir

On 10/6/08 18:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> Keir, Dan:
> 
> Although I plan to break up the patch, etc., I'm posting
> this fix to the patch for anyone who might be interested.
> 
> thanks,
> Dave
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com
> #   vi-patch: xen-hpet
> #   
> #   Bug Id: 6057 
> #   
> #   Reviewed by: Robert
> #   
> #   SUMMARY: Fix wrap issue in monotonic s_time().
> # 
> # xen/arch/x86/time.c
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com +3 -2
> #   Fix wrap issue in monotonic s_time().
> # 
> diff -Nru a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> --- a/xen/arch/x86/time.c 2008-06-10 13:08:39 -04:00
> +++ b/xen/arch/x86/time.c 2008-06-10 13:08:39 -04:00
> @@ -534,7 +534,7 @@
>      u64 count;
>      unsigned long flags;
>      struct cpu_time *t = &this_cpu(cpu_time);
> -    u64 tsc, delta;
> +    u64 tsc, delta, diff;
>      s_time_t now;
>  
>      if(hpet_main_counter_phys_avoid_hdw || !hpet_physical_inited) {
> @@ -542,7 +542,8 @@
>          rdtscll(tsc);
>          delta = tsc - t->local_tsc_stamp;
>          now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
> -        if(now > get_s_time_mon.last_ret)
> +        diff = (u64)now - (u64)get_s_time_mon.last_ret;
> +        if((s64)diff > (s64)0)
>              get_s_time_mon.last_ret = now;
>          else
>              now = get_s_time_mon.last_ret;
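
For readers following the wrap fix in the diff quoted above, here is a
minimal sketch of the same signed-difference comparison in isolation. It
is not the patch itself: mono_clamp() and its last_ret argument are
hypothetical names standing in for the get_s_time_mon.last_ret
bookkeeping, and the sketch assumes the usual two's-complement result
when a 64-bit unsigned difference is converted to a signed type.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch.
 * Returns a value that never moves backwards relative to *last_ret.
 * The comparison is done on the signed difference, so it keeps working
 * when the 64-bit value wraps past zero (assumes two's complement).
 */
uint64_t mono_clamp(uint64_t *last_ret, uint64_t now)
{
    int64_t diff = (int64_t)(now - *last_ret);

    if (diff > 0)
        *last_ret = now;    /* time advanced: remember the new value */
    else
        now = *last_ret;    /* stalled or went backwards: reuse the last value */

    return now;
}
```

A plain "now > last_ret" test on the unsigned values would reject a
small post-wrap value as being "in the past", which is the kind of
failure the signed difference avoids.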

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-11  8:30                           ` Keir Fraser
@ 2008-06-11 11:38                             ` Dave Winchell
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-11 11:38 UTC (permalink / raw)
  To: Keir Fraser; +Cc: dan.magenheimer, xen-devel, Dave Winchell, Ben Guthro


[-- Attachment #1.1: Type: text/plain, Size: 2117 bytes --]

> I implemented the monotonicity guarantee within hvm_get_guest_time(). We
> don't need or want get_s_time_mono().

I'll give hvm_get_guest_time() another look.

-Dave



-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Wed 6/11/2008 4:30 AM
To: Dave Winchell
Cc: dan.magenheimer@oracle.com; Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
 
I implemented the monotonicity guarantee within hvm_get_guest_time(). We
don't need or want get_s_time_mono().

 -- Keir

On 10/6/08 18:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> Keir, Dan:
> 
> Although I plan to break up the patch, etc., I'm posting
> this fix to the patch for anyone who might be interested.
> 
> thanks,
> Dave
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com
> #   vi-patch: xen-hpet
> #   
> #   Bug Id: 6057 
> #   
> #   Reviewed by: Robert
> #   
> #   SUMMARY: Fix wrap issue in monotonic s_time().
> # 
> # xen/arch/x86/time.c
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com +3 -2
> #   Fix wrap issue in monotonic s_time().
> # 
> diff -Nru a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> --- a/xen/arch/x86/time.c 2008-06-10 13:08:39 -04:00
> +++ b/xen/arch/x86/time.c 2008-06-10 13:08:39 -04:00
> @@ -534,7 +534,7 @@
>      u64 count;
>      unsigned long flags;
>      struct cpu_time *t = &this_cpu(cpu_time);
> -    u64 tsc, delta;
> +    u64 tsc, delta, diff;
>      s_time_t now;
>  
>      if(hpet_main_counter_phys_avoid_hdw || !hpet_physical_inited) {
> @@ -542,7 +542,8 @@
>          rdtscll(tsc);
>          delta = tsc - t->local_tsc_stamp;
>          now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
> -        if(now > get_s_time_mon.last_ret)
> +        diff = (u64)now - (u64)get_s_time_mon.last_ret;
> +        if((s64)diff > (s64)0)
>              get_s_time_mon.last_ret = now;
>          else
>              now = get_s_time_mon.last_ret;





^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-11  1:44                   ` Dan Magenheimer
@ 2008-06-11 13:58                     ` Dave Winchell
  2008-06-11 16:47                       ` Dan Magenheimer
  2008-06-12 22:51                     ` Dan Magenheimer
  1 sibling, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-11 13:58 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com
  Cc: Dave Winchell, xen-devel, Keir Fraser, Ben Guthro

Dan Magenheimer wrote:

>>In EL5u1-32 however it looks like the fractions are accounted
>>for.  Indeed the EL5u1-32 "lost tick handling" code resembles
>>the Linux/ia64 code which is what I've always assumed was
>>the "missed tick" model.  In this case, I think no policy
>>is necessary and the measured skew should be identical to
>>any physical hpet skew.  I'll have to test this hypothesis though.
>>    
>>
>
>I've tested this hypothesis and it seems to hold true.
>This means the existing (unpatched) hpet code works fine
>on EL5-32bit (vcpus=1) when hpet is the clocksource,
>even when the machine is overcommitted.  A second hypothesis
>still needs to be tested that Dave's patch will not make this worse.
>  
>
Interesting, thanks for pointing this out and confirming.

>(Note that per previous discussion, my EL5u1-32bit guest
>running on an Intel dual-core physical box chose tsc as
>the best clocksource and I had to override it with
>clock=hpet in the kernel command line.)
>  
>
Is there one setting for all Linux guests that makes them
choose hpet? Is it "clock=hpet clocksource=hpet"?
I know you wrote at length about this before.

>  
>
>>Yes, that makes sense and concurs with what I remember from
>>the EL4u5-32 code.  If this is true, one would expect the
>>default "no missed tick" policy to see time moving faster
>>than an external source -- the first missed tick delivered
>>after a long sleep would "catch up" and then the remainder
>>would each add another tick.
>>    
>>
>
>Indeed with the existing (unpatched) hpet code, time is
>running faster on EL4u5-32 (vcpus=1, when overcommitted).
>So Dave's patch is definitely needed here.
>  
>
It's good to get the verification of this.
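
The "time runs fast" effect can be sketched with a toy model. This is an
illustration under stated assumptions, not the guest's actual code:
guest_tick() is a hypothetical handler that credits whole elapsed
periods from the main counter and always credits at least one tick, and
deliver_backlog() is a hypothetical stand-in for the hypervisor
delivering interrupts that queued up while the guest was descheduled.
The spacing check uses the time of the last handled interrupt, which is
a simplification of the end-of-interrupt time stamp in the patch note.

```c
#include <stdint.h>

#define PERIOD 1000u    /* main counter units per tick (illustrative value) */

struct guest {
    uint64_t last_mark; /* counter value at the last handled interrupt */
    uint64_t ticks;     /* ticks the guest has credited so far */
};

/* Simplified guest handler: credit whole elapsed periods, at least one. */
void guest_tick(struct guest *g, uint64_t counter)
{
    uint64_t elapsed = (counter - g->last_mark) / PERIOD;

    if (elapsed < 1)
        elapsed = 1;    /* an early interrupt still counts as a full tick */
    g->ticks += elapsed;
    g->last_mark = counter;
}

/*
 * Deliver `pending` queued interrupts while the counter reads `counter`.
 * enforce_spacing != 0 models the missed ticks policy: an interrupt is
 * skipped unless a full period has passed since the last handled one.
 * Returns the guest's tick count after delivery.
 */
uint64_t deliver_backlog(struct guest *g, uint64_t counter,
                         unsigned pending, int enforce_spacing)
{
    for (unsigned i = 0; i < pending; i++) {
        if (enforce_spacing && counter - g->last_mark < PERIOD)
            continue;   /* too close to the previous interrupt: drop it */
        guest_tick(g, counter);
    }
    return g->ticks;
}
```

In this model, with five interrupts queued while the counter advanced by
five periods, back-to-back delivery credits 9 ticks (5 from the catch-up
plus 4 early ones), while enforcing the one-period spacing credits 5.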

thanks,
Dave

>Will try 64-bit next.
>
>Dan
>
>  
>
>>-----Original Message-----
>>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>>Sent: Monday, June 09, 2008 9:21 PM
>>To: 'Dave Winchell'; 'Keir Fraser'
>>Cc: 'xen-devel'; 'Ben Guthro'
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>
>>    
>>
>>>I'll tell you what I recall about this. Tomorrow I'll check the
>>>guest code to verify. I think that Linux declares a full tick,
>>>even if the interrupt is early. That's the problem.
>>>      
>>>
>>Yes, that makes sense and concurs with what I remember from
>>the EL4u5-32 code.  If this is true, one would expect the
>>default "no missed tick" policy to see time moving faster
>>than an external source -- the first missed tick delivered
>>after a long sleep would "catch up" and then the remainder
>>would each add another tick.
>>
>>    
>>
>>>On the other hand, if the interrupt is late it in effect declares
>>>a tick plus fraction. If it just declared the fraction in 
>>>      
>>>
>>the first place,
>>    
>>
>>>we could deliver the interrupts whenever we wanted.
>>>      
>>>
>>My read of the EL4u5-32 code is that the fraction is discarded
>>and a new tick period commences at "now", so the fractions
>>eventually accumulate as lost time.
>>
>>In EL5u1-32 however it looks like the fractions are accounted
>>for.  Indeed the EL5u1-32 "lost tick handling" code resembles
>>the Linux/ia64 code which is what I've always assumed was
>>the "missed tick" model.  In this case, I think no policy
>>is necessary and the measured skew should be identical to
>>any physical hpet skew.  I'll have to test this hypothesis though.
>>
>>-----Original Message-----
>>From: xen-devel-bounces@lists.xensource.com 
>>[mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of 
>>Dave Winchell
>>Sent: Monday, June 09, 2008 5:35 PM
>>To: dan.magenheimer@oracle.com; Keir Fraser
>>Cc: Dave Winchell; xen-devel; Ben Guthro
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>
>>    
>>
>>>>The Linux policy is more subtle, but is required to go
>>>>from .1% to .03%.
>>>>        
>>>>
>>>Thanks for the good documentation which I hadn't thoroughly
>>>read until now.
>>>I now understand that the essence of your
>>>hpet missed ticks policy is to ensure that ticks are never
>>>delivered too close together.  But I'm trying to understand
>>>WHY your patch works, in other words, what problem it is
>>>countering.
>>>      
>>>
>>I'll tell you what I recall about this. Tomorrow I'll check the
>>guest code to verify. I think that Linux declares a full tick,
>>even if the interrupt is early. That's the problem.
>>On the other hand, if the interrupt is late it in effect declares
>>a tick plus fraction. If it just declared the fraction in the 
>>first place,
>>we could deliver the interrupts whenever we wanted.
>>
>>It's really not that different than the missed ticks policy in vpt.c
>>except that there the period in vpt.c is based on start of interrupt
>>and I have improved that with end-of interrupt as described
>>in the patch note.
>>
>>I don't recall what prompted me to try end-of-interrupt,
>>but I saw a significant improvement. I may have been running
>>a monotonicity test at the same time to explain the lock
>>contention mentioned in the write-up.
>>
>>    
>>
>>> I care about this for more reasons than just
>>>because it is interesting: (1) I'd like to feel confident that
>>>it is fixing a bug rather than just a symptom of a bug;
>>>and (2) I wonder how universally it is applicable.
>>>      
>>>
>>It's worked well with my small set of guests. You and our
>>QA are going to tell us about the wider set. It doesn't
>>matter if guest A handles interrupts closely spaced or not,
>>just whether it handles them far apart. So it should be pretty
>>universal with guests that really handle missed ticks.
>>I think it's interesting that some 32bit Linux guests handle
>>missed ticks for hpet.
>>
>>    
>>
>>>I see from code examination in mark_offset_hpet() in
>>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
>>>the correction for lost ticks is just plain wrong in
>>>a virtual environment. (Suppose for example that a virtual
>>>tick was delivered every 1.999*hpet_tick... I think
>>>the clock would be off by 50%!)  Is this the bug that
>>>is being "countered" by your policy?
>>>      
>>>
>>I haven't looked at that code, perhaps.
>>I'll check it tomorrow.
>>
>>    
>>
>>>However, the lost tick handling in RHEL5u1/kernel/timer.c
>>>(which I think is used also for hpet) is much better
>>>so I am eager to find out if your policy works there
>>>too.
>>>If the hpet missed tick policy works for both, though,
>>>I should be happy, though I wonder about upstream kernels
>>>(e.g. the trend toward tickless).
>>>      
>>>
>>I wasn't aware of this trend. If it's robust, however, it should
>>handle late interrupts ...
>>
>>    
>>
>>>That said, I'd rather
>>>see this get into Xen 3.3 and worry about upstream kernels
>>>later :-)
>>>      
>>>
>>Regards,
>>Dave
>>
>>
>>
>>-----Original Message-----
>>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>>Sent: Mon 6/9/2008 6:02 PM
>>To: Dave Winchell; Keir Fraser
>>Cc: Ben Guthro; xen-devel
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>    
>>
>>>The Linux policy is more subtle, but is required to go
>>>from .1% to .03%.
>>>      
>>>
>>Thanks for the good documentation which I hadn't thoroughly
>>read until now.  I now understand that the essence of your
>>hpet missed ticks policy is to ensure that ticks are never
>>delivered too close together.  But I'm trying to understand
>>WHY your patch works, in other words, what problem it is
>>countering.  I care about this for more reasons than just
>>because it is interesting: (1) I'd like to feel confident that
>>it is fixing a bug rather than just a symptom of a bug;
>>and (2) I wonder how universally it is applicable.
>>
>>I see from code examination in mark_offset_hpet() in
>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
>>the correction for lost ticks is just plain wrong in
>>a virtual environment. (Suppose for example that a virtual
>>tick was delivered every 1.999*hpet_tick... I think
>>the clock would be off by 50%!)  Is this the bug that
>>is being "countered" by your policy?
>>
>>However, the lost tick handling in RHEL5u1/kernel/timer.c
>>(which I think is used also for hpet) is much better
>>so I am eager to find out if your policy works there
>>too.
>>
>>If the hpet missed tick policy works for both, though,
>>I should be happy, though I wonder about upstream kernels
>>(e.g. the trend toward tickless).  That said, I'd rather
>>see this get into Xen 3.3 and worry about upstream kernels
>>later :-)
>>
>>-----Original Message-----
>>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
>>Sent: Sunday, June 08, 2008 2:32 PM
>>To: dan.magenheimer@oracle.com; Keir Fraser
>>Cc: Ben Guthro; xen-devel; Dave Winchell
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>
>>Hi Dan,
>>
>>    
>>
>>>While I am fully supportive of offering hardware hpet as an option
>>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
>>>surprised by your preliminary results; the most obvious conclusion
>>>is that Xen system time is losing time at the rate of 1000 PPM
>>>though it's possible there's a bug somewhere else in the "time
>>>stack".  Your Windows result is jaw-dropping and inexplicable,
>>>though I have to admit ignorance of how Windows manages time.
>>>      
>>>
>>I think xen system time is fine. You have to add the interrupt
>>delivery policies described in the write-up for the patch to get
>>accurate timekeeping in the guest.
>>
>>The windows policy is obvious and results in a large improvement
>>in accuracy. The Linux policy is more subtle, but is required to go
>>from .1% to .03%.
>>
>>    
>>
>>>I think with my recent patch and hpet=1 (essentially the same as
>>>your emulated hpet), hvm guest time should track Xen system time.
>>>I wonder if domain0 (which if I understand correctly is directly
>>>using Xen system time) is also seeing an error of .1%?  Also
>>>I wonder for the skew you are seeing (in both hvm guests and
>>>domain0) is time moving too fast or too slow?
>>>      
>>>
>>I don't recall the direction. I can look it up in my notes at work
>>tomorrow.
>>
>>    
>>
>>>Although hwhpet=1 is a fine alternative in many cases, it may
>>>be unavailable on some systems and may cause significant performance
>>>issues on others.  So I think we will still need to track down
>>>the poor accuracy when hwhpet=0.
>>>      
>>>
>>Our patch is accurate to < .03% using the physical hpet mode or
>>the simulated mode.
>>
>>    
>>
>>>And if for some reason
>>>Xen system time can't be made accurate enough (< 0.05%), then
>>>I think we should consider building Xen system time itself on
>>>top of hardware hpet instead of TSC... at least when Xen discovers
>>>a capable hpet.
>>>      
>>>
>>In our experience, Xen system time is accurate enough now.
>>
>>    
>>
>>>One more thought... do you know the accuracy of the TSC crystals
>>>on your test systems?  I posted a patch awhile ago that was
>>>intended to test that, though I guess it was only testing skew
>>>of different TSCs on the same system, not TSCs against an
>>>external time source.
>>>      
>>>
>>I do not know the tsc accuracy.
>>
>>    
>>
>>>Or maybe there's a computation error somewhere in the hvm hpet
>>>scaling code?  Hmmm...
>>>      
>>>
>>Regards,
>>Dave
>>
>>
>>-----Original Message-----
>>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>>Sent: Fri 6/6/2008 4:29 PM
>>To: Dave Winchell; Keir Fraser
>>Cc: Ben Guthro; xen-devel
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>Dave --
>>
>>Thanks much for posting the preliminary results!
>>
>>While I am fully supportive of offering hardware hpet as an option
>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
>>surprised by your preliminary results; the most obvious conclusion
>>is that Xen system time is losing time at the rate of 1000 PPM
>>though it's possible there's a bug somewhere else in the "time
>>stack".  Your Windows result is jaw-dropping and inexplicable,
>>though I have to admit ignorance of how Windows manages time.
>>
>>
>>I think with my recent patch and hpet=1 (essentially the same as
>>your emulated hpet), hvm guest time should track Xen system time.
>>I wonder if domain0 (which if I understand correctly is directly
>>using Xen system time) is also seeing an error of .1%?  Also
>>I wonder for the skew you are seeing (in both hvm guests and
>>domain0) is time moving too fast or too slow?
>>
>>Although hwhpet=1 is a fine alternative in many cases, it may
>>be unavailable on some systems and may cause significant performance
>>issues on others.  So I think we will still need to track down
>>the poor accuracy when hwhpet=0.  And if for some reason
>>Xen system time can't be made accurate enough (< 0.05%), then
>>I think we should consider building Xen system time itself on
>>top of hardware hpet instead of TSC... at least when Xen discovers
>>a capable hpet.
>>
>>One more thought... do you know the accuracy of the TSC crystals
>>on your test systems?  I posted a patch awhile ago that was
>>intended to test that, though I guess it was only testing skew
>>of different TSCs on the same system, not TSCs against an
>>external time source.
>>
>>Or maybe there's a computation error somewhere in the hvm hpet
>>scaling code?  Hmmm...
>>
>>Thanks,
>>Dan
>>
>>    
>>
>>>-----Original Message-----
>>>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
>>>Sent: Friday, June 06, 2008 1:33 PM
>>>To: dan.magenheimer@oracle.com; Keir Fraser
>>>Cc: Ben Guthro; xen-devel; Dave Winchell
>>>Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>>
>>>
>>>Dan, Keir:
>>>
>>>Preliminary test results indicate an error of .1% for Linux 64 bit
>>>guests configured for hpet with xen-unstable as is. As we have
>>>discussed many times, the ntp requirement is .05%.
>>>Tests on the patch we just submitted for hpet have indicated errors
>>>of .0012% on this platform under similar test conditions and .03% on
>>>other platforms.
>>>
>>>Windows vista64 has an error of 11% using hpet with the
>>>xen-unstable bits.
>>>In an overnight test with our hpet patch, the Windows vista
>>>error was .008%.
>>>
>>>The tests are with two or three guests on a physical node, all under
>>>load, and with
>>>the ratio of vcpus to phys cpus > 1.
>>>
>>>I will continue to run tests over the next few days.
>>>
>>>thanks,
>>>Dave
>>>
>>>
>>>Dan Magenheimer wrote:
>>>
>>>      
>>>
>>>>Hi Dave and Ben --
>>>>
>>>>When running tests on xen-unstable (without your patch), please ensure
>>>>that hpet=1 is set in the hvm config and also I think that when hpet
>>>>is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
>>>>so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
>>>>all clock ticks must be delivered and so timer_mode should be 0).
>>>>
>>>>Per
>>>>http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
>>>>it's my intent to clean this up, but I won't get to it until next week.
>>>>
>>>>Thanks,
>>>>Dan
>>>>
>>>>    -----Original Message-----
>>>>    *From:* xen-devel-bounces@lists.xensource.com
>>>>    [mailto:xen-devel-bounces@lists.xensource.com]*On Behalf Of *Dave
>>>>    Winchell
>>>>    *Sent:* Friday, June 06, 2008 4:46 AM
>>>>    *To:* Keir Fraser; Ben Guthro; xen-devel
>>>>    *Cc:* dan.magenheimer@oracle.com; Dave Winchell
>>>>    *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>>>
>>>>    Keir,
>>>>
>>>>    I think the changes are required. We'll run some tests today so
>>>>    that we have some data to talk about.
>>>>
>>>>    -Dave
>>>>
>>>>
>>>>    -----Original Message-----
>>>>    From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
>>>>    Sent: Fri 6/6/2008 4:58 AM
>>>>    To: Ben Guthro; xen-devel
>>>>    Cc: dan.magenheimer@oracle.com
>>>>    Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>>>
>>>>    Are these patches needed now the timers are built on Xen system
>>>>    time rather than host TSC? Dan has reported much better
>>>>    time-keeping with his patch checked in, and it's for sure a lot
>>>>    less invasive than this patchset.
>>>>
>>>>     -- Keir
>>>>
>>>>    On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
>>>>
>>>>    >
>>>>    > 1. Introduction
>>>>    >
>>>>    > This patch improves the hpet based guest clock in
>>>>        
>>>>
>>>terms of drift and
>>>      
>>>
>>>>    > monotonicity.
>>>>    > Prior to this work the drift with hpet was greater
>>>>        
>>>>
>>>than 2%, far
>>>      
>>>
>>>>    above the .05%
>>>>    > limit
>>>>    > for ntp to synchronize. With this code, the drift 
>>>>        
>>>>
>>ranges from
>>    
>>
>>>>    .001% to .0033%
>>>>    > depending
>>>>    > on guest and physical platform.
>>>>    >
>>>>    > Using hpet allows guest operating systems to 
>>>>        
>>>>
>>provide monotonic
>>    
>>
>>>>    time to their
>>>>    > applications. Time sources other than hpet are not
>>>>        
>>>>
>>>monotonic because
>>>      
>>>
>>>>    > of their reliance on tsc, which is not synchronized
>>>>        
>>>>
>>>across physical
>>>      
>>>
>>>>    > processors.
>>>>    >
>>>>    > Windows 2k864 and many Linux guests are supported with two
>>>>    policies, one for
>>>>    > guests
>>>>    > that handle missed clock interrupts and the other for guests
>>>>    that require the
>>>>    > correct number of interrupts.
>>>>    >
>>>>    > Guests may use hpet for the timing source even if 
>>>>        
>>>>
>>the physical
>>    
>>
>>>>    platform has no
>>>>    > visible
>>>>    > hpet. Migration is supported between physical machines which
>>>>    differ in
>>>>    > physical
>>>>    > hpet visibility.
>>>>    >
>>>>    > Most of the changes are in hpet.c. Two general 
>>>>        
>>>>
>>facilities are
>>    
>>
>>>>    added to track
>>>>    > interrupt
>>>>    > progress. The ideas here and the facilities would 
>>>>        
>>>>
>>be useful in
>>    
>>
>>>>    vpt.c, for
>>>>    > other time
>>>>    > sources, though no attempt is made here to improve vpt.c.
>>>>    >
>>>>    > The following sections discuss hpet dependencies, interrupt
>>>>    delivery policies,
>>>>    > live migration,
>>>>    > test results, and relation to recent work with 
>>>>        
>>>>
>>monotonic time.
>>    
>>
>>>>    >
>>>>    >
>>>>    > 2. Virtual Hpet dependencies
>>>>    >
>>>>    > The virtual hpet depends on the ability to read the
>>>>        
>>>>
>>>physical or
>>>      
>>>
>>>>    simulated
>>>>    > (see discussion below) hpet.  For timekeeping, the
>>>>        
>>>>
>>>virtual hpet
>>>      
>>>
>>>>    also depends
>>>>    > on two new interrupt notification facilities to 
>>>>        
>>>>
>>implement its
>>    
>>
>>>>    policies for
>>>>    > interrupt delivery.
>>>>    >
>>>>    > 2.1. Two modes of low-level hpet main counter reads.
>>>>    >
>>>>    > In this implementation, the virtual hpet reads with
>>>>    read_64_main_counter(),
>>>>    > exported by
>>>>    > time.c, either the real physical hpet main counter register
>>>>    directly or a
>>>>    > "simulated"
>>>>    > hpet main counter.
>>>>    >
>>>>    > The simulated mode uses a monotonic version of get_s_time()
>>>>    (NOW()), where the
>>>>    > last
>>>>    > time value is returned whenever the current time 
>>>>        
>>>>
>>value is less
>>    
>>
>>>>    than the last
>>>>    > time
>>>>    > value. In simulated mode, since it is layered on s_time, the
>>>>    underlying
>>>>    > hardware
>>>>    > can be hpet or some other device. The frequency of the main
>>>>    counter in
>>>>    > simulated
>>>>    > mode is the same as the standard physical hpet frequency,
>>>>    allowing live
>>>>    > migration
>>>>    > between nodes that are configured differently.
>>>>    >
>>>>    > If the physical platform does not have an hpet
>>>>        
>>>>
>>>device, or if xen
>>>      
>>>
>>>>    is configured
>>>>    > not
>>>>    > to use the device, then the simulated method is 
>>>>        
>>>>
>>used. If there
>>    
>>
>>>>    is a physical
>>>>    > hpet device,
>>>>    > and xen has initialized it, then either simulated 
>>>>        
>>>>
>>or physical
>>    
>>
>>>>    mode can be
>>>>    > used.
>>>>    > This is governed by a boot time option, hpet-avoid.
>>>>        
>>>>
>>>Setting this
>>>      
>>>
>>>>    option to 1
>>>>    > gives the
>>>>    > simulated mode and 0 the physical mode. The default
>>>>        
>>>>
>>>is physical
>>>      
>>>
>>>>    mode.
>>>>    >
>>>>    > A disadvantage of the physical mode is that it may take longer to
>>>>    > read the device than in simulated mode. On some platforms the cost
>>>>    > is about the same (less than 250 nsec) for physical and simulated
>>>>    > modes, while on others physical cost is much higher than simulated.
>>>>    > A disadvantage of the simulated mode is that it can return the same
>>>>    > value for the counter in consecutive calls.
>>>>    >
>>>>    > 2.2. Interrupt notification facilities.
>>>>    >
>>>>    > Two interrupt notification facilities are introduced, one is
>>>>    > hvm_isa_irq_assert_cb()
>>>>    > and the other hvm_register_intr_en_notif().
>>>>    >
>>>>    > The vhpet uses hvm_isa_irq_assert_cb to deliver 
>>>>        
>>>>
>>interrupts to
>>    
>>
>>>>    the vioapic.
>>>>    > hvm_isa_irq_assert_cb allows a callback to be 
>>>>        
>>>>
>>passed along to
>>    
>>
>>>>    > vioapic_deliver()
>>>>    > and this callback is called with a mask of the vcpus
>>>>        
>>>>
>>>which will
>>>      
>>>
>>>>    get the
>>>>    > interrupt. This callback is made before any vcpus receive an
>>>>    interrupt.
>>>>    >
>>>>    > Vhpet uses hvm_register_intr_en_notif() to register 
>>>>        
>>>>
>>a handler
>>    
>>
>>>>    for a particular
>>>>    > vector that will be called when that vector is injected in
>>>>    > [vmx,svm]_intr_assist()
>>>>    > and also when the guest finishes handling the 
>>>>        
>>>>
>>interrupt. Here
>>    
>>
>>>>    finished is
>>>>    > defined
>>>>    > as the point when the guest re-enables interrupts or
>>>>        
>>>>
>>>lowers the
>>>      
>>>
>>>>    tpr value.
>>>>    > EOI is not used as the end of interrupt as this is sometimes
>>>>    returned before
>>>>    > the interrupt handler has done its work. A flag is
>>>>        
>>>>
>>>passed to the
>>>      
>>>
>>>>    handler
>>>>    > indicating
>>>>    > whether this is the injection point (post = 1) or the
>>>>        
>>>>
>>>interrupt
>>>      
>>>
>>>>    finished (post
>>>>    > = 0) point.
>>>>    > The need for the finished point callback is discussed in the
>>>>    missed ticks
>>>>    > policy section.
>>>>    >
>>>>    > To prevent a possible early trigger of the finished 
>>>>        
>>>>
>>callback,
>>    
>>
>>>>    intr_en_notif
>>>>    > logic
>>>>    > has a two stage arm, the first at injection
>>>>    (hvm_intr_en_notif_arm()) and the
>>>>    > second when
>>>>    > interrupts are seen to be disabled
>>>>        
>>>>
>>>(hvm_intr_en_notif_disarm()).
>>>      
>>>
>>>>    Once fully
>>>>    > armed, re-enabling
>>>>    > interrupts will cause hvm_intr_en_notif_disarm() to
>>>>        
>>>>
>>>make the end
>>>      
>>>
>>>>    of interrupt
>>>>    > callback. hvm_intr_en_notif_arm() and
>>>>        
>>>>
>>>hvm_intr_en_notif_disarm()
>>>      
>>>
>>>>    are called by
>>>>    > [vmx,svm]_intr_assist().
>>>>    >
>>>>    > 3. Interrupt delivery policies
>>>>    >
>>>>    > The existing hpet interrupt delivery is preserved.
>>>>        
>>>>
>>>This includes
>>>      
>>>
>>>>    > vcpu round robin delivery used by Linux and 
>>>>        
>>>>
>>broadcast delivery
>>    
>>
>>>>    used by
>>>>    > Windows.
>>>>    >
>>>>    > There are two policies for interrupt delivery, one 
>>>>        
>>>>
>>for Windows
>>    
>>
>>>>    2k8-64 and the
>>>>    > other
>>>>    > for Linux. The Linux policy takes advantage of the
>>>>        
>>>>
>>>(guest) Linux
>>>      
>>>
>>>>    missed tick
>>>>    > and offset
>>>>    > calculations and does not attempt to deliver the
>>>>        
>>>>
>>>right number of
>>>      
>>>
>>>>    interrupts.
>>>>    > The Windows policy delivers the correct number of 
>>>>        
>>>>
>>interrupts,
>>    
>>
>>>>    even if
>>>>    > sometimes much
>>>>    > closer to each other than the period. The policies 
>>>>        
>>>>
>>are similar
>>    
>>
>>>>    to those in
>>>>    > vpt.c, though
>>>>    > there are some important differences.
>>>>    >
>>>>    > Policies are selected with an HVMOP_set_param
>>>>        
>>>>
>>>hypercall with index
>>>      
>>>
>>>>    > HVM_PARAM_TIMER_MODE.
>>>>    > Two new values are added,
>>>>        
>>>>
>>>HVM_HPET_guest_computes_missed_ticks and
>>>      
>>>
>>>>    > HVM_HPET_guest_does_not_compute_missed_ticks.  The 
>>>>        
>>>>
>>reason that
>>    
>>
>>>>    two new ones
>>>>    > are added is that
>>>>    > in some guests (32bit Linux) a no-missed policy is 
>>>>        
>>>>
>>needed for
>>    
>>
>>>>    clock sources
>>>>    > other than hpet
>>>>    > and a missed ticks policy for hpet. It was felt that
>>>>        
>>>>
>>>there would
>>>      
>>>
>>>>    be less
>>>>    > confusion by simply
>>>>    > introducing the two hpet policies.
>>>>    >
>>>>    > 3.1. The missed ticks policy
>>>>    >
>>>>    > The Linux clock interrupt handler for hpet calculates missed
>>>>    ticks and offset
>>>>    > using the hpet
>>>>    > main counter. The algorithm works well when the 
>>>>        
>>>>
>>time since the
>>    
>>
>>>>    last interrupt
>>>>    > is greater than
>>>>    > or equal to a period and poorly otherwise.
>>>>    >
>>>>    > The missed ticks policy ensures that no two clock
>>>>        
>>>>
>>>interrupts are
>>>      
>>>
>>>>    delivered to
>>>>    > the guest at
>>>>    > a time interval less than a period. A time stamp (hpet main
>>>>    counter value) is
>>>>    > recorded (by a
>>>>    > callback registered with hvm_register_intr_en_notif)
>>>>        
>>>>
>>>when Linux
>>>      
>>>
>>>>    finishes
>>>>    > handling the clock
>>>>    > interrupt. Then, ensuing interrupts are delivered to
>>>>        
>>>>
>>>the vioapic
>>>      
>>>
>>>>    only if the
>>>>    > current main
>>>>    > counter value is a period greater than when the 
>>>>        
>>>>
>>last interrupt
>>    
>>
>>>>    was handled.
>>>>    >
>>>>    > Tests showed a significant improvement in clock 
>>>>        
>>>>
>>drift with end
>>    
>>
>>>>    of interrupt
>>>>    > time stamps
>>>>    > versus beginning of interrupt[1]. It is believed that
>>>>        
>>>>
>>>the reason
>>>      
>>>
>>>>    for the
>>>>    > improvement
>>>>    > is that the clock interrupt handler goes for a
>>>>        
>>>>
>>>spinlock and can
>>>      
>>>
>>>>    be therefore
>>>>    > delayed in its
>>>>    > processing. Furthermore, the main counter is read 
>>>>        
>>>>
>>by the guest
>>    
>>
>>>>    under the lock.
>>>>    > The net
>>>>    > effect is that if we time stamp injection, we can get the
>>>>    difference in time
>>>>    > between successive interrupt handler lock acquisitions to be
>>>>    less than the
>>>>    > period.
>>>>    >
>>>>    > 3.2. The no-missed ticks policy
>>>>    >
>>>>    > Windows 2k864 keeps very poor time with the missed
>>>>        
>>>>
>>>ticks policy.
>>>      
>>>
>>>>    So the
>>>>    > no-missed ticks policy
>>>>    > was developed. In the no-missed ticks policy we deliver the
>>>>    correct number of
>>>>    > interrupts,
>>>>    > even if they are spaced less than a period apart
>>>>        
>>>>
>>>(when catching up).
>>>      
>>>
>>>>    >
>>>>    > Windows 2k864 uses a broadcast mode in the interrupt routing
>>>>    such that
>>>>    > all vcpus get the clock interrupt. The best Windows drift
>>>>    performance was
>>>>    > achieved when the
>>>>    > policy code ensured that all the previous interrupts (on the
>>>>    various vcpus)
>>>>    > had been injected
>>>>    > before injecting the next interrupt to the vioapic.
>>>>    >
>>>>    > The policy code works as follows. It uses the
>>>>    hvm_isa_irq_assert_cb() to
>>>>    > record
>>>>    > the vcpus to be interrupted in 
>>>>        
>>>>
>>h->hpet.pending_mask. Then, in
>>    
>>
>>>>    the callback
>>>>    > registered
>>>>    > with hvm_register_intr_en_notif() at post=1 time it 
>>>>        
>>>>
>>clears the
>>    
>>
>>>>    current vcpu in
>>>>    > the pending_mask.
>>>>    > When the pending_mask is clear it decrements
>>>>    hpet.intr_pending_nr and if
>>>>    > intr_pending_nr is still
>>>>    > non-zero posts another interrupt to the ioapic with
>>>>    hvm_isa_irq_assert_cb().
>>>>    > Intr_pending_nr is incremented in
>>>>    hpet_route_decision_not_missed_ticks().
>>>>    >
>>>>    > The missed ticks policy intr_en_notif callback also uses the
>>>>    pending_mask
>>>>    > method. So even though
>>>>    > Linux does not broadcast its interrupts, the code 
>>>>        
>>>>
>>could handle
>>    
>>
>>>>    it if it did.
>>>>    > In this case the end of interrupt time stamp is 
>>>>        
>>>>
>>made when the
>>    
>>
>>>>    pending_mask is
>>>>    > clear.
>>>>    >
>>>>    > 4. Live Migration
>>>>    >
>>>>    > Live migration with hpet preserves the current offset of the
>>>>    guest clock with
>>>>    > respect
>>>>    > to ntp. This is accomplished by migrating all of 
>>>>        
>>>>
>>the state in
>>    
>>
>>>>    the h->hpet data
>>>>    > structure
>>>>    > in the usual way. The hp->mc_offset is recalculated on the
>>>>    receiving node so
>>>>    > that the
>>>>    > guest sees a continuous hpet main counter.
>>>>    >
>>>>    > Code has been added to xc_domain_save.c to send a 
>>>>        
>>>>
>>small message
>>    
>>
>>>>    after the
>>>>    > domain context is sent. The contents of the message is the
>>>>    physical tsc
>>>>    > timestamp, last_tsc,
>>>>    > read just before the message is sent. When the
>>>>        
>>>>
>>>last_tsc message
>>>      
>>>
>>>>    is received in
>>>>    > xc_domain_restore.c,
>>>>    > another physical tsc timestamp, cur_tsc, is read. The two
>>>>    timestamps are
>>>>    > loaded into the domain
>>>>    > structure as last_tsc_sender and first_tsc_receiver with
>>>>    hypercalls. Then
>>>>    > xc_domain_hvm_setcontext
>>>>    > is called so that hpet_load has access to these time stamps.
>>>>    Hpet_load uses
>>>>    > the timestamps
>>>>    > to account for the time spent saving and loading the domain
>>>>    context. With this
>>>>    > technique,
>>>>    > the only neglected time is the time spent sending a small
>>>>    network message.
>>>>    >
>>>>    > 5. Test Results
>>>>    >
>>>>    > Some recent test results are:
>>>>    >
>>>>    > 5.1 Linux 4u664 and Windows 2k864 load test.
>>>>    >       Duration: 70 hours.
>>>>    >       Test date: 6/2/08
>>>>    >       Loads: usex -b48 on Linux; burn-in on Windows
>>>>    >       Guest vcpus: 8 for Linux; 2 for Windows
>>>>    >       Hardware: 8 physical cpu AMD
>>>>    >       Clock drift : Linux: .0012% Windows: .009%
>>>>    >
>>>>    > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 
>>>>        
>>>>
>>no-load test
>>    
>>
>>>>    >       Duration: 23 hours.
>>>>    >       Test date: 6/3/08
>>>>    >       Loads: none
>>>>    >       Guest vcpus: 8 for each Linux; 2 for Windows
>>>>    >       Hardware: 4 physical cpu AMD
>>>>    >       Clock drift : Linux: .033% Windows: .019%
>>>>    >
>>>>    > 6. Relation to recent work in xen-unstable
>>>>    >
>>>>    > There is a similarity between hvm_get_guest_time() in
>>>>    xen-unstable and
>>>>    > read_64_main_counter()
>>>>    > in this code. However, read_64_main_counter() is 
>>>>        
>>>>
>>more tuned to
>>    
>>
>>>>    the needs of
>>>>    > hpet.c. It has no
>>>>    > "set" operation, only the get. It isolates the mode,
>>>>        
>>>>
>>>physical or
>>>      
>>>
>>>>    simulated, in
>>>>    > read_64_main_counter()
>>>>    > itself. It uses no vcpu or domain state as it is a physical
>>>>    entity, in either
>>>>    > mode. And it provides a real
>>>>    > physical mode for every read for those applications
>>>>        
>>>>
>>>that desire
>>>      
>>>
>>>>    this.
>>>>    >
>>>>    > 7. Conclusion
>>>>    >
>>>>    > The virtual hpet is improved by this patch in terms
>>>>        
>>>>
>>>of accuracy and
>>>      
>>>
>>>>    > monotonicity.
>>>>    > Tests performed to date verify this and more testing
>>>>        
>>>>
>>>is under way.
>>>      
>>>
>>>>    >
>>>>    > 8. Future Work
>>>>    >
>>>>    > Testing with Windows Vista will be performed soon. 
>>>>        
>>>>
>>The reason
>>    
>>
>>>>    for accuracy
>>>>    > variations
>>>>    > on different platforms using the physical hpet 
>>>>        
>>>>
>>device will be
>>    
>>
>>>>    investigated.
>>>>    > Additional overhead measurements on simulated vs 
>>>>        
>>>>
>>physical hpet
>>    
>>
>>>>    mode will be
>>>>    > made.
>>>>    >
>>>>    > Footnotes:
>>>>    >
>>>>    > 1. I don't recall the accuracy improvement with end
>>>>        
>>>>
>>>of interrupt
>>>      
>>>
>>>>    stamping, but
>>>>    > it was
>>>>    > significant, perhaps better than two to one improvement. It
>>>>    would be a very
>>>>    > simple matter
>>>>    > to re-measure the improvement as the facility can 
>>>>        
>>>>
>>call back at
>>    
>>
>>>>    injection time
>>>>    > as well.
>>>>    >
>>>>    >
>>>>    > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
>>>>    > <mailto:dwinchell@virtualiron.com>
>>>>    > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
>>>>    > <mailto:bguthro@virtualiron.com>
>>>>    >
>>>>    >
>>>>    > _______________________________________________
>>>>    > Xen-devel mailing list
>>>>    > Xen-devel@lists.xensource.com
>>>>    > http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-11 13:58                     ` Dave Winchell
@ 2008-06-11 16:47                       ` Dan Magenheimer
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-11 16:47 UTC (permalink / raw)
  To: Dave Winchell; +Cc: xen-devel, Keir Fraser, Ben Guthro

OK, I can confirm that without Dave's patch RHEL4-
and RHEL5-based 64-bit uni-p kernels gain time when hpet is
the clocksource.
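
For anyone following along, here is a toy model of why a guest can gain
time when missed ticks are replayed back to back. It is only a sketch,
not the RHEL kernel code. It assumes the accounting rule described in the
quoted discussion below: each interrupt is credited at least one full
tick even if it arrives early, a late interrupt is credited the whole
periods elapsed since the previous interrupt, and any fraction is
discarded. The function name and the integer time units are made up for
illustration.

```python
def guest_ticks(interrupt_times, period):
    """Count the ticks a toy guest credits for a series of interrupts.

    interrupt_times: increasing integer timestamps (same units as period).
    period: nominal tick period.
    Assumed rule (hypothetical, for illustration): credit the whole
    periods elapsed since the previous interrupt, but never less than
    one tick, and drop the fractional remainder.
    """
    ticks = 0
    last = 0
    for now in interrupt_times:
        elapsed = now - last
        ticks += max(1, elapsed // period)
        last = now  # a new tick period starts at "now"
    return ticks


PERIOD = 1000

# Guest was descheduled for 10 periods, then 10 interrupts are replayed
# one time unit apart: the first one catches up, the rest each add a tick.
burst = [10000 + i for i in range(10)]
print(guest_ticks(burst, PERIOD))      # 19 ticks credited for ~10 periods

# Same gap, but only one interrupt is delivered after it.
print(guest_ticks([10000], PERIOD))    # 10 ticks credited for 10 periods

# An interrupt every 1.999 periods: the 0.999 fraction is dropped each time.
slow = [1999 * (i + 1) for i in range(1000)]
print(guest_ticks(slow, PERIOD))       # 1000 ticks credited for 1999 periods
```

In this model the replayed burst makes the guest run fast, a single
well-spaced interrupt keeps it correct, and the 1.999-period spacing
loses about half the elapsed time. Real kernels differ in the details,
so treat it as a way to reason about the policies rather than a
prediction.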

But WHOA with vcpus=2, el5u1-32 time suddenly goes crazy
when domains are added whereas it seems fine when vcpus=1.

All my testing so far has been on 3.1.3, so I am going to
redo it on xen-unstable, first without Dave's patch then
with.

> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.

In the hvm configuration file:

hpet=1
acpi=1

(Note, acpi unspecified works too as 1 appears to be the default;
but hpet=1 is ignored if acpi=0)

In the kernel command line of the hvm domain (e.g. in grub.conf):

clock=hpet notsc nopmtimer

(Note, a different set of kernel parameters is necessary for each
kernel but because the kernel either ignores or gives harmless
warnings for invalid parameters, this set should always result
in hpet being selected as the clocksource, at least on all
RHEL4- and RHEL5-based kernels I've tested.)
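
To double-check which clocksource a guest actually picked, something
like the following sketch can be run inside the guest. It assumes the
guest kernel exposes the sysfs clocksource files under
/sys/devices/system/clocksource/clocksource0, which not every kernel of
this vintage does; where the files are missing, the boot messages are
the place to look instead. The function name is made up for
illustration.

```python
from pathlib import Path

# Assumed location of the sysfs clocksource interface; older guest
# kernels may not provide it at all.
SYSFS_CLOCKSOURCE = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=SYSFS_CLOCKSOURCE):
    """Return (current, available) clocksource names read from sysfs.

    Raises OSError (e.g. FileNotFoundError) if the interface is absent.
    """
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def report(base=SYSFS_CLOCKSOURCE):
    """Print a one-line summary, or a hint if sysfs has no clocksource files."""
    try:
        current, available = clocksource_status(base)
    except OSError:
        print("no sysfs clocksource interface here; check the boot messages")
        return
    note = "" if current == "hpet" else " (hpet was NOT selected)"
    print("current=%s available=%s%s" % (current, ",".join(available), note))
```

Calling report() in the guest after booting with the parameters above
should show hpet as the current clocksource if the selection worked.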

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Wednesday, June 11, 2008 7:58 AM
> To: dan.magenheimer@oracle.com
> Cc: Keir Fraser; xen-devel; Ben Guthro; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> Dan Magenheimer wrote:
> 
> >>In EL5u1-32 however it looks like the fractions are accounted
> >>for.  Indeed the EL5u1-32 "lost tick handling" code resembles
> >>the Linux/ia64 code which is what I've always assumed was
> >>the "missed tick" model.  In this case, I think no policy
> >>is necessary and the measured skew should be identical to
> >>any physical hpet skew.  I'll have to test this hypothesis though.
> >>
> >>
> >
> >I've tested this hypothesis and it seems to hold true.
> >This means the existing (unpatched) hpet code works fine
> >on EL5-32bit (vcpus=1) when hpet is the clocksource,
> >even when the machine is overcommitted.  A second hypothesis
> >still needs to be tested that Dave's patch will not make this worse.
> >
> >
> Interesting, thanks for pointing this out and confirming.
> 
> >(Note that per previous discussion, my EL5u1-32bit guest
> >running on an Intel dual-core physical box chose tsc as
> >the best clocksource and I had to override it with
> >clock=hpet in the kernel command line.)
> >
> >
> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.
> 
> >
> >
> >>Yes, that makes sense and concurs with what I remember from
> >>the EL4u5-32 code.  If this is true, one would expect the
> >>default "no missed tick" policy to see time moving faster
> >>than an external source -- the first missed tick delivered
> >>after a long sleep would "catch up" and then the remainder
> >>would each add another tick.
> >>
> >>
> >
> >Indeed with the existing (unpatched) hpet code, time is
> >running faster on EL4u5-32 (vcpus=1, when overcommitted).
> >So Dave's patch is definitely needed here.
> >
> >
> It's good to get the verification of this.
> 
> thanks,
> Dave
> 
> >Will try 64-bit next.
> >
> >Dan
> >
> >
> >
> >>-----Original Message-----
> >>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> >>Sent: Monday, June 09, 2008 9:21 PM
> >>To: 'Dave Winchell'; 'Keir Fraser'
> >>Cc: 'xen-devel'; 'Ben Guthro'
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>
> >>
> >>
> >>>I'll tell you what I recall about this. Tomorrow I'll check the
> >>>guest code to verify. I think that Linux declares a full tick,
> >>>even if the interrupt is early. That's the problem.
> >>>
> >>>
> >>Yes, that makes sense and concurs with what I remember from
> >>the EL4u5-32 code.  If this is true, one would expect the
> >>default "no missed tick" policy to see time moving faster
> >>than an external source -- the first missed tick delivered
> >>after a long sleep would "catch up" and then the remainder
> >>would each add another tick.
> >>
> >>
> >>
> >>>On the other hand, if the interrupt is late it in effect declares
> >>>a tick plus fraction. If it just declared the fraction in the
> >>>first place, we could deliver the interrupts whenever we wanted.
> >>>
> >>>
> >>My read of the EL4u5-32 code is that the fraction is discarded
> >>and a new tick period commences at "now", so the fractions
> >>eventually accumulate as lost time.
> >>
> >>In EL5u1-32 however it looks like the fractions are accounted
> >>for.  Indeed the EL5u1-32 "lost tick handling" code resembles
> >>the Linux/ia64 code which is what I've always assumed was
> >>the "missed tick" model.  In this case, I think no policy
> >>is necessary and the measured skew should be identical to
> >>any physical hpet skew.  I'll have to test this hypothesis though.
> >>
> >>-----Original Message-----
> >>From: xen-devel-bounces@lists.xensource.com
> >>[mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of
> >>Dave Winchell
> >>Sent: Monday, June 09, 2008 5:35 PM
> >>To: dan.magenheimer@oracle.com; Keir Fraser
> >>Cc: Dave Winchell; xen-devel; Ben Guthro
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>
> >>
> >>
> >>>>The Linux policy is more subtle, but is required to go
> >>>>from .1% to .03%.
> >>>>
> >>>>
> >>>Thanks for the good documentation which I hadn't thoroughly
> >>>read until now.
> >>>I now understand that the essence of your
> >>>hpet missed ticks policy is to ensure that ticks are never
> >>>delivered too close together.  But I'm trying to understand
> >>>WHY your patch works, in other words, what problem it is
> >>>countering.
> >>>
> >>>
> >>I'll tell  you what I recall about this. Tomorrow I'll check the
> >>guest code to verify. I think that Linux declares a full tick,
> >>even if the interrupt is early. That's the problem.
> >>On the other hand, if the interrupt is late it in effect declares
> >>a tick plus fraction. If it just declared the fraction in the
> >>first place,
> >>we could deliver the interrupts whenever we wanted.
> >>
> >>It's really not that different than the missed ticks policy in vpt.c
> >>except that there the period in vpt.c is based on start of interrupt
> >>and I have improved that with end-of interrupt as described
> >>in the patch note.
> >>
> >>I don't recall what prompted me to try end-of-interrupt,
> >>but I saw a significant improvement. I may have been running
> >>a monotonicity test at the same time to explain the lock
> >>contention mentioned in the write-up.
> >>
> >>
> >>
> >>> I care about this for more reasons than just
> >>>because it is interesting: (1) I'd like to feel confident that
> >>>it is fixing a bug rather than just a symptom of a bug;
> >>>and (2) I wonder how universally it is applicable.
> >>>
> >>>
> >>It's worked well with my small set of guests. You and our
> >>QA are going to tell us about the wider set. It doesn't
> >>matter if guest A handles interrupts closely spaced or not,
> >>just whether it handles them far apart. So it should be pretty
> >>universal with guests that really handle missed ticks.
> >>I think it's interesting that some 32bit Linux guests handle
> >>missed ticks for hpet.
> >>
> >>
> >>
> >>>I see from code examination in mark_offset_hpet() in
> >>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> >>>the correction for lost ticks is just plain wrong in
> >>>a virtual environment. (Suppose for example that a virtual
> >>>tick was delivered every 1.999*hpet_tick... I think
> >>>the clock would be off by 50%!)  Is this the bug that
> >>>is being "countered" by your policy?
> >>>
> >>>
> >>I haven't looked at that code, perhaps.
> >>I'll check it tomorrow.
> >>
> >>
> >>
> >>>However, the lost tick handling in RHEL5u1/kernel/timer.c
> >>>(which I think is used also for hpet) is much better
> >>>so I am eager to find out if your policy works there
> >>>too.
> >>>If the hpet missed tick policy works for both, though,
> >>>I should be happy, though I wonder about upstream kernels
> >>>(e.g. the trend toward tickless).
> >>>
> >>>
> >>I wasn't aware of this trend. If it's robust, however, it should
> >>handle late interrupts ...
> >>
> >>
> >>
> >>>That said, I'd rather
> >>>see this get into Xen 3.3 and worry about upstream kernels
> >>>later :-)
> >>>
> >>>
> >>Regards,
> >>Dave
> >>
> >>
> >>
> >>-----Original Message-----
> >>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> >>Sent: Mon 6/9/2008 6:02 PM
> >>To: Dave Winchell; Keir Fraser
> >>Cc: Ben Guthro; xen-devel
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>
> >>
> >>>The Linux policy is more subtle, but is required to go
> >>>from .1% to .03%.
> >>>
> >>>
> >>Thanks for the good documentation which I hadn't thoroughly
> >>read until now.  I now understand that the essence of your
> >>hpet missed ticks policy is to ensure that ticks are never
> >>delivered too close together.  But I'm trying to understand
> >>WHY your patch works, in other words, what problem it is
> >>countering.  I care about this for more reasons than just
> >>because it is interesting: (1) I'd like to feel confident that
> >>it is fixing a bug rather than just a symptom of a bug;
> >>and (2) I wonder how universally it is applicable.
> >>
> >>I see from code examination in mark_offset_hpet() in
> >>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> >>the correction for lost ticks is just plain wrong in
> >>a virtual environment. (Suppose for example that a virtual
> >>tick was delivered every 1.999*hpet_tick... I think
> >>the clock would be off by 50%!)  Is this the bug that
> >>is being "countered" by your policy?
> >>
> >>However, the lost tick handling in RHEL5u1/kernel/timer.c
> >>(which I think is used also for hpet) is much better
> >>so I am eager to find out if your policy works there
> >>too.
> >>
> >>If the hpet missed tick policy works for both, though,
> >>I should be happy, though I wonder about upstream kernels
> >>(e.g. the trend toward tickless).  That said, I'd rather
> >>see this get into Xen 3.3 and worry about upstream kernels
> >>later :-)
> >>
> >>-----Original Message-----
> >>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> >>Sent: Sunday, June 08, 2008 2:32 PM
> >>To: dan.magenheimer@oracle.com; Keir Fraser
> >>Cc: Ben Guthro; xen-devel; Dave Winchell
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>
> >>Hi Dan,
> >>
> >>
> >>
> >>>While I am fully supportive of offering hardware hpet as an option
> >>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> >>>surprised by your preliminary results; the most obvious conclusion
> >>>is that Xen system time is losing time at the rate of 1000 PPM
> >>>though it's possible there's a bug somewhere else in the "time
> >>>stack".  Your Windows result is jaw-dropping and inexplicable,
> >>>though I have to admit ignorance of how Windows manages time.
> >>>
> >>>
> >>I think xen system time is fine. You have to add the interrupt
> >>delivery policies described in the write-up for the patch to get
> >>accurate timekeeping in the guest.
> >>
> >>The windows policy is obvious and results in a large improvement
> >>in accuracy. The Linux policy is more subtle, but is required to go
> >>from .1% to .03%.
> >>
> >>
> >>
> >>>I think with my recent patch and hpet=1 (essentially the same as
> >>>your emulated hpet), hvm guest time should track Xen system time.
> >>>I wonder if domain0 (which if I understand correctly is directly
> >>>using Xen system time) is also seeing an error of .1%?  Also
> >>>I wonder for the skew you are seeing (in both hvm guests and
> >>>domain0) is time moving too fast or too slow?
> >>>
> >>>
> >>I don't recall the direction. I can look it up in my notes at work
> >>tomorrow.
> >>
> >>
> >>
> >>>Although hwhpet=1 is a fine alternative in many cases, it may
> >>>be unavailable on some systems and may cause significant performance
> >>>issues on others.  So I think we will still need to track down
> >>>the poor accuracy when hwhpet=0.
> >>>
> >>>
> >>Our patch is accurate to < .03% using the physical hpet mode or
> >>the simulated mode.
> >>
> >>
> >>
> >>>And if for some reason
> >>>Xen system time can't be made accurate enough (< 0.05%), then
> >>>I think we should consider building Xen system time itself on
> >>>top of hardware hpet instead of TSC... at least when Xen discovers
> >>>a capable hpet.
> >>>
> >>>
> >>In our experience, Xen system time is accurate enough now.
> >>
> >>
> >>
> >>>One more thought... do you know the accuracy of the TSC crystals
> >>>on your test systems?  I posted a patch awhile ago that was
> >>>intended to test that, though I guess it was only testing skew
> >>>of different TSCs on the same system, not TSCs against an
> >>>external time source.
> >>>
> >>>
> >>I do not know the tsc accuracy.
> >>
> >>
> >>
> >>>Or maybe there's a computation error somewhere in the hvm hpet
> >>>scaling code?  Hmmm...
> >>>
> >>>
> >>Regards,
> >>Dave
> >>
> >>
> >>-----Original Message-----
> >>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> >>Sent: Fri 6/6/2008 4:29 PM
> >>To: Dave Winchell; Keir Fraser
> >>Cc: Ben Guthro; xen-devel
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>Dave --
> >>
> >>Thanks much for posting the preliminary results!
> >>
> >>While I am fully supportive of offering hardware hpet as an option
> >>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> >>surprised by your preliminary results; the most obvious conclusion
> >>is that Xen system time is losing time at the rate of 1000 PPM
> >>though it's possible there's a bug somewhere else in the "time
> >>stack".  Your Windows result is jaw-dropping and inexplicable,
> >>though I have to admit ignorance of how Windows manages time.
> >>
> >>
> >>I think with my recent patch and hpet=1 (essentially the same as
> >>your emulated hpet), hvm guest time should track Xen system time.
> >>I wonder if domain0 (which if I understand correctly is directly
> >>using Xen system time) is also seeing an error of .1%?  Also
> >>I wonder for the skew you are seeing (in both hvm guests and
> >>domain0) is time moving too fast or too slow?
> >>
> >>Although hwhpet=1 is a fine alternative in many cases, it may
> >>be unavailable on some systems and may cause significant performance
> >>issues on others.  So I think we will still need to track down
> >>the poor accuracy when hwhpet=0.  And if for some reason
> >>Xen system time can't be made accurate enough (< 0.05%), then
> >>I think we should consider building Xen system time itself on
> >>top of hardware hpet instead of TSC... at least when Xen discovers
> >>a capable hpet.
> >>
> >>One more thought... do you know the accuracy of the TSC crystals
> >>on your test systems?  I posted a patch awhile ago that was
> >>intended to test that, though I guess it was only testing skew
> >>of different TSCs on the same system, not TSCs against an
> >>external time source.
> >>
> >>Or maybe there's a computation error somewhere in the hvm hpet
> >>scaling code?  Hmmm...
> >>
> >>Thanks,
> >>Dan
> >>
> >>
> >>
> >>>-----Original Message-----
> >>>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> >>>Sent: Friday, June 06, 2008 1:33 PM
> >>>To: dan.magenheimer@oracle.com; Keir Fraser
> >>>Cc: Ben Guthro; xen-devel; Dave Winchell
> >>>Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>>
> >>>
> >>>Dan, Keir:
> >>>
> >>>Preliminary tests results indicate an error of .1% for Linux 64 bit
> >>>guests configured for hpet with xen-unstable as is. As we have
> >>>discussed many times, the ntp requirement is .05%.
> >>>Tests on the patch we just submitted for hpet have indicated errors
> >>>of .0012% on this platform under similar test conditions and .03% on
> >>>other platforms.
> >>>
> >>>Windows vista64 has an error of 11% using hpet with the
> >>>xen-unstable bits.
> >>>In an overnight test with our hpet patch, the Windows vista
> >>>error was .008%.
> >>>
> >>>The tests are with two or three guests on a physical node, all under
> >>>load, and with the ratio of vcpus to phys cpus > 1.
> >>>
> >>>I will continue to run tests over the next few days.
> >>>
> >>>thanks,
> >>>Dave
> >>>
> >>>
> >>>Dan Magenheimer wrote:
> >>>
> >>>
> >>>
> >>>>Hi Dave and Ben --
> >>>>
> >>>>When running tests on xen-unstable (without your patch), please ensure
> >>>>that hpet=1 is set in the hvm config and also I think that when hpet
> >>>>is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
> >>>>so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> >>>>all clock ticks must be delivered and so timer_mode should be 0).
> >>>>
> >>>>Per
> >>>>http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's
> >>>>my intent to clean this up, but I won't get to it until next week.
> >>>>
> >>>>Thanks,
> >>>>Dan
> >>>>
> >>>>    -----Original Message-----
> >>>>    From: xen-devel-bounces@lists.xensource.com
> >>>>    [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave
> >>>>    Winchell
> >>>>    Sent: Friday, June 06, 2008 4:46 AM
> >>>>    To: Keir Fraser; Ben Guthro; xen-devel
> >>>>    Cc: dan.magenheimer@oracle.com; Dave Winchell
> >>>>    Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>>>
> >>>>    Keir,
> >>>>
> >>>>    I think the changes are required. We'll run some tests today so
> >>>>    that we have some data to talk about.
> >>>>
> >>>>    -Dave
> >>>>
> >>>>
> >>>>    -----Original Message-----
> >>>>    From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> >>>>    Sent: Fri 6/6/2008 4:58 AM
> >>>>    To: Ben Guthro; xen-devel
> >>>>    Cc: dan.magenheimer@oracle.com
> >>>>    Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>>>
> >>>>    Are these patches needed now the timers are built on Xen system
> >>>>    time rather than host TSC? Dan has reported much better
> >>>>    time-keeping with his patch checked in, and it's for sure a lot
> >>>>    less invasive than this patchset.
> >>>>
> >>>>     -- Keir
> >>>>
> >>>>    On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >>>>
> >>>>    >
> >>>>    > 1. Introduction
> >>>>    >
> >>>>    > This patch improves the hpet based guest clock in terms of drift and
> >>>>    > monotonicity.
> >>>>    > Prior to this work the drift with hpet was greater than 2%, far above
> >>>>    > the .05% limit for ntp to synchronize. With this code, the drift
> >>>>    > ranges from .001% to .0033% depending on guest and physical platform.
> >>>>    >
> >>>>    > Using hpet allows guest operating systems to provide monotonic time to
> >>>>    > their applications. Time sources other than hpet are not monotonic
> >>>>    > because of their reliance on tsc, which is not synchronized across
> >>>>    > physical processors.
> >>>>    >
> >>>>    > Windows 2k864 and many Linux guests are supported with two policies,
> >>>>    > one for guests that handle missed clock interrupts and the other for
> >>>>    > guests that require the correct number of interrupts.
> >>>>    >
> >>>>    > Guests may use hpet for the timing source even if the physical
> >>>>    > platform has no visible hpet. Migration is supported between physical
> >>>>    > machines which differ in physical hpet visibility.
> >>>>    >
> >>>>    > Most of the changes are in hpet.c. Two general facilities are added to
> >>>>    > track interrupt progress. The ideas here and the facilities would be
> >>>>    > useful in vpt.c, for other time sources, though no attempt is made
> >>>>    > here to improve vpt.c.
> >>>>    >
> >>>>    > The following sections discuss hpet dependencies, interrupt delivery
> >>>>    > policies, live migration, test results, and relation to recent work
> >>>>    > with monotonic time.
> >>>>    >
> >>>>    >
> >>>>    > 2. Virtual Hpet dependencies
> >>>>    >
> >>>>    > The virtual hpet depends on the ability to read the physical or
> >>>>    > simulated (see discussion below) hpet.  For timekeeping, the virtual
> >>>>    > hpet also depends on two new interrupt notification facilities to
> >>>>    > implement its policies for interrupt delivery.
> >>>>    >
> >>>>    > 2.1. Two modes of low-level hpet main counter reads.
> >>>>    >
> >>>>    > In this implementation, the virtual hpet reads with
> >>>>    > read_64_main_counter(), exported by time.c, either the real physical
> >>>>    > hpet main counter register directly or a "simulated" hpet main counter.
> >>>>    >
> >>>>    > The simulated mode uses a monotonic version of get_s_time() (NOW()),
> >>>>    > where the last time value is returned whenever the current time value
> >>>>    > is less than the last time value. In simulated mode, since it is
> >>>>    > layered on s_time, the underlying hardware can be hpet or some other
> >>>>    > device. The frequency of the main counter in simulated mode is the
> >>>>    > same as the standard physical hpet frequency, allowing live migration
> >>>>    > between nodes that are configured differently.
> >>>>    >
> >>>>    > If the physical platform does not have an hpet device, or if xen is
> >>>>    > configured not to use the device, then the simulated method is used.
> >>>>    > If there is a physical hpet device, and xen has initialized it, then
> >>>>    > either simulated or physical mode can be used. This is governed by a
> >>>>    > boot time option, hpet-avoid. Setting this option to 1 gives the
> >>>>    > simulated mode and 0 the physical mode. The default is physical mode.
> >>>>    >
> >>>>    > A disadvantage of the physical mode is that it may take longer to read
> >>>>    > the device than in simulated mode. On some platforms the cost is about
> >>>>    > the same (less than 250 nsec) for physical and simulated modes, while
> >>>>    > on others physical cost is much higher than simulated.
> >>>>    > A disadvantage of the simulated mode is that it can return the same
> >>>>    > value for the counter in consecutive calls.
> >>>>    >
> >>>>    > 2.2. Interrupt notification facilities.
> >>>>    >
> >>>>    > Two interrupt notification facilities are introduced, one is
> >>>>    > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
> >>>>    >
> >>>>    > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
> >>>>    > vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along to
> >>>>    > vioapic_deliver() and this callback is called with a mask of the vcpus
> >>>>    > which will get the interrupt. This callback is made before any vcpus
> >>>>    > receive an interrupt.
> >>>>    >
> >>>>    > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
> >>>>    > particular vector that will be called when that vector is injected in
> >>>>    > [vmx,svm]_intr_assist() and also when the guest finishes handling the
> >>>>    > interrupt. Here finished is defined as the point when the guest
> >>>>    > re-enables interrupts or lowers the tpr value. EOI is not used as the
> >>>>    > end of interrupt as this is sometimes returned before the interrupt
> >>>>    > handler has done its work. A flag is passed to the handler indicating
> >>>>    > whether this is the injection point (post = 1) or the interrupt
> >>>>    > finished (post = 0) point. The need for the finished point callback is
> >>>>    > discussed in the missed ticks policy section.
> >>>>    >
> >>>>    > To prevent a possible early trigger of the finished callback,
> >>>>    > intr_en_notif logic has a two stage arm, the first at injection
> >>>>    > (hvm_intr_en_notif_arm()) and the second when interrupts are seen to
> >>>>    > be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
> >>>>    > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make
> >>>>    > the end of interrupt callback. hvm_intr_en_notif_arm() and
> >>>>    > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
> >>>>    >
> >>>>    > 3. Interrupt delivery policies
> >>>>    >
> >>>>    > The existing hpet interrupt delivery is preserved. This includes
> >>>>    > vcpu round robin delivery used by Linux and broadcast delivery used by
> >>>>    > Windows.
> >>>>    >
> >>>>    > There are two policies for interrupt delivery, one for Windows 2k8-64
> >>>>    > and the other for Linux. The Linux policy takes advantage of the
> >>>>    > (guest) Linux missed tick and offset calculations and does not attempt
> >>>>    > to deliver the right number of interrupts. The Windows policy delivers
> >>>>    > the correct number of interrupts, even if sometimes much closer to
> >>>>    > each other than the period. The policies are similar to those in
> >>>>    > vpt.c, though there are some important differences.
> >>>>    >
> >>>>    > Policies are selected with an HVMOP_set_param hypercall with index
> >>>>    > HVM_PARAM_TIMER_MODE.
> >>>>    > Two new values are added, HVM_HPET_guest_computes_missed_ticks and
> >>>>    > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that two new
> >>>>    > ones are added is that in some guests (32bit Linux) a no-missed policy
> >>>>    > is needed for clock sources other than hpet and a missed ticks policy
> >>>>    > for hpet. It was felt that there would be less confusion by simply
> >>>>    > introducing the two hpet policies.
> >>>>    >
> >>>>    > 3.1. The missed ticks policy
> >>>>    >
> >>>>    > The Linux clock interrupt handler for hpet calculates missed ticks and
> >>>>    > offset using the hpet main counter. The algorithm works well when the
> >>>>    > time since the last interrupt is greater than or equal to a period and
> >>>>    > poorly otherwise.
> >>>>    >
> >>>>    > The missed ticks policy ensures that no two clock interrupts are
> >>>>    > delivered to the guest at a time interval less than a period. A time
> >>>>    > stamp (hpet main counter value) is recorded (by a callback registered
> >>>>    > with hvm_register_intr_en_notif) when Linux finishes handling the
> >>>>    > clock interrupt. Then, ensuing interrupts are delivered to the vioapic
> >>>>    > only if the current main counter value is a period greater than when
> >>>>    > the last interrupt was handled.
> >>>>    >
> >>>>    > Tests showed a significant improvement in clock drift with end of
> >>>>    > interrupt time stamps versus beginning of interrupt[1]. It is believed
> >>>>    > that the reason for the improvement is that the clock interrupt
> >>>>    > handler goes for a spinlock and can be therefore delayed in its
> >>>>    > processing. Furthermore, the main counter is read by the guest under
> >>>>    > the lock. The net effect is that if we time stamp injection, we can
> >>>>    > get the difference in time between successive interrupt handler lock
> >>>>    > acquisitions to be less than the period.
> >>>>    >
> >>>>    > 3.2. The no-missed ticks policy
> >>>>    >
> >>>>    > Windows 2k864 keeps very poor time with the missed ticks policy. So
> >>>>    > the no-missed ticks policy was developed. In the no-missed ticks
> >>>>    > policy we deliver the correct number of interrupts, even if they are
> >>>>    > spaced less than a period apart (when catching up).
> >>>>    >
> >>>>    > Windows 2k864 uses a broadcast mode in the interrupt routing such that
> >>>>    > all vcpus get the clock interrupt. The best Windows drift performance
> >>>>    > was achieved when the policy code ensured that all the previous
> >>>>    > interrupts (on the various vcpus) had been injected before injecting
> >>>>    > the next interrupt to the vioapic.
> >>>>    >
> >>>>    > The policy code works as follows. It uses the hvm_isa_irq_assert_cb()
> >>>>    > to record the vcpus to be interrupted in h->hpet.pending_mask. Then,
> >>>>    > in the callback registered with hvm_register_intr_en_notif() at post=1
> >>>>    > time it clears the current vcpu in the pending_mask.
> >>>>    > When the pending_mask is clear it decrements hpet.intr_pending_nr and
> >>>>    > if intr_pending_nr is still non-zero posts another interrupt to the
> >>>>    > ioapic with hvm_isa_irq_assert_cb().
> >>>>    > Intr_pending_nr is incremented in
> >>>>    > hpet_route_decision_not_missed_ticks().
> >>>>    >
> >>>>    > The missed ticks policy intr_en_notif callback also uses the
> >>>>    > pending_mask method. So even though Linux does not broadcast its
> >>>>    > interrupts, the code could handle it if it did.
> >>>>    > In this case the end of interrupt time stamp is made when the
> >>>>    > pending_mask is clear.
> >>>>    >
> >>>>    > 4. Live Migration
> >>>>    >
> >>>>    > Live migration with hpet preserves the current offset of the guest
> >>>>    > clock with respect to ntp. This is accomplished by migrating all of
> >>>>    > the state in the h->hpet data structure in the usual way. The
> >>>>    > hp->mc_offset is recalculated on the receiving node so that the guest
> >>>>    > sees a continuous hpet main counter.
> >>>>    >
> >>>>    > Code has been added to xc_domain_save.c to send a small message after
> >>>>    > the domain context is sent. The contents of the message is the
> >>>>    > physical tsc timestamp, last_tsc, read just before the message is
> >>>>    > sent. When the last_tsc message is received in xc_domain_restore.c,
> >>>>    > another physical tsc timestamp, cur_tsc, is read. The two timestamps
> >>>>    > are loaded into the domain structure as last_tsc_sender and
> >>>>    > first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is
> >>>>    > called so that hpet_load has access to these time stamps. Hpet_load
> >>>>    > uses the timestamps to account for the time spent saving and loading
> >>>>    > the domain context. With this technique, the only neglected time is
> >>>>    > the time spent sending a small network message.
> >>>>    >
> >>>>    > 5. Test Results
> >>>>    >
> >>>>    > Some recent test results are:
> >>>>    >
> >>>>    > 5.1 Linux 4u664 and Windows 2k864 load test.
> >>>>    >       Duration: 70 hours.
> >>>>    >       Test date: 6/2/08
> >>>>    >       Loads: usex -b48 on Linux; burn-in on Windows
> >>>>    >       Guest vcpus: 8 for Linux; 2 for Windows
> >>>>    >       Hardware: 8 physical cpu AMD
> >>>>    >       Clock drift : Linux: .0012% Windows: .009%
> >>>>    >
> >>>>    > 5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
> >>>>    >       Duration: 23 hours.
> >>>>    >       Test date: 6/3/08
> >>>>    >       Loads: none
> >>>>    >       Guest vcpus: 8 for each Linux; 2 for Windows
> >>>>    >       Hardware: 4 physical cpu AMD
> >>>>    >       Clock drift : Linux: .033% Windows: .019%
> >>>>    >
> >>>>    > 6. Relation to recent work in xen-unstable
> >>>>    >
> >>>>    > There is a similarity between hvm_get_guest_time() in xen-unstable and
> >>>>    > read_64_main_counter() in this code. However, read_64_main_counter()
> >>>>    > is more tuned to the needs of hpet.c. It has no "set" operation, only
> >>>>    > the get. It isolates the mode, physical or simulated, in
> >>>>    > read_64_main_counter() itself. It uses no vcpu or domain state as it
> >>>>    > is a physical entity, in either mode. And it provides a real physical
> >>>>    > mode for every read for those applications that desire this.
> >>>>    >
> >>>>    > 7. Conclusion
> >>>>    >
> >>>>    > The virtual hpet is improved by this patch in terms of accuracy and
> >>>>    > monotonicity.
> >>>>    > Tests performed to date verify this and more testing is under way.
> >>>>    >
> >>>>    > 8. Future Work
> >>>>    >
> >>>>    > Testing with Windows Vista will be performed soon. The reason for
> >>>>    > accuracy variations on different platforms using the physical hpet
> >>>>    > device will be investigated. Additional overhead measurements on
> >>>>    > simulated vs physical hpet mode will be made.
> >>>>    >
> >>>>    > Footnotes:
> >>>>    >
> >>>>    > 1. I don't recall the accuracy improvement with end of interrupt
> >>>>    > stamping, but it was significant, perhaps better than two to one
> >>>>    > improvement. It would be a very simple matter to re-measure the
> >>>>    > improvement as the facility can call back at injection time as well.
> >>>>    >
> >>>>    >
> >>>>    > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
> >>>>    > <mailto:dwinchell@virtualiron.com>
> >>>>    > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
> >>>>    > <mailto:bguthro@virtualiron.com>
> >>>>    >
> >>>>    >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
       [not found] <4851940F.2050307@virtualiron.com>
@ 2008-06-12 22:05 ` Dan Magenheimer
  2008-06-12 22:49   ` Dave Winchell
  2008-06-13  0:38   ` Dave Winchell
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-12 22:05 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser, xen-devel; +Cc: Ben Guthro

[-- Attachment #1: Type: text/plain, Size: 2077 bytes --]

(Going back on list.)

OK, so looking at the updated patch, hpet_avoid=1 is actually
working, just reporting wrong, correct?

With el5u1-64-hvm and hpet_avoid=1 and timer_mode=4, skew
is under -0.04% and falling.  With hpet_avoid=0, it looks
about the same.  However both cases seem to start creeping
up again when I put load on, then fall again when I remove
the load -- even with sched-credit capping cpu usage.  Odd!
This implies to me that the activity in the other domains
IS affecting skew on the domain-under-test. (Keir, any
comments on the hypothesis attached below?)

Another theoretical oddity... if you are always delivering
timer ticks "late", fewer than the nominal 1000 ticks/sec
should be being received.  So then why is guest time actually
going faster than an external source?

(In my mind, going faster is much worse than going slower
because if ntpd or a human moves time backwards to compensate
for a clock going faster, "make" and other programs can
get very confused.)

Dan

> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Thursday, June 12, 2008 3:13 PM
> To: 'Dave Winchell'
> Subject: RE: xen hpet patch
> 
> 
> One more thought while waiting for compile and reboot:
> 
> Am I right that all of the policies are correcting for when
> a domain "A" is out-of-context?  There's nothing in any other
> domain "B" that can account for any timer loss/gain in domain
> "A".  The only reason we are running other domains is to ensure
> that domain "A" is sometimes out-of-context, and the more
> it is out-of-context, the more likely we will observe
> a problem, correct?
> 
> If this is true, it doesn't matter what workload is run
> in the non-A domains... as long as it is loading the
> CPU(s), thus ensuring that domain A is sometimes not
> scheduled on any CPU.
> 
> And if all this is true, we may not need to run other
> domains at all... running "xm sched-credit -d A -c 50"
> should result in domain A being out-of-context at least
> half the time.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-12 22:05 ` Dan Magenheimer
@ 2008-06-12 22:49   ` Dave Winchell
  2008-06-13  4:47     ` Dan Magenheimer
  2008-06-13  0:38   ` Dave Winchell
  1 sibling, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-12 22:49 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, xen-devel; +Cc: Dave Winchell, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 2968 bytes --]

Dan,

You shouldn't be getting higher than .05%.
I'd like to figure out what is wrong. I'm running the same guest
you are with heavy loads and the physical processors overcommitted
by 3:1. And I'm seeing .027% error on rh5u1-64 after an hour.

Can you type ^a^a^a at the console and then
type 'Z' a couple of times about 10 seconds apart and send
me the output? Do this when you have a domain
running that is keeping poor time.

You should take drift measurements over a period
of time that is at least 20 minutes, preferably longer.

Also, can you send me a tarball of your sources from
the xen directory?

thanks,
Dave

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-11  1:44                   ` Dan Magenheimer
  2008-06-11 13:58                     ` Dave Winchell
@ 2008-06-12 22:51                     ` Dan Magenheimer
  1 sibling, 0 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-12 22:51 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Dave Winchell, Keir Fraser
  Cc: xen-devel, Ben Guthro

> > In EL5u1-32 however it looks like the fractions are accounted
> > for.  Indeed the EL5u1-32 "lost tick handling" code resembles
> > the Linux/ia64 code which is what I've always assumed was
> > the "missed tick" model.  In this case, I think no policy
> > is necessary and the measured skew should be identical to
> > any physical hpet skew.  I'll have to test this hypothesis though.
> 
> I've tested this hypothesis and it seems to hold true.
> This means the existing (unpatched) hpet code works fine
> on EL5-32bit (vcpus=1) when hpet is the clocksource,
> even when the machine is overcommitted.  A second hypothesis
> still needs to be tested that Dave's patch will not make this worse.

OK, I can confirm that Dave's patch, as expected, does not
make this any worse.

The timer algorithm in 2.6.18 for x86 (i.e. RHEL5-32bit) is
definitely the most resilient to variations in tick delivery
for a monotonically-increasing timesource (i.e. hpet).
This algorithm is in arch-independent code but sadly x86_64
didn't use it as of 2.6.18.

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-12 22:05 ` Dan Magenheimer
  2008-06-12 22:49   ` Dave Winchell
@ 2008-06-13  0:38   ` Dave Winchell
  2008-06-13  2:21     ` Dan Magenheimer
  1 sibling, 1 reply; 51+ messages in thread
From: Dave Winchell @ 2008-06-13  0:38 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, xen-devel; +Cc: Dave Winchell, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 3044 bytes --]

Hi Dan,

> Another theoretical oddity... if you are always delivering
> timer ticks "late", fewer than the nominal 1000 ticks/sec
> should be being received.  So then why is guest time actually
> going faster than an external source?

Consider a guest that computes missed ticks but does not account
for fractional ticks when interrupts arrive closer together than
a period. If you send several interrupts farther apart than a
period and then send one closer than a period, the guest gains a
tick. So a guest can receive fewer than the expected number of
interrupts and still be gaining time.

A guest that expects the right number of interrupts (Windows)
runs slow when it is delivered fewer than expected.
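
The tick-gain effect described above can be reproduced with a toy model.
This is only a sketch under stated assumptions: the guest_ticks() function,
its accounting rule (credit the whole periods elapsed since the previous
interrupt, but never less than one tick, and throw away the fraction), and
the interrupt times are all made up for illustration. It is not the Linux
or Xen code.

```python
def guest_ticks(interrupt_times_us, period_us=1000):
    """Hypothetical guest accounting: on each interrupt, credit the whole
    periods elapsed since the previous interrupt, but always at least one
    tick. Fractions of a period are discarded."""
    ticks = 0
    last = 0
    for now in interrupt_times_us:
        elapsed = now - last
        ticks += max(1, elapsed // period_us)
        last = now
    return ticks


# Three interrupts two periods apart, then one only half a period later.
times = [2000, 4000, 6000, 6500]
credited = guest_ticks(times)
elapsed_periods = times[-1] / 1000

print(f"{len(times)} interrupts, {credited} ticks credited, "
      f"{elapsed_periods} periods elapsed")
```

With these made-up times the model guest sees 4 interrupts in 6.5 periods
yet credits 7 ticks, so it is ahead of real time. A guest that simply
counts interrupts would credit 4 ticks over the same interval and run slow.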

-Dave

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-13  0:38   ` Dave Winchell
@ 2008-06-13  2:21     ` Dan Magenheimer
  2008-06-13  3:12       ` Dave Winchell
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-13  2:21 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser, xen-devel; +Cc: Ben Guthro

Hi Dave --

I understand that ticks too close together cause
time to move faster, but I thought your policy ensured
that ticks were never delivered too close together.
So I was surprised to see that time was moving faster
rather than slower.

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-13  2:21     ` Dan Magenheimer
@ 2008-06-13  3:12       ` Dave Winchell
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-13  3:12 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, xen-devel; +Cc: Dave Winchell, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 4156 bytes --]

> I understand that ticks too close together causes
> time to move faster, but I thought your policy ensured
> that ticks were never delivered too close together.
> So I was surprised to see that time was moving faster
> rather than slower.

ok. Send me the debug info and I'll try to figure out
what's going on.

thanks,
Dave

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-12 22:49   ` Dave Winchell
@ 2008-06-13  4:47     ` Dan Magenheimer
  2008-06-13  7:33       ` Keir Fraser
  2008-06-13 12:08       ` Dave Winchell
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-13  4:47 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser, xen-devel; +Cc: Ben Guthro

Hi Dave --

Hmmm... in my earlier runs with rhel5u1-64, I had apic=0
(yes apic, not acpi).  Changing it to apic=1 gives excellent
results (< 0.01% even with overcommit).  Changing it back
to apic=0 has the same fairly bad results, 0.08% with no
overcommit and 0.16% (and climbing) with overcommit.
Note that this is all with vcpus=1.

How odd...

I vaguely recalled from some research a couple of months ago
that hpet is read MORE than once/tick on the boot processor.
I can't seem to find the table I compiled from that research,
but I did find this in an email I sent to you:

"You probably know this already but an n-way 2.6 Linux
kernel reads hpet (n+1)*1000 times/second.  Let's take
five 2-way guests as an example; that comes to 15000
hpet reads/second...."

I wondered what was different between apic=1 vs 0. Using:

# grep -E 'LOC|timer' /proc/interrupts; sleep 10; \
     grep -E 'LOC|timer' /proc/interrupts

you can see that there are always 1000 LOC/sec.  But
with apic=1 there are also about 350 IO-APIC-edge-timer/sec
and with apic=0 there are 1000 XT-PIC-timer/sec.

I suspect that the latter of these (XT-PIC-timer) is
messing up your policy and the former (edge-timer) is not.
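
The before/after comparison of /proc/interrupts can be turned into
per-second rates with a few lines of script. The sketch below is
illustrative only: timer_counts() and rates_per_sec() are hypothetical
helper names, and the two snapshots are invented sample text whose numbers
were chosen to mirror the apic=0 rates reported above.

```python
import re


def timer_counts(snapshot):
    """Sum the per-CPU counts on each /proc/interrupts line that mentions
    LOC or timer. Returns {line name: total count}."""
    counts = {}
    for line in snapshot.splitlines():
        if not re.search(r"LOC|timer", line):
            continue
        label, _, rest = line.partition(":")
        fields = rest.split()
        total = 0
        # Leading numeric fields are the per-CPU counters.
        while fields and fields[0].isdigit():
            total += int(fields.pop(0))
        name = " ".join([label.strip()] + fields)
        counts[name] = total
    return counts


def rates_per_sec(before, after, seconds):
    """Interrupts per second between two snapshots taken `seconds` apart."""
    b, a = timer_counts(before), timer_counts(after)
    return {name: (a[name] - b[name]) / seconds for name in a if name in b}


# Invented sample snapshots, 10 seconds apart, single vcpu.
BEFORE = """\
           CPU0
  0:     500000          XT-PIC  timer
LOC:     500100
"""
AFTER = """\
           CPU0
  0:     510000          XT-PIC  timer
LOC:     510100
"""

print(rates_per_sec(BEFORE, AFTER, 10))
```

On a live guest the two snapshots would come from reading
/proc/interrupts twice with a sleep in between, instead of the sample
strings used here.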

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-13  4:47     ` Dan Magenheimer
@ 2008-06-13  7:33       ` Keir Fraser
  2008-06-13 15:39         ` Dan Magenheimer
  2008-06-13 12:08       ` Dave Winchell
  1 sibling, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2008-06-13  7:33 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com, Dave Winchell, xen-devel; +Cc: Ben Guthro

On 13/6/08 05:47, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I wondered what was different between apic=1 vs 0. Using:
> 
> # grep -E 'LOC|timer' /proc/interrupts; sleep 10; \
>      grep -E 'LOC|timer' /proc/interrupts
> 
> you can see that there are always 1000 LOC/sec.  But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.
> 
> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.

I think apic=0 is not a particularly useful configuration though, right?

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-13  4:47     ` Dan Magenheimer
  2008-06-13  7:33       ` Keir Fraser
@ 2008-06-13 12:08       ` Dave Winchell
  2008-06-13 14:58         ` Dave Winchell
  2008-06-13 15:52         ` Dan Magenheimer
  1 sibling, 2 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-13 12:08 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, xen-devel; +Cc: Dave Winchell, Ben Guthro

[-- Attachment #1.1: Type: text/plain, Size: 5741 bytes --]

Hi Dan,

I'm glad you're able to reproduce my results.
Are you still seeing the boot-time hang?
Is this the reason for vcpus=1?

> you can see that there are always 1000 LOC/sec.  But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.

> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.

Thanks for this data. Your analysis is correct, I think.
I wrote the interrupt routing and callback code for the
IOAPIC edge-triggered interrupts. The PIC path does not
have the callbacks. With no callbacks, the end-of-interrupt
time stamp stays zero, so to the routing code in hpet.c it
always looks like it's been longer than a period since the
last interrupt. Thus, you get an interrupt on each timeout,
or 1000 interrupts/sec. 350 is a typical rate when the
algorithm for missed ticks is doing its thing. I'll put this
on the bug list - unless no one cares about apic=0.
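
The effect of the missing callback can be sketched with a small simulation.
Everything here is an assumption made for illustration: the
count_deliveries() name, the rule "deliver only if at least one period has
passed since the last end-of-interrupt stamp", and the fixed 600 us EOI
latency. It is not the code in hpet.c.

```python
PERIOD_US = 1000  # 1000 Hz timer, as in the measurements in this thread


def count_deliveries(timeouts, eoi_latency_us, has_eoi_callback):
    """Count interrupts delivered over `timeouts` timer expirations.

    Assumed policy: deliver only when at least one period has elapsed
    since the last end-of-interrupt (EOI) time stamp. Without a callback
    the stamp is never updated and stays at zero."""
    last_eoi = 0
    delivered = 0
    for k in range(1, timeouts + 1):
        now = k * PERIOD_US
        if now - last_eoi >= PERIOD_US:
            delivered += 1
            if has_eoi_callback:
                # The guest acknowledges some time after delivery; the
                # latency is shorter than a period, so the stamp is in
                # place before the next timeout is examined.
                last_eoi = now + eoi_latency_us
    return delivered


print("with EOI callback:   ", count_deliveries(1000, 600, True))
print("without EOI callback:", count_deliveries(1000, 600, False))
```

In this toy model the path with callbacks suppresses every timeout that
falls within a period of the previous EOI, while the path without
callbacks delivers on all 1000 timeouts. The exact suppressed fraction
depends entirely on the assumed latency, so the model is not meant to
reproduce the 350/sec figure.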

thanks,
Dave

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Fri 6/13/2008 12:47 AM
To: Dave Winchell; Keir Fraser; xen-devel
Cc: Ben Guthro
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dave --

Hmmm... in my earlier runs with rhel5u1-64, I had apic=0
(yes apic, not acpi).  Changing it to apic=1 gives excellent
results (< 0.01% even with overcommit).  Changing it back
to apic=0 has the same fairly bad results, 0.08% with no
overcommit and 0.16% (and climbing) with overcommit.
Note that this is all with vcpus=1.

How odd...

I vaguely recalled from some research a couple of months ago
that hpet is read MORE than once/tick on the boot processor.
I can't seem to find the table I compiled from that research,
but I did find this in an email I sent to you:

"You probably know this already but an n-way 2.6 Linux
kernel reads hpet (n+1)*1000 times/second.  Let's take
five 2-way guests as an example; that comes to 15000
hpet reads/second...."

I wondered what was different between apic=1 vs 0. Using:

# cat /proc/interrupts | grep 'LOC|timer'; sleep 10; \
     cat /proc/interrupts | grep 'LOC|timer'

you can see that there are always 1000 LOC/sec.  But
with apic=1 there are also about 350 IO-APIC-edge-timer/sec
and with apic=0 there are 1000 XT-PIC-timer/sec.

I suspect that the latter of these (XT-PIC-timer) is
messing up your policy and the former (edge-timer) is not.

Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Thursday, June 12, 2008 4:49 PM
To: dan.magenheimer@oracle.com; Keir Fraser; xen-devel
Cc: Ben Guthro; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan,

You shouldn't be getting higher than .05%.
I'd like to figure out what is wrong. I'm running the same guest
you are with heavy loads and the physical processors overcommitted
by 3:1. And I'm seeing .027% error on rh5u1-64 after an hour.

Can you type ^a^a^a at the console and then
type 'Z' a couple of times about 10 seconds apart and send
me the output? Do this when you have a domain
running that is keeping poor time.

You should take drift measurements over a period
of time that is at least 20 minutes, preferably longer.

Also, can you send me a tarball of your sources from
the xen directory?

thanks,
Dave

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Thu 6/12/2008 6:05 PM
To: Dave Winchell; Keir Fraser; xen-devel
Cc: Ben Guthro
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

(Going back on list.)

OK, so looking at the updated patch, hpet_avoid=1 is actually
working, just reporting wrong, correct?

With el5u1-64-hvm and hpet_avoid=1 and timer_mode=4, skew
is under -0.04% and falling.  With hpet_avoid=0, it looks
about the same.  However both cases seem to start creeping
up again when I put load on, then fall again when I remove
the load -- even with sched-credit capping cpu usage.  Odd!
This implies to me that the activity in the other domains
IS affecting skew on the domain-under-test. (Keir, any
comments on the hypothesis attached below?)

Another theoretical oddity... if you are always delivering
timer ticks "late", fewer than the nominal 1000 ticks/sec
should be being received.  So then why is guest time actually
going faster than an external source?

(In my mind, going faster is much worse than going slower
because if ntpd or a human moves time backwards to compensate
for a clock going faster, "make" and other programs can
get very confused.)

Dan

> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Thursday, June 12, 2008 3:13 PM
> To: 'Dave Winchell'
> Subject: RE: xen hpet patch
>
>
> One more thought while waiting for compile and reboot:
>
> Am I right that all of the policies are correcting for when
> a domain "A" is out-of-context?  There's nothing in any other
> domain "B" that can account for any timer loss/gain in domain
> "A".  The only reason we are running other domains is to ensure
> that domain "A" is sometimes out-of-context, and the more
> it is out-of-context, the more likely we will observe
> a problem, correct?
>
> If this is true, it doesn't matter what workload is run
> in the non-A domains... as long as it is loading the
> CPU(s), thus ensuring that domain A is sometimes not
> scheduled on any CPU.
>
> And if all this is true, we may not need to run other
> domains at all... running "xm sched-credit -d A -c 50"
> should result in domain A being out-of-context at least
> half the time.


* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-13 12:08       ` Dave Winchell
@ 2008-06-13 14:58         ` Dave Winchell
  2008-06-13 15:52         ` Dan Magenheimer
  1 sibling, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-13 14:58 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser; +Cc: Dave Winchell, xen-devel, Ben Guthro

Dan, Keir,

In an overnight (17.5 hrs) test with three guests,
8 vcpus each on 8 physical cpus all under
usex b48 loads I noted the following errors:

rh4u664 -.72 sec (.0012%)
rhas5u164 -10.2 sec (.016%)
sles10u164 -9.3 sec (.015%)
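
As a rough cross-check of the percentages above, the sketch below divides each reported error by the elapsed time and compares the result with the .05% ntp limit quoted in the patch introduction. The function and constant names are illustrative assumptions, and the 17.5 hour duration is taken at face value, so the last digit may differ slightly from the figures reported above.

```python
# Illustrative names only; the error values and the 17.5 h duration are the
# ones reported in this message, the .05% limit is from the patch introduction.
NTP_LIMIT_PERCENT = 0.05
ELAPSED_S = 17.5 * 3600  # overnight test length in seconds


def drift_percent(error_s: float, elapsed_s: float) -> float:
    """Return clock error as a signed percentage of elapsed time."""
    return 100.0 * error_s / elapsed_s


reported_errors_s = {
    "rh4u664": -0.72,
    "rhas5u164": -10.2,
    "sles10u164": -9.3,
}

for guest, error_s in reported_errors_s.items():
    pct = drift_percent(error_s, ELAPSED_S)
    ok = abs(pct) < NTP_LIMIT_PERCENT
    print(f"{guest}: {pct:+.4f}% (within ntp limit: {ok})")
```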

The number for rh4u664 is what I am used to seeing on this
platform. The other ones are 10 times worse, but still good enough
for ntp. The reason they are worse is that the guest clock code
for hpet in rhas5u164 looks at the cmp register to calculate
interrupt delay.
I mentioned before on this list that one of the beauties of hpet
was the fine hpet code in the guest (rh4u664), which did not use the
delay computation, which in my mind is unnecessary and adds error.
Well, in rhas5u164, and I assume in sles10u164, delay is back in,
and so is the associated error.

The cmp register is also the reason for the hesitations on boot.
I'll have more to say on this later.

thanks,
Dave



Dave Winchell wrote:

> Hi Dan,
>
> I'm glad you're able to reproduce my results.
> Are you still seeing the boot time hang up?
> Is this the reason for vcpus=1?
>
> > you can see that there are always 1000 LOC/sec.  But
> > with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> > and with apic=0 there are 1000 XT-PIC-timer/sec.
>
> > I suspect that the latter of these (XT-PIC-timer) is
> > messing up your policy and the former (edge-timer) is not.
>
> Thanks for this data. Your analysis is correct, I think.
> I wrote the interrupt routing and callback code for the
> IOAPIC edge triggered interrupts. The PIC path does not
> have the callbacks. With no callbacks, it always looks to
> the routing code in hpet.c like it's been longer than a period
> since the last one, as the end-of-interrupt time stamp is zero.
> Thus, you get an interrupt each timeout, or 1000 interrupts/sec.
> 350 is a typical amount when the algorithm for missed ticks is
> doing its thing. I'll put this on the bug list - unless no one
> cares about apic=0.
>
> thanks,
> Dave
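
The effect described in the quoted explanation can be illustrated with a toy model. This is a sketch under stated assumptions, not the logic in hpet.c: the delivery rule (inject only when at least one period has passed since the last end-of-interrupt time stamp), the function name, and the 1.5 ms EOI latency are all hypothetical. It only shows why a time stamp stuck at zero yields one interrupt per timeout, while a working callback lets the routing code skip timeouts.

```python
PERIOD_NS = 1_000_000  # 1 ms period, i.e. a nominal 1000 timeouts/sec


def delivered_per_second(eoi_callback_present: bool,
                         eoi_latency_ns: int = 1_500_000) -> int:
    """Count interrupts injected during one simulated second.

    Toy rule (an assumption, not the real code): on each timeout, inject
    only if at least one period has elapsed since the recorded
    end-of-interrupt (EOI) time stamp.
    """
    last_eoi_ns = 0
    delivered = 0
    for tick in range(1, 1001):
        now_ns = tick * PERIOD_NS
        if now_ns - last_eoi_ns >= PERIOD_NS:
            delivered += 1
            if eoi_callback_present:
                # IOAPIC-like path: the callback records when the guest
                # finished handling the interrupt (hypothetical latency).
                last_eoi_ns = now_ns + eoi_latency_ns
            # PIC-like path: no callback, so the stamp stays at zero.
    return delivered


pic_rate = delivered_per_second(eoi_callback_present=False)
ioapic_rate = delivered_per_second(eoi_callback_present=True)
print(pic_rate, ioapic_rate)
```

The exact reduced rate depends entirely on the assumed EOI latency, so the model should not be read as reproducing the observed 350/sec.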
>
>
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Fri 6/13/2008 12:47 AM
> To: Dave Winchell; Keir Fraser; xen-devel
> Cc: Ben Guthro
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Hi Dave --
>
> Hmmm... in my earlier runs with rhel5u1-64, I had apic=0
> (yes apic, not acpi).  Changing it to apic=1 gives excellent
> results (< 0.01% even with overcommit).  Changing it back
> to apic=0 has the same fairly bad results, 0.08% with no
> overcommit and 0.16% (and climbing) with overcommit.
> Note that this is all with vcpus=1.
>
> How odd...
>
> I vaguely recalled from some research a couple of months ago
> that hpet is read MORE than once/tick on the boot processor.
> I can't seem to find the table I compiled from that research,
> but I did find this in an email I sent to you:
>
> "You probably know this already but an n-way 2.6 Linux
> kernel reads hpet (n+1)*1000 times/second.  Let's take
> five 2-way guests as an example; that comes to 15000
> hpet reads/second...."
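
The arithmetic in the quoted note can be written out as a small sketch. The function names are hypothetical, and the 1000 Hz tick rate is the one assumed in the quoted text.

```python
def hpet_reads_per_second(vcpus: int, tick_hz: int = 1000) -> int:
    """Reads/sec for one n-way guest, per the quoted (n+1)*1000 rule."""
    return (vcpus + 1) * tick_hz


def total_hpet_reads_per_second(guest_vcpus: list[int], tick_hz: int = 1000) -> int:
    """Sum the per-guest read rates across all guests on a host."""
    return sum(hpet_reads_per_second(n, tick_hz) for n in guest_vcpus)


# Five 2-way guests, as in the quoted example.
five_two_way = total_hpet_reads_per_second([2] * 5)
print(five_two_way)
```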
>
> I wondered what was different between apic=1 vs 0. Using:
>
> # cat /proc/interrupts | grep -E 'LOC|timer'; sleep 10; \
>      cat /proc/interrupts | grep -E 'LOC|timer'
>
> you can see that there are always 1000 LOC/sec.  But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.
>
> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.
>
> Dan
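
The two-snapshot measurement quoted above can also be turned into per-second rates with a short script. This is a minimal sketch: the function names and the sample snapshots are made up for illustration (the counts were chosen to give the rates mentioned in the thread), and the parser assumes the usual /proc/interrupts layout of a label, per-CPU counters, then a description.

```python
import re


def timer_counts(snapshot: str) -> dict[str, int]:
    """Sum per-CPU counters for LOC and timer lines of a /proc/interrupts dump."""
    counts = {}
    for line in snapshot.splitlines():
        if not re.search(r"LOC|timer", line):
            continue
        label, _, rest = line.partition(":")
        fields = rest.split()
        total = 0
        n = 0
        # Leading all-digit fields are the per-CPU counters.
        while n < len(fields) and fields[n].isdigit():
            total += int(fields[n])
            n += 1
        # Key on the label plus whatever description follows the counters.
        key = " ".join([label.strip()] + fields[n:])
        counts[key] = total
    return counts


def rates_per_second(before: str, after: str, interval_s: float) -> dict[str, float]:
    """Per-second interrupt rates between two snapshots taken interval_s apart."""
    b, a = timer_counts(before), timer_counts(after)
    return {key: (a[key] - b[key]) / interval_s for key in a if key in b}


# Hypothetical snapshots taken 10 seconds apart on a 1-vcpu guest.
SAMPLE_BEFORE = """           CPU0
  0:       1000    IO-APIC-edge  timer
  1:          8    IO-APIC-edge  i8042
LOC:      20000
"""
SAMPLE_AFTER = """           CPU0
  0:       4500    IO-APIC-edge  timer
  1:          9    IO-APIC-edge  i8042
LOC:      30000
"""

sample_rates = rates_per_second(SAMPLE_BEFORE, SAMPLE_AFTER, 10)
print(sample_rates)
```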


* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-13  7:33       ` Keir Fraser
@ 2008-06-13 15:39         ` Dan Magenheimer
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-13 15:39 UTC (permalink / raw)
  To: Keir Fraser, Dave Winchell, xen-devel; +Cc: Ben Guthro

> I think apic=0 is not a particularly useful configuration 
> though, right?

We've seen it proposed sometimes as a workaround for
a boot-time problem, but I agree it's not useful enough
to warrant concern or stand in the way of Dave's patch.

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Friday, June 13, 2008 1:34 AM
> To: dan.magenheimer@oracle.com; Dave Winchell; xen-devel
> Cc: Ben Guthro
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> On 13/6/08 05:47, "Dan Magenheimer" 
> <dan.magenheimer@oracle.com> wrote:
> 
> > I wondered what was different between apic=1 vs 0. Using:
> >
> > # cat /proc/interrupts | grep -E 'LOC|timer'; sleep 10; \
> >      cat /proc/interrupts | grep -E 'LOC|timer'
> >
> > you can see that there are always 1000 LOC/sec.  But
> > with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> > and with apic=0 there are 1000 XT-PIC-timer/sec.
> >
> > I suspect that the latter of these (XT-PIC-timer) is
> > messing up your policy and the former (edge-timer) is not.
> 
> I think apic=0 is not a particularly useful configuration 
> though, right?
> 
>  -- Keir
> 
> 
>


* RE: [PATCH 0/2] Improve hpet accuracy
  2008-06-13 12:08       ` Dave Winchell
  2008-06-13 14:58         ` Dave Winchell
@ 2008-06-13 15:52         ` Dan Magenheimer
  2008-06-13 16:53           ` Dave Winchell
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2008-06-13 15:52 UTC (permalink / raw)
  To: Dave Winchell, Keir Fraser, xen-devel; +Cc: Ben Guthro

Kudos, Dave, for your excellent work!

Keir, I've completed enough testing to agree that
Dave's hpet policy is a huge improvement over the
existing hpet code and a major improvement over
the pit-based policies/timekeeping.  I strongly
recommend that, once Dave's soon-to-be-revised
patch is in, we turn on hpet by default for all
hvm guests. I'd also suggest that the default
timer_mode (at least when hpet=1) should be
Dave's guest_computes_missed_ticks policy.
(Dave, could you include this in your revised
patch?  Or if you want me to, let me know.)

A couple of remaining points...

> I'm glad you're able to reproduce my results.
> Are you still seeing the boot time hang up?
> Is this the reason for vcpus=1?

No, I was just trying to be methodical in my testing,
covering various cases.  I haven't seen the boot-time
hang for a while.

> I'll put this on the bug list - unless no one
> cares about apic=0.

It probably should be "on the bug list" but very low
priority compared with getting the patch cleaned up
(per Keir's requirements) in time for the 3.3 release.

Dan



* Re: [PATCH 0/2] Improve hpet accuracy
  2008-06-13 15:52         ` Dan Magenheimer
@ 2008-06-13 16:53           ` Dave Winchell
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Winchell @ 2008-06-13 16:53 UTC (permalink / raw)
  To: dan.magenheimer@oracle.com
  Cc: Dave Winchell, xen-devel, Keir Fraser, Ben Guthro

Dan Magenheimer wrote:

>Kudos, Dave, for your excellent work!

Thanks, Dan.

>Keir, I've completed enough testing to agree that
>Dave's hpet policy is a huge improvement over the
>existing hpet code and a major improvement over
>the pit-based policies/timekeeping.  I strongly
>recommend that, once Dave's soon-to-be-revised
>patch is in, we turn on hpet by default for all
>hvm guests. I'd also suggest that the default
>timer_mode (at least when hpet=1) should be
>Dave's guest_computes_missed_ticks policy.
>(Dave, could you include this in your revised
>patch?  Or if you want me to, let me know.)

Sure, I can do it.

>A couple of remaining points...
>
>>I'm glad you're able to reproduce my results.
>>Are you still seeing the boot time hang up?
>>Is this the reason for vcpus=1?
>
>No, I was just trying to be methodical in my testing,
>covering various cases.  I haven't seen the boot-time
>hang for a while.

ok. We still see it here so I'm working on a fix/workaround.

>>I'll put this on the bug list - unless no one
>>cares about apic=0.
>
>It probably should be "on the bug list" but very low
>priority compared with getting the patch cleaned up
>(per Keir's requirements) in time for the 3.3 release.

ok.

Dan, thanks very much for the testing work. I know it's not
easy and you still came up with the results very quickly.

-Dave

>Dan


end of thread, other threads:[~2008-06-13 16:53 UTC | newest]

Thread overview: 51+ messages
2008-06-05 14:59 [PATCH 0/2] Improve hpet accuracy Ben Guthro
2008-06-06  8:58 ` Keir Fraser
2008-06-06 10:45   ` Dave Winchell
2008-06-06 15:53     ` Dan Magenheimer
2008-06-06 17:54       ` Dave Winchell
2008-06-06 19:33       ` Dave Winchell
2008-06-06 20:29         ` Dan Magenheimer
2008-06-06 22:34           ` Keir Fraser
2008-06-07 21:20             ` Dave Winchell
2008-06-09 21:07               ` Dan Magenheimer
2008-06-09 21:44                 ` Dave Winchell
2008-06-08 20:32           ` Dave Winchell
2008-06-08 21:10             ` Dan Magenheimer
2008-06-08 23:26               ` Dave Winchell
2008-06-09  7:36                 ` Keir Fraser
2008-06-09 11:13                   ` Dave Winchell
2008-06-09 12:03                     ` Keir Fraser
2008-06-09 12:10                       ` Keir Fraser
2008-06-09 13:08                         ` Dave Winchell
2008-06-09 20:48                   ` Dan Magenheimer
2008-06-09 21:18                     ` Keir Fraser
2008-06-09 21:46                       ` Dan Magenheimer
2008-06-08 21:18             ` Dan Magenheimer
2008-06-09 22:02             ` Dan Magenheimer
2008-06-09 23:34               ` Dave Winchell
2008-06-10  3:21                 ` Dan Magenheimer
2008-06-11  1:44                   ` Dan Magenheimer
2008-06-11 13:58                     ` Dave Winchell
2008-06-11 16:47                       ` Dan Magenheimer
2008-06-12 22:51                     ` Dan Magenheimer
2008-06-10  7:52                 ` Keir Fraser
2008-06-10 11:49                   ` Dave Winchell
2008-06-10 12:34                     ` Dave Winchell
2008-06-10 12:42                       ` Keir Fraser
2008-06-10 17:13                         ` Dave Winchell
2008-06-11  8:30                           ` Keir Fraser
2008-06-11 11:38                             ` Dave Winchell
2008-06-06 15:35 ` Steven Hand
2008-06-06 17:34   ` Dave Winchell
     [not found] <4851940F.2050307@virtualiron.com>
2008-06-12 22:05 ` Dan Magenheimer
2008-06-12 22:49   ` Dave Winchell
2008-06-13  4:47     ` Dan Magenheimer
2008-06-13  7:33       ` Keir Fraser
2008-06-13 15:39         ` Dan Magenheimer
2008-06-13 12:08       ` Dave Winchell
2008-06-13 14:58         ` Dave Winchell
2008-06-13 15:52         ` Dan Magenheimer
2008-06-13 16:53           ` Dave Winchell
2008-06-13  0:38   ` Dave Winchell
2008-06-13  2:21     ` Dan Magenheimer
2008-06-13  3:12       ` Dave Winchell
