From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Winchell <dwinchell@virtualiron.com>
Subject: Re: [PATCH 0/2] Improve hpet accuracy
Date: Fri, 06 Jun 2008 15:33:09 -0400
Message-ID: <484990F5.4040300@virtualiron.com>
References: <20080606095323843.00000002776@djm-pc>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <20080606095323843.00000002776@djm-pc>
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "dan.magenheimer@oracle.com" <dan.magenheimer@oracle.com>, Keir Fraser <keir.fraser@eu.citrix.com>
Cc: Dave Winchell <dwinchell@virtualiron.com>, xen-devel <xen-devel@lists.xensource.com>, Ben Guthro <bguthro@virtualiron.com>
List-Id: xen-devel@lists.xenproject.org

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64 bit
guests configured for hpet with xen-unstable as is. As we have discussed
many times, the ntp requirement is .05%. Tests on the patch we just
submitted for hpet have indicated errors of .0012% on this platform under
similar test conditions and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits.
In an overnight test with our hpet patch, the Windows vista error was .008%.
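
To put the percentages above side by side with the ntp requirement, here is a
small sketch that converts a drift in percent to parts per million and to
seconds per day. The function names are made up for illustration; the only
inputs taken from this thread are the .05% limit and the measured errors.

```c
#include <assert.h>
#include <stdbool.h>

/* ntp limit quoted in this thread: .05%, i.e. 500 parts per million. */
#define NTP_LIMIT_PPM 500.0

/* Convert a drift expressed in percent to parts per million. */
static double drift_percent_to_ppm(double percent)
{
    assert(percent >= 0.0);
    return percent * 10000.0;
}

/* Seconds gained or lost per day (86400 s) at the given drift. */
static double drift_seconds_per_day(double percent)
{
    return percent / 100.0 * 86400.0;
}

/* True if the drift is within the range ntp can correct. */
static bool ntp_can_sync(double percent)
{
    return drift_percent_to_ppm(percent) <= NTP_LIMIT_PPM;
}
```

With these helpers, .1% is about 86 seconds per day and is outside the ntp
range, while .0012%, .03% and .008% are inside it.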

The tests are with two or three guests on a physical node, all under
load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave


Dan Magenheimer wrote:

> Hi Dave and Ben --
>  
> When running tests on xen-unstable (without your patch), please ensure 
> that hpet=1 is set in the hvm config and also I think that when hpet 
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks 
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32, 
> all clock ticks must be delivered and so timer_mode should be 0).
>  
> Per 
> http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's 
> my intent to clean this up, but I won't get to it until next week.
>  
> Thanks,
> Dan
>
>     -----Original Message-----
>     *From:* xen-devel-bounces@lists.xensource.com
>     [mailto:xen-devel-bounces@lists.xensource.com]*On Behalf Of *Dave
>     Winchell
>     *Sent:* Friday, June 06, 2008 4:46 AM
>     *To:* Keir Fraser; Ben Guthro; xen-devel
>     *Cc:* dan.magenheimer@oracle.com; Dave Winchell
>     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>     Keir,
>
>     I think the changes are required. We'll run some tests today so
>     that we have some data to talk about.
>
>     -Dave
>
>
>     -----Original Message-----
>     From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
>     Sent: Fri 6/6/2008 4:58 AM
>     To: Ben Guthro; xen-devel
>     Cc: dan.magenheimer@oracle.com
>     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>     Are these patches needed now the timers are built on Xen system time
>     rather than host TSC? Dan has reported much better time-keeping with
>     his patch checked in, and it's for sure a lot less invasive than this
>     patchset.
>
>
>      -- Keir
>
>     On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
>
>     >
>     > 1. Introduction
>     >
>     > This patch improves the hpet based guest clock in terms of drift and
>     > monotonicity. Prior to this work the drift with hpet was greater than
>     > 2%, far above the .05% limit for ntp to synchronize. With this code,
>     > the drift ranges from .001% to .0033% depending on guest and physical
>     > platform.
>     >
>     > Using hpet allows guest operating systems to provide monotonic time
>     > to their applications. Time sources other than hpet are not monotonic
>     > because of their reliance on tsc, which is not synchronized across
>     > physical processors.
>     >
>     > Windows 2k8-64 and many Linux guests are supported with two policies,
>     > one for guests that handle missed clock interrupts and the other for
>     > guests that require the correct number of interrupts.
>     >
>     > Guests may use hpet for the timing source even if the physical
>     > platform has no visible hpet. Migration is supported between physical
>     > machines which differ in physical hpet visibility.
>     >
>     > Most of the changes are in hpet.c. Two general facilities are added
>     > to track interrupt progress. The ideas here and the facilities would
>     > be useful in vpt.c, for other time sources, though no attempt is made
>     > here to improve vpt.c.
>     >
>     > The following sections discuss hpet dependencies, interrupt delivery
>     > policies, live migration, test results, and relation to recent work
>     > with monotonic time.
>     >
>     >
>     > 2. Virtual Hpet dependencies
>     >
>     > The virtual hpet depends on the ability to read the physical or
>     > simulated (see discussion below) hpet.  For timekeeping, the virtual
>     > hpet also depends on two new interrupt notification facilities to
>     > implement its policies for interrupt delivery.
>     >
>     > 2.1. Two modes of low-level hpet main counter reads.
>     >
>     > In this implementation, the virtual hpet reads with
>     > read_64_main_counter(), exported by time.c, either the real physical
>     > hpet main counter register directly or a "simulated" hpet main
>     > counter.
>     >
>     > The simulated mode uses a monotonic version of get_s_time() (NOW()),
>     > where the last time value is returned whenever the current time value
>     > is less than the last time value. In simulated mode, since it is
>     > layered on s_time, the underlying hardware can be hpet or some other
>     > device. The frequency of the main counter in simulated mode is the
>     > same as the standard physical hpet frequency, allowing live migration
>     > between nodes that are configured differently.
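
As an illustration of the simulated mode described above, the following is a
minimal sketch of a "never go backwards" clamp over a nanosecond time source,
plus scaling to counter ticks. It is not the patch code. monotonic_ns(),
ns_to_hpet_ticks() and last_ns are hypothetical names, and SIM_HPET_HZ is an
assumed value (the 14.31818 MHz rate commonly seen on physical hpets); the
text does not state the frequency. Real code would also need locking around
last_ns.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed counter frequency; the patch text does not give the number. */
#define SIM_HPET_HZ 14318180ULL
#define NS_PER_SEC  1000000000ULL

/* Hypothetical state: the last time value handed out, in nanoseconds. */
static uint64_t last_ns;

/* Return raw_ns, but never a value below one already returned. */
static uint64_t monotonic_ns(uint64_t raw_ns)
{
    if (raw_ns < last_ns)
        return last_ns;      /* time appeared to step back: repeat last */
    last_ns = raw_ns;
    return raw_ns;
}

/* Scale nanoseconds to counter ticks at the fixed simulated frequency. */
static uint64_t ns_to_hpet_ticks(uint64_t ns)
{
    /* Split into seconds and remainder so ns * hz cannot overflow. */
    uint64_t sec = ns / NS_PER_SEC;
    uint64_t rem = ns % NS_PER_SEC;

    assert(rem < NS_PER_SEC);
    return sec * SIM_HPET_HZ + rem * SIM_HPET_HZ / NS_PER_SEC;
}
```

The clamp is also why the simulated counter can return the same value in
consecutive calls.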
>     >
>     > If the physical platform does not have an hpet device, or if xen is
>     > configured not to use the device, then the simulated method is used.
>     > If there is a physical hpet device, and xen has initialized it, then
>     > either simulated or physical mode can be used. This is governed by a
>     > boot time option, hpet-avoid. Setting this option to 1 gives the
>     > simulated mode and 0 the physical mode. The default is physical mode.
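
The selection rule in the paragraph above can be written as a small decision
function. This is only a sketch of the rule as stated; the enum, the function
name and the three boolean inputs are hypothetical and do not come from the
patch.

```c
#include <assert.h>
#include <stdbool.h>

enum hpet_read_mode { HPET_MODE_PHYSICAL, HPET_MODE_SIMULATED };

/*
 * hpet_present:     the platform has a physical hpet
 * hpet_initialized: xen is configured to use it and has initialized it
 * hpet_avoid:       value of the hpet-avoid boot option (default false)
 */
static enum hpet_read_mode choose_read_mode(bool hpet_present,
                                            bool hpet_initialized,
                                            bool hpet_avoid)
{
    /* No usable physical device: simulated mode is the only choice. */
    if (!hpet_present || !hpet_initialized)
        return HPET_MODE_SIMULATED;

    /* Usable device: hpet-avoid=1 selects simulated, 0 selects physical. */
    return hpet_avoid ? HPET_MODE_SIMULATED : HPET_MODE_PHYSICAL;
}
```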
>     >
>     > A disadvantage of the physical mode is that it may take longer to
>     > read the device than in simulated mode. On some platforms the cost is
>     > about the same (less than 250 nsec) for physical and simulated modes,
>     > while on others physical cost is much higher than simulated. A
>     > disadvantage of the simulated mode is that it can return the same
>     > value for the counter in consecutive calls.
>     >
>     > 2.2. Interrupt notification facilities.
>     >
>     > Two interrupt notification facilities are introduced, one is
>     > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
>     >
>     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
>     > vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along
>     > to vioapic_deliver() and this callback is called with a mask of the
>     > vcpus which will get the interrupt. This callback is made before any
>     > vcpus receive an interrupt.
>     >
>     > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
>     > particular vector that will be called when that vector is injected in
>     > [vmx,svm]_intr_assist() and also when the guest finishes handling the
>     > interrupt. Here finished is defined as the point when the guest
>     > re-enables interrupts or lowers the tpr value. EOI is not used as the
>     > end of interrupt as this is sometimes returned before the interrupt
>     > handler has done its work. A flag is passed to the handler indicating
>     > whether this is the injection point (post = 1) or the interrupt
>     > finished (post = 0) point. The need for the finished point callback
>     > is discussed in the missed ticks policy section.
>     >
>     > To prevent a possible early trigger of the finished callback,
>     > intr_en_notif logic has a two stage arm, the first at injection
>     > (hvm_intr_en_notif_arm()) and the second when interrupts are seen to
>     > be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
>     > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make
>     > the end of interrupt callback. hvm_intr_en_notif_arm() and
>     > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
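
The two stage arm can be pictured as a three state machine. The sketch below
is a simplified model, not the patch code: the type and function names are
hypothetical, only the "interrupts enabled" condition is modelled (the tpr
case is left out), and notif_check() stands in for the work the text assigns
to hvm_intr_en_notif_disarm().

```c
#include <assert.h>
#include <stdbool.h>

enum notif_state { NOTIF_IDLE, NOTIF_INJECTED, NOTIF_FULLY_ARMED };

/* post = 1: injection point; post = 0: interrupt finished point. */
typedef void (*notif_cb)(int post);

struct intr_notif {
    enum notif_state state;
    notif_cb cb;
};

/* Stage one: called when the vector is injected. */
static void notif_arm(struct intr_notif *n)
{
    assert(n->cb);
    n->state = NOTIF_INJECTED;
    n->cb(1);
}

/* Called on later passes with the guest's current interrupt-enable state. */
static void notif_check(struct intr_notif *n, bool guest_irqs_enabled)
{
    switch (n->state) {
    case NOTIF_INJECTED:
        /* Stage two: only arm fully once interrupts are seen disabled. */
        if (!guest_irqs_enabled)
            n->state = NOTIF_FULLY_ARMED;
        break;
    case NOTIF_FULLY_ARMED:
        /* Guest re-enabled interrupts: the handler is considered done. */
        if (guest_irqs_enabled) {
            n->state = NOTIF_IDLE;
            n->cb(0);
        }
        break;
    case NOTIF_IDLE:
        break;
    }
}

/* Example callback that just counts the two kinds of notification. */
static int injected_count, finished_count;

static void record_cb(int post)
{
    if (post)
        injected_count++;
    else
        finished_count++;
}
```

Without the second stage, a pass that still sees interrupts enabled right
after injection would fire the finished callback too early.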
>     >
>     > 3. Interrupt delivery policies
>     >
>     > The existing hpet interrupt delivery is preserved. This includes
>     > vcpu round robin delivery used by Linux and broadcast delivery used
>     > by Windows.
>     >
>     > There are two policies for interrupt delivery, one for Windows 2k8-64
>     > and the other for Linux. The Linux policy takes advantage of the
>     > (guest) Linux missed tick and offset calculations and does not
>     > attempt to deliver the right number of interrupts. The Windows policy
>     > delivers the correct number of interrupts, even if sometimes much
>     > closer to each other than the period. The policies are similar to
>     > those in vpt.c, though there are some important differences.
>     >
>     > Policies are selected with an HVMOP_set_param hypercall with index
>     > HVM_PARAM_TIMER_MODE. Two new values are added,
>     > HVM_HPET_guest_computes_missed_ticks and
>     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that two
>     > new ones are added is that in some guests (32bit Linux) a no-missed
>     > policy is needed for clock sources other than hpet and a missed ticks
>     > policy for hpet. It was felt that there would be less confusion by
>     > simply introducing the two hpet policies.
>     >
>     > 3.1. The missed ticks policy
>     >
>     > The Linux clock interrupt handler for hpet calculates missed ticks
>     > and offset using the hpet main counter. The algorithm works well when
>     > the time since the last interrupt is greater than or equal to a
>     > period and poorly otherwise.
>     >
>     > The missed ticks policy ensures that no two clock interrupts are
>     > delivered to the guest at a time interval less than a period. A time
>     > stamp (hpet main counter value) is recorded (by a callback registered
>     > with hvm_register_intr_en_notif) when Linux finishes handling the
>     > clock interrupt. Then, ensuing interrupts are delivered to the
>     > vioapic only if the current main counter value is a period greater
>     > than when the last interrupt was handled.
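
The delivery gate described above reduces to one comparison on main counter
values. A minimal sketch, with hypothetical names (struct mt_policy,
mt_on_finished(), mt_should_deliver()) that are not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct mt_policy {
    uint64_t period;        /* timer period in main counter ticks */
    uint64_t last_handled;  /* counter value when the guest last finished */
};

/* Called from the "interrupt finished" (post = 0) notification. */
static void mt_on_finished(struct mt_policy *p, uint64_t now)
{
    p->last_handled = now;
}

/* Deliver only if at least one full period has passed since then. */
static bool mt_should_deliver(const struct mt_policy *p, uint64_t now)
{
    assert(p->period > 0);
    /* Unsigned subtraction stays correct if the counter wraps. */
    return now - p->last_handled >= p->period;
}
```

For example, with a period of 1000 ticks and the last interrupt finished at
counter value 5000, nothing is delivered at 5999 and delivery resumes at 6000.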
>     >
>     > Tests showed a significant improvement in clock drift with end of
>     > interrupt time stamps versus beginning of interrupt[1]. It is
>     > believed that the reason for the improvement is that the clock
>     > interrupt handler goes for a spinlock and can therefore be delayed in
>     > its processing. Furthermore, the main counter is read by the guest
>     > under the lock. The net effect is that if we time stamp injection, we
>     > can get the difference in time between successive interrupt handler
>     > lock acquisitions to be less than the period.
>     >
>     > 3.2. The no-missed ticks policy
>     >
>     > Windows 2k8-64 keeps very poor time with the missed ticks policy. So
>     > the no-missed ticks policy was developed. In the no-missed ticks
>     > policy we deliver the correct number of interrupts, even if they are
>     > spaced less than a period apart (when catching up).
>     >
>     > Windows 2k8-64 uses a broadcast mode in the interrupt routing such
>     > that all vcpus get the clock interrupt. The best Windows drift
>     > performance was achieved when the policy code ensured that all the
>     > previous interrupts (on the various vcpus) had been injected before
>     > injecting the next interrupt to the vioapic.
>     >
>     > The policy code works as follows. It uses the hvm_isa_irq_assert_cb()
>     > to record the vcpus to be interrupted in h->hpet.pending_mask. Then,
>     > in the callback registered with hvm_register_intr_en_notif() at
>     > post=1 time it clears the current vcpu in the pending_mask. When the
>     > pending_mask is clear it decrements hpet.intr_pending_nr and if
>     > intr_pending_nr is still non-zero posts another interrupt to the
>     > ioapic with hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented
>     > in hpet_route_decision_not_missed_ticks().
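
The bookkeeping above can be modelled in a few lines. This is a sketch under
assumptions, not the patch code: struct nm_policy and the nm_*() functions are
hypothetical, nm_assert_irq() merely stands in for hvm_isa_irq_assert_cb(),
and the rule that a timer expiry posts an interrupt only when none is already
outstanding is my reading of the description rather than something the text
states.

```c
#include <assert.h>
#include <stdint.h>

struct nm_policy {
    uint32_t pending_mask;     /* vcpus still to be injected */
    unsigned intr_pending_nr;  /* ticks owed to the guest */
    unsigned asserts;          /* interrupts posted (for illustration) */
};

/* Stand-in for hvm_isa_irq_assert_cb(): record the target vcpus. */
static void nm_assert_irq(struct nm_policy *p, uint32_t target_mask)
{
    p->pending_mask = target_mask;
    p->asserts++;
}

/* Timer expiry: account for one more tick (assumed behaviour, see above). */
static void nm_tick(struct nm_policy *p, uint32_t target_mask)
{
    if (p->intr_pending_nr++ == 0)
        nm_assert_irq(p, target_mask);
}

/* post = 1 notification: the vector was injected on this vcpu. */
static void nm_injected(struct nm_policy *p, unsigned vcpu,
                        uint32_t target_mask)
{
    assert(vcpu < 32);
    p->pending_mask &= ~(1u << vcpu);
    if (p->pending_mask != 0)
        return;                /* other vcpus have not been injected yet */

    /* All vcpus injected: one tick is done; post the next if any is owed. */
    if (p->intr_pending_nr > 0 && --p->intr_pending_nr > 0)
        nm_assert_irq(p, target_mask);
}
```

With two vcpus and two ticks owed, the second interrupt is posted only after
both vcpus have been injected with the first one.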
>     >
>     > The missed ticks policy intr_en_notif callback also uses the
>     > pending_mask method. So even though Linux does not broadcast its
>     > interrupts, the code could handle it if it did. In this case the end
>     > of interrupt time stamp is made when the pending_mask is clear.
>     >
>     > 4. Live Migration
>     >
>     > Live migration with hpet preserves the current offset of the guest
>     > clock with respect to ntp. This is accomplished by migrating all of
>     > the state in the h->hpet data structure in the usual way. The
>     > hp->mc_offset is recalculated on the receiving node so that the guest
>     > sees a continuous hpet main counter.
>     >
>     > Code has been added to xc_domain_save.c to send a small message after
>     > the domain context is sent. The content of the message is the
>     > physical tsc timestamp, last_tsc, read just before the message is
>     > sent. When the last_tsc message is received in xc_domain_restore.c,
>     > another physical tsc timestamp, cur_tsc, is read. The two timestamps
>     > are loaded into the domain structure as last_tsc_sender and
>     > first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is
>     > called so that hpet_load has access to these time stamps. Hpet_load
>     > uses the timestamps to account for the time spent saving and loading
>     > the domain context. With this technique, the only neglected time is
>     > the time spent sending a small network message.
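
The offset recalculation can be shown as plain arithmetic. The sketch below
assumes the guest counter is the host counter plus an offset, and it takes the
time spent saving and loading as an already computed tick count. How hpet_load
derives that count from last_tsc_sender and first_tsc_receiver is not spelled
out in the text, so it is left as an input here. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible main counter, assuming guest = host + offset. */
static uint64_t guest_mc(uint64_t host_mc, uint64_t mc_offset)
{
    return host_mc + mc_offset;
}

/*
 * saved_guest_mc: guest counter value captured on the sending node
 * elapsed_ticks:  save plus load time, in counter ticks (assumed input)
 * host_mc_now:    receiving node's counter at load time
 */
static uint64_t recompute_mc_offset(uint64_t saved_guest_mc,
                                    uint64_t elapsed_ticks,
                                    uint64_t host_mc_now)
{
    /* Modulo 2^64 arithmetic: valid even if host_mc_now is the larger. */
    return saved_guest_mc + elapsed_ticks - host_mc_now;
}

/* Small self-check showing that the guest counter is continuous. */
static void mc_offset_selfcheck(void)
{
    uint64_t off = recompute_mc_offset(1000000, 500, 42);

    assert(guest_mc(42, off) == 1000500);
}
```

After the load, the guest counter continues from the saved value plus the
elapsed ticks, whatever the receiving node's own counter happens to read.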
>     >
>     > 5. Test Results
>     >
>     > Some recent test results are:
>     >
>     > 5.1 Linux 4u6-64 and Windows 2k8-64 load test.
>     >       Duration: 70 hours.
>     >       Test date: 6/2/08
>     >       Loads: usex -b48 on Linux; burn-in on Windows
>     >       Guest vcpus: 8 for Linux; 2 for Windows
>     >       Hardware: 8 physical cpu AMD
>     >       Clock drift : Linux: .0012% Windows: .009%
>     >
>     > 5.2 Linux 4u6-64, Linux 4u4-64, and Windows 2k8-64 no-load test
>     >       Duration: 23 hours.
>     >       Test date: 6/3/08
>     >       Loads: none
>     >       Guest vcpus: 8 for each Linux; 2 for Windows
>     >       Hardware: 4 physical cpu AMD
>     >       Clock drift : Linux: .033% Windows: .019%
>     >
>     > 6. Relation to recent work in xen-unstable
>     >
>     > There is a similarity between hvm_get_guest_time() in xen-unstable
>     > and read_64_main_counter() in this code. However,
>     > read_64_main_counter() is more tuned to the needs of hpet.c. It has
>     > no "set" operation, only the get. It isolates the mode, physical or
>     > simulated, in read_64_main_counter() itself. It uses no vcpu or
>     > domain state as it is a physical entity, in either mode. And it
>     > provides a real physical mode for every read for those applications
>     > that desire this.
>     >
>     > 7. Conclusion
>     >
>     > The virtual hpet is improved by this patch in terms of accuracy and
>     > monotonicity. Tests performed to date verify this and more testing is
>     > under way.
>     >
>     > 8. Future Work
>     >
>     > Testing with Windows Vista will be performed soon. The reason for
>     > accuracy variations on different platforms using the physical hpet
>     > device will be investigated. Additional overhead measurements on
>     > simulated vs physical hpet mode will be made.
>     >
>     > Footnotes:
>     >
>     > 1. I don't recall the accuracy improvement with end of interrupt
>     > stamping, but it was significant, perhaps better than two to one
>     > improvement. It would be a very simple matter to re-measure the
>     > improvement as the facility can call back at injection time as well.
>     >
>     >
>     > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
>     > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
>     >
>     >
>
>
>