From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Winchell
Subject: Re: [PATCH 0/2] Improve hpet accuracy
Date: Fri, 06 Jun 2008 15:33:09 -0400
Message-ID: <484990F5.4040300@virtualiron.com>
References: <20080606095323843.00000002776@djm-pc>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
In-Reply-To: <20080606095323843.00000002776@djm-pc>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "dan.magenheimer@oracle.com", Keir Fraser
Cc: Dave Winchell, xen-devel, Ben Guthro
List-Id: xen-devel@lists.xenproject.org

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64-bit
guests configured for hpet with xen-unstable as is. As we have discussed
many times, the ntp requirement is .05%. Tests on the patch we just
submitted for hpet have indicated errors of .0012% on this platform under
similar test conditions and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits.
In an overnight test with our hpet patch, the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under load,
and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave

Dan Magenheimer wrote:

> Hi Dave and Ben --
>
> When running tests on xen-unstable (without your patch), please ensure
> that hpet=1 is set in the hvm config and also I think that when hpet
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> all clock ticks must be delivered and so timer_mode should be 0).
>
> Per
> http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> it's my intent to clean this up, but I won't get to it until next week.
>
> Thanks,
> Dan
>
> -----Original Message-----
> *From:* xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] *On Behalf Of* Dave Winchell
> *Sent:* Friday, June 06, 2008 4:46 AM
> *To:* Keir Fraser; Ben Guthro; xen-devel
> *Cc:* dan.magenheimer@oracle.com; Dave Winchell
> *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Keir,
>
> I think the changes are required. We'll run some tests today so
> that we have some data to talk about.
>
> -Dave
>
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> Sent: Fri 6/6/2008 4:58 AM
> To: Ben Guthro; xen-devel
> Cc: dan.magenheimer@oracle.com
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Are these patches needed now the timers are built on Xen system time
> rather than host TSC? Dan has reported much better time-keeping with his
> patch checked in, and it's for sure a lot less invasive than this patchset.
>
> -- Keir
>
> On 5/6/08 15:59, "Ben Guthro" wrote:
>
> > 1. Introduction
> >
> > This patch improves the hpet based guest clock in terms of drift and
> > monotonicity. Prior to this work the drift with hpet was greater than 2%,
> > far above the .05% limit for ntp to synchronize. With this code, the drift
> > ranges from .001% to .0033% depending on guest and physical platform.
> >
> > Using hpet allows guest operating systems to provide monotonic time to
> > their applications. Time sources other than hpet are not monotonic because
> > of their reliance on tsc, which is not synchronized across physical
> > processors.
> >
> > Windows 2k8-64 and many Linux guests are supported with two policies, one
> > for guests that handle missed clock interrupts and the other for guests
> > that require the correct number of interrupts.
> >
> > Guests may use hpet for the timing source even if the physical platform
> > has no visible hpet.
> > Migration is supported between physical machines which differ in
> > physical hpet visibility.
> >
> > Most of the changes are in hpet.c. Two general facilities are added to
> > track interrupt progress. The ideas here and the facilities would be
> > useful in vpt.c, for other time sources, though no attempt is made here
> > to improve vpt.c.
> >
> > The following sections discuss hpet dependencies, interrupt delivery
> > policies, live migration, test results, and relation to recent work with
> > monotonic time.
> >
> > 2. Virtual Hpet dependencies
> >
> > The virtual hpet depends on the ability to read the physical or simulated
> > (see discussion below) hpet. For timekeeping, the virtual hpet also
> > depends on two new interrupt notification facilities to implement its
> > policies for interrupt delivery.
> >
> > 2.1. Two modes of low-level hpet main counter reads.
> >
> > In this implementation, the virtual hpet reads with
> > read_64_main_counter(), exported by time.c, either the real physical hpet
> > main counter register directly or a "simulated" hpet main counter.
> >
> > The simulated mode uses a monotonic version of get_s_time() (NOW()), where
> > the last time value is returned whenever the current time value is less
> > than the last time value. In simulated mode, since it is layered on
> > s_time, the underlying hardware can be hpet or some other device. The
> > frequency of the main counter in simulated mode is the same as the
> > standard physical hpet frequency, allowing live migration between nodes
> > that are configured differently.
> >
> > If the physical platform does not have an hpet device, or if xen is
> > configured not to use the device, then the simulated method is used. If
> > there is a physical hpet device, and xen has initialized it, then either
> > simulated or physical mode can be used. This is governed by a boot time
> > option, hpet-avoid.
> > Setting this option to 1 gives the simulated mode and 0 the physical
> > mode. The default is physical mode.
> >
> > A disadvantage of the physical mode is that it may take longer to read
> > the device than in simulated mode. On some platforms the cost is about
> > the same (less than 250 nsec) for physical and simulated modes, while on
> > others physical cost is much higher than simulated. A disadvantage of the
> > simulated mode is that it can return the same value for the counter in
> > consecutive calls.
> >
> > 2.2. Interrupt notification facilities.
> >
> > Two interrupt notification facilities are introduced, one is
> > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
> >
> > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
> > vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along to
> > vioapic_deliver() and this callback is called with a mask of the vcpus
> > which will get the interrupt. This callback is made before any vcpus
> > receive an interrupt.
> >
> > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
> > particular vector that will be called when that vector is injected in
> > [vmx,svm]_intr_assist() and also when the guest finishes handling the
> > interrupt. Here finished is defined as the point when the guest
> > re-enables interrupts or lowers the tpr value. EOI is not used as the end
> > of interrupt as this is sometimes returned before the interrupt handler
> > has done its work. A flag is passed to the handler indicating whether
> > this is the injection point (post = 1) or the interrupt finished
> > (post = 0) point. The need for the finished point callback is discussed
> > in the missed ticks policy section.
> >
> > To prevent a possible early trigger of the finished callback,
> > intr_en_notif logic has a two stage arm, the first at injection
> > (hvm_intr_en_notif_arm()) and the second when interrupts are seen to be
> > disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling
> > interrupts will cause hvm_intr_en_notif_disarm() to make the end of
> > interrupt callback. hvm_intr_en_notif_arm() and
> > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
> >
> > 3. Interrupt delivery policies
> >
> > The existing hpet interrupt delivery is preserved. This includes vcpu
> > round robin delivery used by Linux and broadcast delivery used by Windows.
> >
> > There are two policies for interrupt delivery, one for Windows 2k8-64 and
> > the other for Linux. The Linux policy takes advantage of the (guest) Linux
> > missed tick and offset calculations and does not attempt to deliver the
> > right number of interrupts. The Windows policy delivers the correct number
> > of interrupts, even if sometimes much closer to each other than the
> > period. The policies are similar to those in vpt.c, though there are some
> > important differences.
> >
> > Policies are selected with an HVMOP_set_param hypercall with index
> > HVM_PARAM_TIMER_MODE. Two new values are added,
> > HVM_HPET_guest_computes_missed_ticks and
> > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two new
> > ones are added is that in some guests (32bit Linux) a no-missed policy is
> > needed for clock sources other than hpet and a missed ticks policy for
> > hpet. It was felt that there would be less confusion by simply
> > introducing the two hpet policies.
> >
> > 3.1. The missed ticks policy
> >
> > The Linux clock interrupt handler for hpet calculates missed ticks and
> > offset using the hpet main counter.
> > The algorithm works well when the time since the last interrupt is
> > greater than or equal to a period and poorly otherwise.
> >
> > The missed ticks policy ensures that no two clock interrupts are delivered
> > to the guest at a time interval less than a period. A time stamp (hpet
> > main counter value) is recorded (by a callback registered with
> > hvm_register_intr_en_notif) when Linux finishes handling the clock
> > interrupt. Then, ensuing interrupts are delivered to the vioapic only if
> > the current main counter value is at least a period greater than the
> > value recorded when the last interrupt was handled.
> >
> > Tests showed a significant improvement in clock drift with end of
> > interrupt time stamps versus beginning of interrupt[1]. It is believed
> > that the reason for the improvement is that the clock interrupt handler
> > goes for a spinlock and can therefore be delayed in its processing.
> > Furthermore, the main counter is read by the guest under the lock. The net
> > effect is that if we time stamp injection, we can get the difference in
> > time between successive interrupt handler lock acquisitions to be less
> > than the period.
> >
> > 3.2. The no-missed ticks policy
> >
> > Windows 2k8-64 keeps very poor time with the missed ticks policy. So the
> > no-missed ticks policy was developed. In the no-missed ticks policy we
> > deliver the correct number of interrupts, even if they are spaced less
> > than a period apart (when catching up).
> >
> > Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
> > all vcpus get the clock interrupt. The best Windows drift performance was
> > achieved when the policy code ensured that all the previous interrupts (on
> > the various vcpus) had been injected before injecting the next interrupt
> > to the vioapic.
> >
> > The policy code works as follows.
> > It uses the hvm_isa_irq_assert_cb() to record the vcpus to be
> > interrupted in h->hpet.pending_mask. Then, in the callback registered
> > with hvm_register_intr_en_notif() at post=1 time it clears the current
> > vcpu in the pending_mask. When the pending_mask is clear it decrements
> > hpet.intr_pending_nr and if intr_pending_nr is still non-zero posts
> > another interrupt to the ioapic with hvm_isa_irq_assert_cb().
> > Intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().
> >
> > The missed ticks policy intr_en_notif callback also uses the pending_mask
> > method. So even though Linux does not broadcast its interrupts, the code
> > could handle it if it did. In this case the end of interrupt time stamp
> > is made when the pending_mask is clear.
> >
> > 4. Live Migration
> >
> > Live migration with hpet preserves the current offset of the guest clock
> > with respect to ntp. This is accomplished by migrating all of the state
> > in the h->hpet data structure in the usual way. The hp->mc_offset is
> > recalculated on the receiving node so that the guest sees a continuous
> > hpet main counter.
> >
> > Code has been added to xc_domain_save.c to send a small message after the
> > domain context is sent. The content of the message is the physical tsc
> > timestamp, last_tsc, read just before the message is sent. When the
> > last_tsc message is received in xc_domain_restore.c, another physical tsc
> > timestamp, cur_tsc, is read. The two timestamps are loaded into the
> > domain structure as last_tsc_sender and first_tsc_receiver with
> > hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load has
> > access to these time stamps. Hpet_load uses the timestamps to account for
> > the time spent saving and loading the domain context. With this
> > technique, the only neglected time is the time spent sending a small
> > network message.
> >
> > 5. Test Results
> >
> > Some recent test results are:
> >
> > 5.1 Linux 4u6-64 and Windows 2k8-64 load test.
> >     Duration: 70 hours.
> >     Test date: 6/2/08
> >     Loads: usex -b48 on Linux; burn-in on Windows
> >     Guest vcpus: 8 for Linux; 2 for Windows
> >     Hardware: 8 physical cpu AMD
> >     Clock drift: Linux: .0012%  Windows: .009%
> >
> > 5.2 Linux 4u6-64, Linux 4u4-64, and Windows 2k8-64 no-load test
> >     Duration: 23 hours.
> >     Test date: 6/3/08
> >     Loads: none
> >     Guest vcpus: 8 for each Linux; 2 for Windows
> >     Hardware: 4 physical cpu AMD
> >     Clock drift: Linux: .033%  Windows: .019%
> >
> > 6. Relation to recent work in xen-unstable
> >
> > There is a similarity between hvm_get_guest_time() in xen-unstable and
> > read_64_main_counter() in this code. However, read_64_main_counter() is
> > more tuned to the needs of hpet.c. It has no "set" operation, only the
> > get. It isolates the mode, physical or simulated, in
> > read_64_main_counter() itself. It uses no vcpu or domain state as it is a
> > physical entity, in either mode. And it provides a real physical mode for
> > every read for those applications that desire this.
> >
> > 7. Conclusion
> >
> > The virtual hpet is improved by this patch in terms of accuracy and
> > monotonicity. Tests performed to date verify this and more testing is
> > under way.
> >
> > 8. Future Work
> >
> > Testing with Windows Vista will be performed soon. The reason for
> > accuracy variations on different platforms using the physical hpet device
> > will be investigated. Additional overhead measurements on simulated vs
> > physical hpet mode will be made.
> >
> > Footnotes:
> >
> > 1. I don't recall the accuracy improvement with end of interrupt stamping,
> > but it was significant, perhaps better than two to one improvement. It
> > would be a very simple matter to re-measure the improvement as the
> > facility can call back at injection time as well.
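
For readers following section 3.1, here is a minimal sketch in C of the
delivery gate that the missed ticks policy describes: record the main
counter when the guest finishes handling a clock interrupt, and deliver
the next one only once a full period has elapsed since then. The names
struct tick_gate, gate_on_finished() and gate_may_deliver() are
hypothetical and are not taken from the patch. The real code in hpet.c
also has to deal with the vioapic, locking and the pending_mask, none of
which is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state; not the structure used in hpet.c. */
struct tick_gate {
    uint64_t period;        /* timer period, in main counter ticks */
    uint64_t last_finished; /* counter value when the guest finished
                               handling the previous interrupt */
};

/* Called at the "interrupt finished" point (post = 0 in the text). */
static void gate_on_finished(struct tick_gate *g, uint64_t main_counter)
{
    g->last_finished = main_counter;
}

/* True if at least one full period has elapsed since the guest
 * finished handling the previous clock interrupt. */
static bool gate_may_deliver(const struct tick_gate *g, uint64_t main_counter)
{
    assert(g->period > 0);
    /* Unsigned subtraction keeps the comparison valid if the 64-bit
     * counter wraps between the two readings. */
    return (main_counter - g->last_finished) >= g->period;
}
```

With a period of 100 ticks and a finish time stamp of 1000, a delivery
attempt at 1050 is held back, while attempts at 1100 or later go through.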
> >
> > Signed-off-by: Dave Winchell
> > Signed-off-by: Ben Guthro
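
The simulated main counter in section 2.1 is described as a monotonic
wrapper that returns the last value whenever the current time value is
less than the last one. A minimal sketch of that clamp in C follows. The
names struct mono_counter and mono_read() are hypothetical, the raw time
value is passed in rather than read from get_s_time(), and the locking
or atomic update that concurrent readers would need is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: the last value handed out to any reader. */
struct mono_counter {
    uint64_t last;
};

/* Return a value that never goes backwards. If the raw reading is
 * behind the last returned value, repeat the last value instead. */
static uint64_t mono_read(struct mono_counter *c, uint64_t raw_now)
{
    assert(c != NULL);
    if (raw_now < c->last)
        return c->last;   /* raw time stepped back: hold the old value */
    c->last = raw_now;
    return raw_now;
}
```

Feeding raw readings 10, 8, 12 through mono_read() yields 10, 10, 12,
which also shows the disadvantage noted in section 2.1: consecutive
calls can return the same value.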