From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Winchell Subject: Re: [PATCH 0/2] Improve hpet accuracy Date: Wed, 11 Jun 2008 09:58:00 -0400 Message-ID: <484FD9E8.9030307@virtualiron.com> References: <20080610194420109.00000041832@djm-pc> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Return-path: In-Reply-To: <20080610194420109.00000041832@djm-pc> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: "dan.magenheimer@oracle.com" Cc: Dave Winchell , xen-devel , Keir Fraser , Ben Guthro List-Id: xen-devel@lists.xenproject.org Dan Magenheimer wrote: >>In EL5u1-32 however it looks like the fractions are accounted >>for. Indeed the EL5u1-32 "lost tick handling" code resembles >>the Linux/ia64 code which is what I've always assumed was >>the "missed tick" model. In this case, I think no policy >>is necessary and the measured skew should be identical to >>any physical hpet skew. I'll have to test this hypothesis though. >> >> > >I've tested this hypothesis and it seems to hold true. >This means the existing (unpatched) hpet code works fine >on EL5-32bit (vcpus=1) when hpet is the clocksource, >even when the machine is overcommitted. A second hypothesis >still needs to be tested that Dave's patch will not make this worse. > > Interesting, thanks for pointing this out and confirming. >(Note that per previous discussion, my EL5u1-32bit guest >running on an Intel dual-core physical box chose tsc as >the best clocksource and I had to override it with >clock=hpet in the kernel command line.) > > Is there one setting for all Linux guests that makes them choose hpet? Is it "clock=hpet clocksource=hpet"? I know you wrote at length about this before. > > >>Yes, that makes sense and concurs with what I remember from >>the EL4u5-32 code. 
If this is true, one would expect the >>default "no missed tick" policy to see time moving faster >>than an external source -- the first missed tick delivered >>after a long sleep would "catch up" and then the remainder >>would each add another tick. >> >> > >Indeed with the existing (unpatched) hpet code, time is >running faster on EL4u5-32 (vcpus=1, when overcommitted). >So Dave's patch is definitely needed here. > > It's good to get the verification of this. thanks, Dave >Will try 64-bit next. > >Dan > > > >>-----Original Message----- >>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] >>Sent: Monday, June 09, 2008 9:21 PM >>To: 'Dave Winchell'; 'Keir Fraser' >>Cc: 'xen-devel'; 'Ben Guthro' >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >> >> >> >> >>>I'll tell you what I recall about this. Tomorrow I'll check the >>>guest code to verify. I think that Linux declares a full tick, >>>even if the interrupt is early. That's the problem. >>> >>> >>Yes, that makes sense and concurs with what I remember from >>the EL4u5-32 code. If this is true, one would expect the >>default "no missed tick" policy to see time moving faster >>than an external source -- the first missed tick delivered >>after a long sleep would "catch up" and then the remainder >>would each add another tick. >> >> >> >>>On the other hand, if the interrupt is late it in effect declares >>>a tick plus fraction. If it just declared the fraction in >>> >>> >>the first place, >> >> >>>we could deliver the interrupts whenever we wanted. >>> >>> >>My read of the EL4u5-32 code is that the fraction is discarded >>and a new tick period commences at "now", so the fractions >>eventually accumulate as lost time. >> >>In EL5u1-32 however it looks like the fractions are accounted >>for. Indeed the EL5u1-32 "lost tick handling" code resembles >>the Linux/ia64 code which is what I've always assumed was >>the "missed tick" model.
In this case, I think no policy >>is necessary and the measured skew should be identical to >>any physical hpet skew. I'll have to test this hypothesis though. >> >>-----Original Message----- >>From: xen-devel-bounces@lists.xensource.com >>[mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of >>Dave Winchell >>Sent: Monday, June 09, 2008 5:35 PM >>To: dan.magenheimer@oracle.com; Keir Fraser >>Cc: Dave Winchell; xen-devel; Ben Guthro >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >> >> >> >> >>>>The Linux policy is more subtle, but is required to go >>>>from .1% to .03%. >>>> >>>> >>>Thanks for the good documentation which I hadn't thoroughly >>>read until now. >>>I now understand that the essence of your >>>hpet missed ticks policy is to ensure that ticks are never >>>delivered too close together. But I'm trying to understand >>>WHY your patch works, in other words, what problem it is >>>countering. >>> >>> >>I'll tell you what I recall about this. Tomorrow I'll check the >>guest code to verify. I think that Linux declares a full tick, >>even if the interrupt is early. That's the problem. >>On the other hand, if the interrupt is late it in effect declares >>a tick plus fraction. If it just declared the fraction in the >>first place, >>we could deliver the interrupts whenever we wanted. >> >>It's really not that different from the missed ticks policy in vpt.c >>except that the period in vpt.c is based on start of interrupt >>and I have improved that with end-of-interrupt as described >>in the patch note. >> >>I don't recall what prompted me to try end-of-interrupt, >>but I saw a significant improvement. I may have been running >>a monotonicity test at the same time to explain the lock >>contention mentioned in the write-up.
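As a rough illustration of the behavior Dave describes above (a guest that credits a full tick for an early interrupt and a tick plus fraction for a late one), the toy model below compares guest time with real time. It is only a sketch under that stated assumption, not the actual Linux handler; `guest_time_after` and its parameters are hypothetical names.

```python
def guest_time_after(arrivals, period):
    """Toy model: each interrupt credits max(period, time since the last one)."""
    guest = 0.0
    last = 0.0
    for now in arrivals:
        elapsed = now - last
        # An early interrupt still counts as a full tick; a late one
        # counts as a tick plus the extra fraction.
        guest += max(period, elapsed)
        last = now
    return guest


period = 1.0
early = [0.5, 1.0, 1.5, 2.0]  # interrupts spaced closer than a period
late = [1.5, 3.0, 4.5]        # interrupts spaced wider than a period

fast = guest_time_after(early, period)   # 4.0 credited over 2.0 of real time
exact = guest_time_after(late, period)   # 4.5 credited over 4.5 of real time
```

In this model only the closely spaced interrupts make the guest clock run fast, which is consistent with a policy that never delivers ticks less than a period apart.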
>> >> >> >>> I care about this for more reasons than just >>>because it is interesting: (1) I'd like to feel confident that >>>it is fixing a bug rather than just a symptom of a bug; >>>and (2) I wonder how universally it is applicable. >>> >>> >>It's worked well with my small set of guests. You and our >>QA are going to tell us about the wider set. It doesn't >>matter if guest A handles interrupts closely spaced or not, >>just whether it handles them far apart. So it should be pretty >>universal with guests that really handle missed ticks. >>I think it's interesting that some 32bit Linux guests handle >>missed ticks for hpet. >> >> >> >>>I see from code examination in mark_offset_hpet() in >>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that >>>the correction for lost ticks is just plain wrong in >>>a virtual environment. (Suppose for example that a virtual >>>tick was delivered every 1.999*hpet_tick... I think >>>the clock would be off by 50%!) Is this the bug that >>>is being "countered" by your policy? >>> >>> >>I haven't looked at that code, perhaps. >>I'll check it tomorrow. >> >> >> >>>However, the lost tick handling in RHEL5u1/kernel/timer.c >>>(which I think is used also for hpet) is much better >>>so I am eager to find out if your policy works there >>>too. >>>If the hpet missed tick policy works for both, though, >>>I should be happy, though I wonder about upstream kernels >>>(e.g. the trend toward tickless). >>> >>> >>I wasn't aware of this trend. If it's robust, however, it should >>handle late interrupts ...
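The 1.999*hpet_tick scenario quoted above can be checked with a little arithmetic. The sketch below assumes, as a simplification and not as a reading of the real mark_offset_hpet() code, that the guest credits only the whole ticks elapsed at each interrupt and discards the remainder; `credited_ticks` is a hypothetical helper.

```python
def credited_ticks(interval_in_ticks, n_interrupts):
    """Toy model: each interrupt credits only the whole ticks elapsed."""
    whole = int(interval_in_ticks)  # the fraction of a tick is discarded
    return whole * n_interrupts


n = 1000
interval = 1.999                     # a virtual tick every 1.999 * hpet_tick
real = interval * n                  # real time elapsed, in ticks
guest = credited_ticks(interval, n)  # guest time credited, in ticks
ratio = guest / real                 # close to 0.5: about half the time is lost
```

Under that assumption the guest clock advances at roughly half the real rate, which matches the "off by 50%" estimate.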
>> >> >> >>>That said, I'd rather >>>see this get into Xen 3.3 and worry about upstream kernels >>>later :-) >>> >>> >>Regards, >>Dave >> >> >> >>-----Original Message----- >>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] >>Sent: Mon 6/9/2008 6:02 PM >>To: Dave Winchell; Keir Fraser >>Cc: Ben Guthro; xen-devel >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >> >> >> >>>The Linux policy is more subtle, but is required to go >>>from .1% to .03%. >>> >>> >>Thanks for the good documentation which I hadn't thoroughly >>read until now. I now understand that the essence of your >>hpet missed ticks policy is to ensure that ticks are never >>delivered too close together. But I'm trying to understand >>WHY your patch works, in other words, what problem it is >>countering. I care about this for more reasons than just >>because it is interesting: (1) I'd like to feel confident that >>it is fixing a bug rather than just a symptom of a bug; >>and (2) I wonder how universally it is applicable. >> >>I see from code examination in mark_offset_hpet() in >>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that >>the correction for lost ticks is just plain wrong in >>a virtual environment. (Suppose for example that a virtual >>tick was delivered every 1.999*hpet_tick... I think >>the clock would be off by 50%!) Is this the bug that >>is being "countered" by your policy? >> >>However, the lost tick handling in RHEL5u1/kernel/timer.c >>(which I think is used also for hpet) is much better >>so I am eager to find out if your policy works there >>too. >> >>If the hpet missed tick policy works for both, though, >>I should be happy, though I wonder about upstream kernels >>(e.g. the trend toward tickless). 
That said, I'd rather >>see this get into Xen 3.3 and worry about upstream kernels >>later :-) >> >>-----Original Message----- >>From: Dave Winchell [mailto:dwinchell@virtualiron.com] >>Sent: Sunday, June 08, 2008 2:32 PM >>To: dan.magenheimer@oracle.com; Keir Fraser >>Cc: Ben Guthro; xen-devel; Dave Winchell >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >> >> >>Hi Dan, >> >> >> >>>While I am fully supportive of offering hardware hpet as an option >>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very >>>surprised by your preliminary results; the most obvious conclusion >>>is that Xen system time is losing time at the rate of 1000 PPM >>>though it's possible there's a bug somewhere else in the "time >>>stack". Your Windows result is jaw-dropping and inexplicable, >>>though I have to admit ignorance of how Windows manages time. >>> >>> >>I think xen system time is fine. You have to add the interrupt >>delivery policies described in the write-up for the patch to get >>accurate timekeeping in the guest. >> >>The windows policy is obvious and results in a large improvement >>in accuracy. The Linux policy is more subtle, but is required to go >>from .1% to .03%. >> >> >> >>>I think with my recent patch and hpet=1 (essentially the same as >>>your emulated hpet), hvm guest time should track Xen system time. >>>I wonder if domain0 (which if I understand correctly is directly >>>using Xen system time) is also seeing an error of .1%? Also >>>I wonder for the skew you are seeing (in both hvm guests and >>>domain0) is time moving too fast or too slow? >>> >>> >>I don't recall the direction. I can look it up in my notes at work >>tomorrow. >> >> >> >>>Although hwhpet=1 is a fine alternative in many cases, it may >>>be unavailable on some systems and may cause significant performance >>>issues on others. So I think we will still need to track down >>>the poor accuracy when hwhpet=0.
>>> >>> >>Our patch is accurate to < .03% using the physical hpet mode or >>the simulated mode. >> >> >> >>>And if for some reason >>>Xen system time can't be made accurate enough (< 0.05%), then >>>I think we should consider building Xen system time itself on >>>top of hardware hpet instead of TSC... at least when Xen discovers >>>a capable hpet. >>> >>> >>In our experience, Xen system time is accurate enough now. >> >> >> >>>One more thought... do you know the accuracy of the TSC crystals >>>on your test systems? I posted a patch awhile ago that was >>>intended to test that, though I guess it was only testing skew >>>of different TSCs on the same system, not TSCs against an >>>external time source. >>> >>> >>I do not know the tsc accuracy. >> >> >> >>>Or maybe there's a computation error somewhere in the hvm hpet >>>scaling code? Hmmm... >>> >>> >>Regards, >>Dave >> >> >>-----Original Message----- >>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] >>Sent: Fri 6/6/2008 4:29 PM >>To: Dave Winchell; Keir Fraser >>Cc: Ben Guthro; xen-devel >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >> >>Dave -- >> >>Thanks much for posting the preliminary results! >> >>While I am fully supportive of offering hardware hpet as an option >>for hvm guests (let's call it hwhpet=1 for shorthand), I am very >>surprised by your preliminary results; the most obvious conclusion >>is that Xen system time is losing time at the rate of 1000 PPM >>though its possible there's a bug somewhere else in the "time >>stack". Your Windows result is jaw-dropping and inexplicable, >>though I have to admit ignorance of how Windows manages time. >> >> >>I think with my recent patch and hpet=1 (essentially the same as >>your emulated hpet), hvm guest time should track Xen system time. >>I wonder if domain0 (which if I understand correctly is directly >>using Xen system time) is also seeing an error of .1%? 
Also >>I wonder for the skew you are seeing (in both hvm guests and >>domain0) is time moving too fast or too slow? >> >>Although hwhpet=1 is a fine alternative in many cases, it may >>be unavailable on some systems and may cause significant performance >>issues on others. So I think we will still need to track down >>the poor accuracy when hwhpet=0. And if for some reason >>Xen system time can't be made accurate enough (< 0.05%), then >>I think we should consider building Xen system time itself on >>top of hardware hpet instead of TSC... at least when Xen discovers >>a capable hpet. >> >>One more thought... do you know the accuracy of the TSC crystals >>on your test systems? I posted a patch awhile ago that was >>intended to test that, though I guess it was only testing skew >>of different TSCs on the same system, not TSCs against an >>external time source. >> >>Or maybe there's a computation error somewhere in the hvm hpet >>scaling code? Hmmm... >> >>Thanks, >>Dan >> >> >> >>>-----Original Message----- >>>From: Dave Winchell [mailto:dwinchell@virtualiron.com] >>>Sent: Friday, June 06, 2008 1:33 PM >>>To: dan.magenheimer@oracle.com; Keir Fraser >>>Cc: Ben Guthro; xen-devel; Dave Winchell >>>Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >>> >>> >>>Dan, Keir: >>> >>>Preliminary test results indicate an error of .1% for Linux 64 bit >>>guests configured >>>for hpet with xen-unstable as is. As we have discussed many >>> >>> >>times, the >> >> >>>ntp requirement is .05%. >>>Tests on the patch we just submitted for hpet have >>> >>> >>indicated errors of >> >> >>>.0012% >>>on this platform under similar test conditions and .03% on >>>other platforms. >>> >>>Windows vista64 has an error of 11% using hpet with the >>>xen-unstable bits. >>>In an overnight test with our hpet patch, the Windows vista >>>error was .008%. >>> >>>The tests are with two or three guests on a physical node, all under >>>load, and with >>>the ratio of vcpus to phys cpus > 1.
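To put the percentages quoted above on a common scale, the short sketch below converts each drift figure to parts per million and to seconds of error per day. The helper names are hypothetical; the figures themselves are the ones reported in this message.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_ppm(percent):
    """Convert a drift given in percent to parts per million."""
    return percent * 10_000


def drift_seconds_per_day(percent):
    """Clock error accumulated over one day at the given percent drift."""
    return percent / 100 * SECONDS_PER_DAY


figures = {
    "xen-unstable, Linux 64 bit": 0.1,
    "ntp requirement": 0.05,
    "hpet patch, this platform": 0.0012,
    "hpet patch, other platforms": 0.03,
    "xen-unstable, Windows vista64": 11.0,
    "hpet patch, Windows vista": 0.008,
}

for name, pct in figures.items():
    print(f"{name}: {drift_ppm(pct):.0f} PPM, "
          f"{drift_seconds_per_day(pct):.2f} s/day")
```

For example, .1% is the same 1000 PPM mentioned earlier in the thread, or about 86.4 seconds per day, while the .05% ntp limit is 500 PPM, or about 43.2 seconds per day.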
>>> >>>I will continue to run tests over the next few days. >>> >>>thanks, >>>Dave >>> >>> >>>Dan Magenheimer wrote: >>> >>> >>> >>>>Hi Dave and Ben -- >>>> >>>>When running tests on xen-unstable (without your patch), >>>> >>>> >>>please ensure >>> >>> >>>>that hpet=1 is set in the hvm config and also I think >>>> >>>> >>that when hpet >> >> >>>>is the clocksource on RHEL4-32, the clock IS resilient to >>>> >>>> >>>missed ticks >>> >>> >>>>so timer_mode should be 2 (vs when pit is the clocksource >>>> >>>> >>>on RHEL4-32, >>> >>> >>>>all clock ticks must be delivered and so timer_mode should be 0). >>>> >>>>Per >>>> >>>> >>>> >>>http://lists.xensource.com/archives/html/xen-devel/2008-06/msg >>>00098.html it's >>> >>> >>>>my intent to clean this up, but I won't get to it until next week. >>>> >>>>Thanks, >>>>Dan >>>> >>>> -----Original Message----- >>>> *From:* xen-devel-bounces@lists.xensource.com >>>> [mailto:xen-devel-bounces@lists.xensource.com]*On >>>> >>>> >>>Behalf Of *Dave >>> >>> >>>> Winchell >>>> *Sent:* Friday, June 06, 2008 4:46 AM >>>> *To:* Keir Fraser; Ben Guthro; xen-devel >>>> *Cc:* dan.magenheimer@oracle.com; Dave Winchell >>>> *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >>>> >>>> Keir, >>>> >>>> I think the changes are required. We'll run some tests >>>> >>>> >>>today today so >>> >>> >>>> that we have some data to talk about. >>>> >>>> -Dave >>>> >>>> >>>> -----Original Message----- >>>> From: xen-devel-bounces@lists.xensource.com on behalf >>>> >>>> >>>of Keir Fraser >>> >>> >>>> Sent: Fri 6/6/2008 4:58 AM >>>> To: Ben Guthro; xen-devel >>>> Cc: dan.magenheimer@oracle.com >>>> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy >>>> >>>> Are these patches needed now the timers are built on >>>> >>>> >>Xen system >> >> >>>> time rather >>>> than host TSC? 
Dan has reported much better >>>> >>>> >>>time-keeping with his >>> >>> >>>> patch >>>> checked in, and it's for sure a lot less invasive than >>>> >>>> >>>this patchset. >>> >>> >>>> -- Keir >>>> >>>> On 5/6/08 15:59, "Ben Guthro" wrote: >>>> >>>> > >>>> > 1. Introduction >>>> > >>>> > This patch improves the hpet based guest clock in >>>> >>>> >>>terms of drift and >>> >>> >>>> > monotonicity. >>>> > Prior to this work the drift with hpet was greater >>>> >>>> >>>than 2%, far >>> >>> >>>> above the .05% >>>> > limit >>>> > for ntp to synchronize. With this code, the drift >>>> >>>> >>ranges from >> >> >>>> .001% to .0033% >>>> > depending >>>> > on guest and physical platform. >>>> > >>>> > Using hpet allows guest operating systems to >>>> >>>> >>provide monotonic >> >> >>>> time to their >>>> > applications. Time sources other than hpet are not >>>> >>>> >>>monotonic because >>> >>> >>>> > of their reliance on tsc, which is not synchronized >>>> >>>> >>>across physical >>> >>> >>>> > processors. >>>> > >>>> > Windows 2k864 and many Linux guests are supported with two >>>> policies, one for >>>> > guests >>>> > that handle missed clock interrupts and the other for guests >>>> that require the >>>> > correct number of interrupts. >>>> > >>>> > Guests may use hpet for the timing source even if >>>> >>>> >>the physical >> >> >>>> platform has no >>>> > visible >>>> > hpet. Migration is supported between physical machines which >>>> differ in >>>> > physical >>>> > hpet visibility. >>>> > >>>> > Most of the changes are in hpet.c. Two general >>>> >>>> >>facilities are >> >> >>>> added to track >>>> > interrupt >>>> > progress. The ideas here and the facilities would >>>> >>>> >>be useful in >> >> >>>> vpt.c, for >>>> > other time >>>> > sources, though no attempt is made here to improve vpt.c.
>>>> > >>>> > The following sections discuss hpet dependencies, interrupt >>>> delivery policies, >>>> > live migration, >>>> > test results, and relation to recent work with >>>> >>>> >>monotonic time. >> >> >>>> > >>>> > >>>> > 2. Virtual Hpet dependencies >>>> > >>>> > The virtual hpet depends on the ability to read the >>>> >>>> >>>physical or >>> >>> >>>> simulated >>>> > (see discussion below) hpet. For timekeeping, the >>>> >>>> >>>virtual hpet >>> >>> >>>> also depends >>>> > on two new interrupt notification facilities to >>>> >>>> >>implement its >> >> >>>> policies for >>>> > interrupt delivery. >>>> > >>>> > 2.1. Two modes of low-level hpet main counter reads. >>>> > >>>> > In this implementation, the virtual hpet reads with >>>> read_64_main_counter(), >>>> > exported by >>>> > time.c, either the real physical hpet main counter register >>>> directly or a >>>> > "simulated" >>>> > hpet main counter. >>>> > >>>> > The simulated mode uses a monotonic version of get_s_time() >>>> (NOW()), where the >>>> > last >>>> > time value is returned whenever the current time >>>> >>>> >>value is less >> >> >>>> than the last >>>> > time >>>> > value. In simulated mode, since it is layered on s_time, the >>>> underlying >>>> > hardware >>>> > can be hpet or some other device. The frequency of the main >>>> counter in >>>> > simulated >>>> > mode is the same as the standard physical hpet frequency, >>>> allowing live >>>> > migration >>>> > between nodes that are configured differently. >>>> > >>>> > If the physical platform does not have an hpet >>>> >>>> >>>device, or if xen >>> >>> >>>> is configured >>>> > not >>>> > to use the device, then the simulated method is >>>> >>>> >>used. If there >> >> >>>> is a physical >>>> > hpet device, >>>> > and xen has initialized it, then either simulated >>>> >>>> >>or physical >> >> >>>> mode can be >>>> > used. >>>> > This is governed by a boot time option, hpet-avoid. 
>>>> >>>> >>>Setting this >>> >>> >>>> option to 1 >>>> > gives the >>>> > simulated mode and 0 the physical mode. The default >>>> >>>> >>>is physical >>> >>> >>>> mode. >>>> > >>>> > A disadvantage of the physical mode is that it may >>>> >>>> >>take longer to >> >> >>>> read the device >>>> > than in simulated mode. On some platforms the cost is >>>> >>>> >>>about the >>> >>> >>>> same (less >>>> > than 250 nsec) for >>>> > physical and simulated modes, while on others >>>> >>>> >>physical cost is >> >> >>>> much higher >>>> > than simulated. >>>> > A disadvantage of the simulated mode is that it can >>>> >>>> >>return the >> >> >>>> same value >>>> > for the counter in consecutive calls. >>>> > >>>> > 2.2. Interrupt notification facilities. >>>> > >>>> > Two interrupt notification facilities are introduced, one is >>>> > hvm_isa_irq_assert_cb() >>>> > and the other hvm_register_intr_en_notif(). >>>> > >>>> > The vhpet uses hvm_isa_irq_assert_cb to deliver >>>> >>>> >>interrupts to >> >> >>>> the vioapic. >>>> > hvm_isa_irq_assert_cb allows a callback to be >>>> >>>> >>passed along to >> >> >>>> > vioapic_deliver() >>>> > and this callback is called with a mask of the vcpus >>>> >>>> >>>which will >>> >>> >>>> get the >>>> > interrupt. This callback is made before any vcpus receive an >>>> interrupt. >>>> > >>>> > Vhpet uses hvm_register_intr_en_notif() to register >>>> >>>> >>a handler >> >> >>>> for a particular >>>> > vector that will be called when that vector is injected in >>>> > [vmx,svm]_intr_assist() >>>> > and also when the guest finishes handling the >>>> >>>> >>interrupt. Here >> >> >>>> finished is >>>> > defined >>>> > as the point when the guest re-enables interrupts or >>>> >>>> >>>lowers the >>> >>> >>>> tpr value. >>>> > EOI is not used as the end of interrupt as this is sometimes >>>> returned before >>>> > the interrupt handler has done its work.
A flag is >>>> >>>> >>>passed to the >>> >>> >>>> handler >>>> > indicating >>>> > whether this is the injection point (post = 1) or the >>>> >>>> >>>interrupt >>> >>> >>>> finished (post >>>> > = 0) point. >>>> > The need for the finished point callback is discussed in the >>>> missed ticks >>>> > policy section. >>>> > >>>> > To prevent a possible early trigger of the finished >>>> >>>> >>callback, >> >> >>>> intr_en_notif >>>> > logic >>>> > has a two stage arm, the first at injection >>>> (hvm_intr_en_notif_arm()) and the >>>> > second when >>>> > interrupts are seen to be disabled >>>> >>>> >>>(hvm_intr_en_notif_disarm()). >>> >>> >>>> Once fully >>>> > armed, re-enabling >>>> > interrupts will cause hvm_intr_en_notif_disarm() to >>>> >>>> >>>make the end >>> >>> >>>> of interrupt >>>> > callback. hvm_intr_en_notif_arm() and >>>> >>>> >>>hvm_intr_en_notif_disarm() >>> >>> >>>> are called by >>>> > [vmx,svm]_intr_assist(). >>>> > >>>> > 3. Interrupt delivery policies >>>> > >>>> > The existing hpet interrupt delivery is preserved. >>>> >>>> >>>This includes >>> >>> >>>> > vcpu round robin delivery used by Linux and >>>> >>>> >>broadcast delivery >> >> >>>> used by >>>> > Windows. >>>> > >>>> > There are two policies for interrupt delivery, one >>>> >>>> >>for Windows >> >> >>>> 2k8-64 and the >>>> > other >>>> > for Linux. The Linux policy takes advantage of the >>>> >>>> >>>(guest) Linux >>> >>> >>>> missed tick >>>> > and offset >>>> > calculations and does not attempt to deliver the >>>> >>>> >>>right number of >>> >>> >>>> interrupts. >>>> > The Windows policy delivers the correct number of >>>> >>>> >>interrupts, >> >> >>>> even if >>>> > sometimes much >>>> > closer to each other than the period. The policies >>>> >>>> >>are similar >> >> >>>> to those in >>>> > vpt.c, though >>>> > there are some important differences. >>>> > >>>> > Policies are selected with an HVMOP_set_param >>>> >>>> >>>hypercall with index >>> >>> >>>> > HVM_PARAM_TIMER_MODE. 
>>>> > Two new values are added, >>>> >>>> >>>HVM_HPET_guest_computes_missed_ticks and >>> >>> >>>> > HVM_HPET_guest_does_not_compute_missed_ticks. The >>>> >>>> >>reason that >> >> >>>> two new ones >>>> > are added is that >>>> > in some guests (32bit Linux) a no-missed policy is >>>> >>>> >>needed for >> >> >>>> clock sources >>>> > other than hpet >>>> > and a missed ticks policy for hpet. It was felt that >>>> >>>> >>>there would >>> >>> >>>> be less >>>> > confusion by simply >>>> > introducing the two hpet policies. >>>> > >>>> > 3.1. The missed ticks policy >>>> > >>>> > The Linux clock interrupt handler for hpet calculates missed >>>> ticks and offset >>>> > using the hpet >>>> > main counter. The algorithm works well when the >>>> >>>> >>time since the >> >> >>>> last interrupt >>>> > is greater than >>>> > or equal to a period and poorly otherwise. >>>> > >>>> > The missed ticks policy ensures that no two clock >>>> >>>> >>>interrupts are >>> >>> >>>> delivered to >>>> > the guest at >>>> > a time interval less than a period. A time stamp (hpet main >>>> counter value) is >>>> > recorded (by a >>>> > callback registered with hvm_register_intr_en_notif) >>>> >>>> >>>when Linux >>> >>> >>>> finishes >>>> > handling the clock >>>> > interrupt. Then, ensuing interrupts are delivered to >>>> >>>> >>>the vioapic >>> >>> >>>> only if the >>>> > current main >>>> > counter value is a period greater than when the >>>> >>>> >>last interrupt >> >> >>>> was handled. >>>> > >>>> > Tests showed a significant improvement in clock >>>> >>>> >>drift with end >> >> >>>> of interrupt >>>> > time stamps >>>> > versus beginning of interrupt[1]. It is believed that >>>> >>>> >>>the reason >>> >>> >>>> for the >>>> > improvement >>>> > is that the clock interrupt handler goes for a >>>> >>>> >>>spinlock and can >>> >>> >>>> be therefore >>>> > delayed in its >>>> > processing. Furthermore, the main counter is read >>>> >>>> >>by the guest >> >> >>>> under the lock. 
>>>> > The net >>>> > effect is that if we time stamp injection, we can get the >>>> difference in time >>>> > between successive interrupt handler lock acquisitions to be >>>> less than the >>>> > period. >>>> > >>>> > 3.2. The no-missed ticks policy >>>> > >>>> > Windows 2k864 keeps very poor time with the missed >>>> >>>> >>>ticks policy. >>> >>> >>>> So the >>>> > no-missed ticks policy >>>> > was developed. In the no-missed ticks policy we deliver the >>>> correct number of >>>> > interrupts, >>>> > even if they are spaced less than a period apart >>>> >>>> >>>(when catching up). >>> >>> >>>> > >>>> > Windows 2k864 uses a broadcast mode in the interrupt routing >>>> such that >>>> > all vcpus get the clock interrupt. The best Windows drift >>>> performance was >>>> > achieved when the >>>> > policy code ensured that all the previous interrupts (on the >>>> various vcpus) >>>> > had been injected >>>> > before injecting the next interrupt to the vioapic.. >>>> > >>>> > The policy code works as follows. It uses the >>>> hvm_isa_irq_assert_cb() to >>>> > record >>>> > the vcpus to be interrupted in >>>> >>>> >>h->hpet.pending_mask. Then, in >> >> >>>> the callback >>>> > registered >>>> > with hvm_register_intr_en_notif() at post=1 time it >>>> >>>> >>clears the >> >> >>>> current vcpu in >>>> > the pending_mask. >>>> > When the pending_mask is clear it decrements >>>> hpet.intr_pending_nr and if >>>> > intr_pending_nr is still >>>> > non-zero posts another interrupt to the ioapic with >>>> hvm_isa_irq_assert_cb(). >>>> > Intr_pending_nr is incremented in >>>> hpet_route_decision_not_missed_ticks(). >>>> > >>>> > The missed ticks policy intr_en_notif callback also uses the >>>> pending_mask >>>> > method. So even though >>>> > Linux does not broadcast its interrupts, the code >>>> >>>> >>could handle >> >> >>>> it if it did. >>>> > In this case the end of interrupt time stamp is >>>> >>>> >>made when the >> >> >>>> pending_mask is >>>> > clear. >>>> > >>>> > 4. 
Live Migration >>>> > >>>> > Live migration with hpet preserves the current offset of the >>>> guest clock with >>>> > respect >>>> > to ntp. This is accomplished by migrating all of >>>> >>>> >>the state in >> >> >>>> the h->hpet data >>>> > structure >>>> > in the usual way. The hp->mc_offset is recalculated on the >>>> receiving node so >>>> > that the >>>> > guest sees a continuous hpet main counter. >>>> > >>>> > Code has been added to xc_domain_save.c to send a >>>> >>>> >>small message >> >> >>>> after the >>>> > domain context is sent. The content of the message is the >>>> physical tsc >>>> > timestamp, last_tsc, >>>> > read just before the message is sent. When the >>>> >>>> >>>last_tsc message >>> >>> >>>> is received in >>>> > xc_domain_restore.c, >>>> > another physical tsc timestamp, cur_tsc, is read. The two >>>> timestamps are >>>> > loaded into the domain >>>> > structure as last_tsc_sender and first_tsc_receiver with >>>> hypercalls. Then >>>> > xc_domain_hvm_setcontext >>>> > is called so that hpet_load has access to these time stamps. >>>> Hpet_load uses >>>> > the timestamps >>>> > to account for the time spent saving and loading the domain >>>> context. With this >>>> > technique, >>>> > the only neglected time is the time spent sending a small >>>> network message. >>>> > >>>> > 5. Test Results >>>> > >>>> > Some recent test results are: >>>> > >>>> > 5.1 Linux 4u664 and Windows 2k864 load test. >>>> > Duration: 70 hours. >>>> > Test date: 6/2/08 >>>> > Loads: usex -b48 on Linux; burn-in on Windows >>>> > Guest vcpus: 8 for Linux; 2 for Windows >>>> > Hardware: 8 physical cpu AMD >>>> > Clock drift : Linux: .0012% Windows: .009% >>>> > >>>> > 5.2 Linux 4u664, Linux 4u464, and Windows 2k864 >>>> >>>> >>no-load test >> >> >>>> > Duration: 23 hours. >>>> > Test date: 6/3/08 >>>> > Loads: none >>>> > Guest vcpus: 8 for each Linux; 2 for Windows >>>> > Hardware: 4 physical cpu AMD >>>> > Clock drift : Linux: .033% Windows: .019% >>>> > >>>> > 6.
Relation to recent work in xen-unstable >>>> > >>>> > There is a similarity between hvm_get_guest_time() in >>>> xen-unstable and >>>> > read_64_main_counter() >>>> > in this code. However, read_64_main_counter() is >>>> >>>> >>more tuned to >> >> >>>> the needs of >>>> > hpet.c. It has no >>>> > "set" operation, only the get. It isolates the mode, >>>> >>>> >>>physical or >>> >>> >>>> simulated, in >>>> > read_64_main_counter() >>>> > itself. It uses no vcpu or domain state as it is a physical >>>> entity, in either >>>> > mode. And it provides a real >>>> > physical mode for every read for those applications >>>> >>>> >>>that desire >>> >>> >>>> this. >>>> > >>>> > 7. Conclusion >>>> > >>>> > The virtual hpet is improved by this patch in terms >>>> >>>> >>>of accuracy and >>> >>> >>>> > monotonicity. >>>> > Tests performed to date verify this and more testing >>>> >>>> >>>is under way. >>> >>> >>>> > >>>> > 8. Future Work >>>> > >>>> > Testing with Windows Vista will be performed soon. >>>> >>>> >>The reason >> >> >>>> for accuracy >>>> > variations >>>> > on different platforms using the physical hpet >>>> >>>> >>device will be >> >> >>>> investigated. >>>> > Additional overhead measurements on simulated vs >>>> >>>> >>physical hpet >> >> >>>> mode will be >>>> > made. >>>> > >>>> > Footnotes: >>>> > >>>> > 1. I don't recall the accuracy improvement with end >>>> >>>> >>>of interrupt >>> >>> >>>> stamping, but >>>> > it was >>>> > significant, perhaps better than two to one improvement. It >>>> would be a very >>>> > simple matter >>>> > to re-measure the improvement as the facility can >>>> >>>> >>call back at >> >> >>>> injection time >>>> > as well. 
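The missed ticks policy of section 3.1, with the end-of-interrupt time stamp discussed in the footnote, can be sketched as a small gate. This is a simplified model for illustration only, not the code in hpet.c; the class and method names are hypothetical, and the counter values are plain integers standing in for hpet main counter reads.

```python
class MissedTicksGate:
    """Toy model of the missed ticks policy: never deliver two clock
    interrupts less than one period apart, measured from the point at
    which the guest finished handling the previous interrupt."""

    def __init__(self, period):
        self.period = period
        self.last_handled = None  # main counter value at end of last handler

    def should_deliver(self, main_counter):
        # The first interrupt is always delivered; later ones only when a
        # full period has passed since the last end-of-interrupt stamp.
        if self.last_handled is None:
            return True
        return main_counter - self.last_handled >= self.period

    def on_interrupt_finished(self, main_counter):
        # Stands in for the callback made when the guest re-enables
        # interrupts after handling the clock interrupt.
        self.last_handled = main_counter


gate = MissedTicksGate(period=100)
first = gate.should_deliver(0)        # True: nothing delivered yet
gate.on_interrupt_finished(30)        # guest handler finished at counter 30
too_soon = gate.should_deliver(100)   # False: only 70 counts since the stamp
on_time = gate.should_deliver(130)    # True: a full period has elapsed
```

Stamping at the end of handling rather than at injection means a handler delayed by lock contention pushes the next delivery later, so successive handlers should not see less than a period between them.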
>>>> > >>>> > >>>> > Signed-off-by: Dave Winchell >>>> > >>>> > Signed-off-by: Ben Guthro >>>> > >>>> > >>>> > >>>> > _______________________________________________ >>>> > Xen-devel mailing list >>>> > Xen-devel@lists.xensource.com >>>> > http://lists.xensource.com/xen-devel >>>> >>>> >>>> >>>> >>>> >>> >>> > > >_______________________________________________ >Xen-devel mailing list >Xen-devel@lists.xensource.com >http://lists.xensource.com/xen-devel > >