From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Winchell <dwinchell@virtualiron.com>
Subject: Re: [PATCH 0/2] Improve hpet accuracy
Date: Wed, 11 Jun 2008 09:58:00 -0400
Message-ID: <484FD9E8.9030307@virtualiron.com>
References: <20080610194420109.00000041832@djm-pc>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <20080610194420109.00000041832@djm-pc>
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "dan.magenheimer@oracle.com" <dan.magenheimer@oracle.com>
Cc: Dave Winchell <dwinchell@virtualiron.com>, xen-devel <xen-devel@lists.xensource.com>, Keir Fraser <keir.fraser@eu.citrix.com>, Ben Guthro <bguthro@virtualiron.com>
List-Id: xen-devel@lists.xenproject.org

Dan Magenheimer wrote:

>>In EL5u1-32 however it looks like the fractions are accounted
>>for.  Indeed the EL5u1-32 "lost tick handling" code resembles
>>the Linux/ia64 code which is what I've always assumed was
>>the "missed tick" model.  In this case, I think no policy
>>is necessary and the measured skew should be identical to
>>any physical hpet skew.  I'll have to test this hypothesis though.
>>    
>>
>
>I've tested this hypothesis and it seems to hold true.
>This means the existing (unpatched) hpet code works fine
>on EL5-32bit (vcpus=1) when hpet is the clocksource,
>even when the machine is overcommitted.  A second hypothesis
>still needs to be tested that Dave's patch will not make this worse.
>  
>
Interesting, thanks for pointing this out and confirming.

>(Note that per previous discussion, my EL5u1-32bit guest
>running on an Intel dual-core physical box chose tsc as
>the best clocksource and I had to override it with
>clock=hpet in the kernel command line.)
>  
>
Is there one setting for all Linux guests that makes them
choose hpet? Is it "clock=hpet clocksource=hpet"?
I know you wrote at length about this before.

>  
>
>>Yes, that makes sense and concurs with what I remember from
>>the EL4u5-32 code.  If this is true, one would expect the
>>default "no missed tick" policy to see time moving faster
>>than an external source -- the first missed tick delivered
>>after a long sleep would "catch up" and then the remainder
>>would each add another tick.
>>    
>>
>
>Indeed with the existing (unpatched) hpet code, time is
>running faster on EL4u5-32 (vcpus=1, when overcommitted).
>So Dave's patch is definitely needed here.
>  
>
It's good to get verification of this.
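
To make the arithmetic concrete, here is a minimal sketch of the effect
described above. It is only a model, not the EL4u5-32 code: guest_tick()
and burst_after_sleep() are made-up names, and the handler is assumed to
credit whole elapsed periods on a late interrupt, discard the fraction,
and still count a full tick when the interrupt arrives early.

```c
#include <stdint.h>

/* Hypothetical model of a guest tick handler (not taken from any kernel). */
struct guest_clock {
    uint64_t last;   /* counter value seen at the previous interrupt */
    uint64_t ticks;  /* ticks the guest has accounted so far */
};

/* Assumed behaviour: a late interrupt credits every whole period that
 * elapsed; an early interrupt is still counted as one full tick. */
static void guest_tick(struct guest_clock *g, uint64_t now, uint64_t period)
{
    uint64_t elapsed = (now - g->last) / period;

    g->ticks += elapsed > 0 ? elapsed : 1;
    g->last = now;   /* the fraction is discarded; a new period starts at "now" */
}

/* The guest was descheduled for `missed` periods, then the "no missed
 * tick" policy delivers all `missed` interrupts back to back.
 * Returns the number of ticks the guest ends up accounting. */
static uint64_t burst_after_sleep(uint64_t missed, uint64_t period)
{
    struct guest_clock g = { 0, 0 };
    uint64_t now = missed * period;

    for (uint64_t i = 0; i < missed; i++)
        guest_tick(&g, now, period);
    return g.ticks;
}
```

Under these assumptions the first interrupt after the sleep catches up
all N periods and each of the remaining N-1 adds one more, so N periods
of real time are accounted as 2N-1 ticks and guest time runs fast.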

thanks,
Dave

>Will try 64-bit next.
>
>Dan
>
>  
>
>>-----Original Message-----
>>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>>Sent: Monday, June 09, 2008 9:21 PM
>>To: 'Dave Winchell'; 'Keir Fraser'
>>Cc: 'xen-devel'; 'Ben Guthro'
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>
>>    
>>
>>>I'll tell  you what I recall about this. Tomorrow I'll check the
>>>guest code to verify. I think that Linux declares a full tick,
>>>even if the interrupt is early. That's the problem.
>>>      
>>>
>>Yes, that makes sense and concurs with what I remember from
>>the EL4u5-32 code.  If this is true, one would expect the
>>default "no missed tick" policy to see time moving faster
>>than an external source -- the first missed tick delivered
>>after a long sleep would "catch up" and then the remainder
>>would each add another tick.
>>
>>    
>>
>>>On the other hand, if the interrupt is late it in effect declares
>>>a tick plus fraction. If it just declared the fraction in 
>>>      
>>>
>>the first place,
>>    
>>
>>>we could deliver the interrupts whenever we wanted.
>>>      
>>>
>>My read of the EL4u5-32 code is that the fraction is discarded
>>and a new tick period commences at "now", so the fractions
>>eventually accumulate as lost time.
>>
>>In EL5u1-32 however it looks like the fractions are accounted
>>for.  Indeed the EL5u1-32 "lost tick handling" code resembles
>>the Linux/ia64 code which is what I've always assumed was
>>the "missed tick" model.  In this case, I think no policy
>>is necessary and the measured skew should be identical to
>>any physical hpet skew.  I'll have to test this hypothesis though.
>>
>>-----Original Message-----
>>From: xen-devel-bounces@lists.xensource.com 
>>[mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of 
>>Dave Winchell
>>Sent: Monday, June 09, 2008 5:35 PM
>>To: dan.magenheimer@oracle.com; Keir Fraser
>>Cc: Dave Winchell; xen-devel; Ben Guthro
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>
>>    
>>
>>>>The Linux policy is more subtle, but is required to go
>>>>from .1% to .03%.
>>>>        
>>>>
>>>Thanks for the good documentation which I hadn't thoroughly
>>>read until now.
>>>I now understand that the essence of your
>>>hpet missed ticks policy is to ensure that ticks are never
>>>delivered too close together.  But I'm trying to understand
>>>WHY your patch works, in other words, what problem it is
>>>countering.
>>>      
>>>
>>I'll tell  you what I recall about this. Tomorrow I'll check the
>>guest code to verify. I think that Linux declares a full tick,
>>even if the interrupt is early. That's the problem.
>>On the other hand, if the interrupt is late it in effect declares
>>a tick plus fraction. If it just declared the fraction in the 
>>first place,
>>we could deliver the interrupts whenever we wanted.
>>
>>It's really not that different than the missed ticks policy in vpt.c
>>except that there the period in vpt.c is based on start of interrupt
>>and I have improved that with end-of interrupt as described
>>in the patch note.
>>
>>I don't recall what prompted me to try end-of-interrupt,
>>but I saw a significant improvement. I may have been running
>>a monotonicity test at the same time to explain the lock
>>contention mentioned in the write-up.
>>
>>    
>>
>>> I care about this for more reasons than just
>>>because it is interesting: (1) I'd like to feel confident that
>>>it is fixing a bug rather than just a symptom of a bug;
>>>and (2) I wonder how universally it is applicable.
>>>      
>>>
>>It's worked well with my small set of guests. You and our
>>QA are going to tell us about the wider set. It doesn't
>>matter if guest A handles interrupts closely spaced or not,
>>just whether it handles them far apart. So it should be pretty
>>universal with guests that really handle missed ticks.
>>I think it's interesting that some 32bit Linux guests handle
>>missed ticks for hpet.
>>
>>    
>>
>>>I see from code examination in mark_offset_hpet() in
>>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
>>>the correction for lost ticks is just plain wrong in
>>>a virtual environment. (Suppose for example that a virtual
>>>tick was delivered every 1.999*hpet_tick... I think
>>>the clock would be off by 50%!)  Is this the bug that
>>>is being "countered" by your policy?
>>>      
>>>
>>I haven't looked at that code, perhaps.
>>I'll check it tomorrow.
>>
>>    
>>
>>>However, the lost tick handling in RHEL5u1/kernel/timer.c
>>>(which I think is used also for hpet) is much better
>>>so I am eager to find out if your policy works there
>>>too.
>>>If the hpet missed tick policy works for both, though,
>>>I should be happy, though I wonder about upstream kernels
>>>(e.g. the trend toward tickless).
>>>      
>>>
>>I wasn't aware of this trend. If it's robust, however, it should
>>handle late interrupts ...
>>
>>    
>>
>>>That said, I'd rather
>>>see this get into Xen 3.3 and worry about upstream kernels
>>>later :-)
>>>      
>>>
>>Regards,
>>Dave
>>
>>
>>
>>-----Original Message-----
>>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>>Sent: Mon 6/9/2008 6:02 PM
>>To: Dave Winchell; Keir Fraser
>>Cc: Ben Guthro; xen-devel
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>    
>>
>>>The Linux policy is more subtle, but is required to go
>>>from .1% to .03%.
>>>      
>>>
>>Thanks for the good documentation which I hadn't thoroughly
>>read until now.  I now understand that the essence of your
>>hpet missed ticks policy is to ensure that ticks are never
>>delivered too close together.  But I'm trying to understand
>>WHY your patch works, in other words, what problem it is
>>countering.  I care about this for more reasons than just
>>because it is interesting: (1) I'd like to feel confident that
>>it is fixing a bug rather than just a symptom of a bug;
>>and (2) I wonder how universally it is applicable.
>>
>>I see from code examination in mark_offset_hpet() in
>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
>>the correction for lost ticks is just plain wrong in
>>a virtual environment. (Suppose for example that a virtual
>>tick was delivered every 1.999*hpet_tick... I think
>>the clock would be off by 50%!)  Is this the bug that
>>is being "countered" by your policy?
>>
>>However, the lost tick handling in RHEL5u1/kernel/timer.c
>>(which I think is used also for hpet) is much better
>>so I am eager to find out if your policy works there
>>too.
>>
>>If the hpet missed tick policy works for both, though,
>>I should be happy, though I wonder about upstream kernels
>>(e.g. the trend toward tickless).  That said, I'd rather
>>see this get into Xen 3.3 and worry about upstream kernels
>>later :-)
>>
>>-----Original Message-----
>>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
>>Sent: Sunday, June 08, 2008 2:32 PM
>>To: dan.magenheimer@oracle.com; Keir Fraser
>>Cc: Ben Guthro; xen-devel; Dave Winchell
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>
>>Hi Dan,
>>
>>    
>>
>>>While I am fully supportive of offering hardware hpet as an option
>>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
>>>surprised by your preliminary results; the most obvious conclusion
>>>is that Xen system time is losing time at the rate of 1000 PPM
>>>though it's possible there's a bug somewhere else in the "time
>>>stack".  Your Windows result is jaw-dropping and inexplicable,
>>>though I have to admit ignorance of how Windows manages time.
>>>      
>>>
>>I think xen system time is fine. You have to add the interrupt
>>delivery policies described in the write-up for the patch to get
>>accurate timekeeping in the guest.
>>
>>The windows policy is obvious and results in a large improvement
>>in accuracy. The Linux policy is more subtle, but is required to go
>>from .1% to .03%.
>>
>>    
>>
>>>I think with my recent patch and hpet=1 (essentially the same as
>>>your emulated hpet), hvm guest time should track Xen system time.
>>>I wonder if domain0 (which if I understand correctly is directly
>>>using Xen system time) is also seeing an error of .1%?  Also
>>>I wonder for the skew you are seeing (in both hvm guests and
>>>domain0) is time moving too fast or too slow?
>>>      
>>>
>>I don't recall the direction. I can look it up in my notes at work
>>tomorrow.
>>
>>    
>>
>>>Although hwhpet=1 is a fine alternative in many cases, it may
>>>be unavailable on some systems and may cause significant performance
>>>issues on others.  So I think we will still need to track down
>>>the poor accuracy when hwhpet=0.
>>>      
>>>
>>Our patch is accurate to < .03% using the physical hpet mode or
>>the simulated mode.
>>
>>    
>>
>>>And if for some reason
>>>Xen system time can't be made accurate enough (< 0.05%), then
>>>I think we should consider building Xen system time itself on
>>>top of hardware hpet instead of TSC... at least when Xen discovers
>>>a capable hpet.
>>>      
>>>
>>In our experience, Xen system time is accurate enough now.
>>
>>    
>>
>>>One more thought... do you know the accuracy of the TSC crystals
>>>on your test systems?  I posted a patch awhile ago that was
>>>intended to test that, though I guess it was only testing skew
>>>of different TSCs on the same system, not TSCs against an
>>>external time source.
>>>      
>>>
>>I do not know the tsc accuracy.
>>
>>    
>>
>>>Or maybe there's a computation error somewhere in the hvm hpet
>>>scaling code?  Hmmm...
>>>      
>>>
>>Regards,
>>Dave
>>
>>
>>-----Original Message-----
>>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>>Sent: Fri 6/6/2008 4:29 PM
>>To: Dave Winchell; Keir Fraser
>>Cc: Ben Guthro; xen-devel
>>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>
>>Dave --
>>
>>Thanks much for posting the preliminary results!
>>
>>While I am fully supportive of offering hardware hpet as an option
>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
>>surprised by your preliminary results; the most obvious conclusion
>>is that Xen system time is losing time at the rate of 1000 PPM
>>though it's possible there's a bug somewhere else in the "time
>>stack".  Your Windows result is jaw-dropping and inexplicable,
>>though I have to admit ignorance of how Windows manages time.
>>
>>
>>I think with my recent patch and hpet=1 (essentially the same as
>>your emulated hpet), hvm guest time should track Xen system time.
>>I wonder if domain0 (which if I understand correctly is directly
>>using Xen system time) is also seeing an error of .1%?  Also
>>I wonder for the skew you are seeing (in both hvm guests and
>>domain0) is time moving too fast or too slow?
>>
>>Although hwhpet=1 is a fine alternative in many cases, it may
>>be unavailable on some systems and may cause significant performance
>>issues on others.  So I think we will still need to track down
>>the poor accuracy when hwhpet=0.  And if for some reason
>>Xen system time can't be made accurate enough (< 0.05%), then
>>I think we should consider building Xen system time itself on
>>top of hardware hpet instead of TSC... at least when Xen discovers
>>a capable hpet.
>>
>>One more thought... do you know the accuracy of the TSC crystals
>>on your test systems?  I posted a patch awhile ago that was
>>intended to test that, though I guess it was only testing skew
>>of different TSCs on the same system, not TSCs against an
>>external time source.
>>
>>Or maybe there's a computation error somewhere in the hvm hpet
>>scaling code?  Hmmm...
>>
>>Thanks,
>>Dan
>>
>>    
>>
>>>-----Original Message-----
>>>From: Dave Winchell [mailto:dwinchell@virtualiron.com]
>>>Sent: Friday, June 06, 2008 1:33 PM
>>>To: dan.magenheimer@oracle.com; Keir Fraser
>>>Cc: Ben Guthro; xen-devel; Dave Winchell
>>>Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>>
>>>
>>>Dan, Keir:
>>>
>>>Preliminary tests results indicate an error of .1% for Linux 64 bit
>>>guests configured
>>>for hpet with xen-unstable as is. As we have discussed many 
>>>      
>>>
>>times, the
>>    
>>
>>>ntp requirement is .05%.
>>>Tests on the patch we just submitted for hpet have 
>>>      
>>>
>>indicated errors of
>>    
>>
>>>.0012%
>>>on this platform under similar test conditions and .03% on
>>>other platforms.
>>>
>>>Windows vista64 has an error of 11% using hpet with the
>>>xen-unstable bits.
>>>In an overnight test with our hpet patch, the Windows vista
>>>error was .008%.
>>>
>>>The tests are with two or three guests on a physical node, all under
>>>load, and with
>>>the ratio of vcpus to phys cpus > 1.
>>>
>>>I will continue to run tests over the next few days.
>>>
>>>thanks,
>>>Dave
>>>
>>>
>>>Dan Magenheimer wrote:
>>>
>>>      
>>>
>>>>Hi Dave and Ben --
>>>>
>>>>When running tests on xen-unstable (without your patch),
>>>>        
>>>>
>>>please ensure
>>>      
>>>
>>>>that hpet=1 is set in the hvm config and also I think 
>>>>        
>>>>
>>that when hpet
>>    
>>
>>>>is the clocksource on RHEL4-32, the clock IS resilient to
>>>>        
>>>>
>>>missed ticks
>>>      
>>>
>>>>so timer_mode should be 2 (vs when pit is the clocksource
>>>>        
>>>>
>>>on RHEL4-32,
>>>      
>>>
>>>>all clock ticks must be delivered and so timer_mode should be 0).
>>>>
>>>>Per
>>>>
>>>>        
>>>>
>>>http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's
>>>      
>>>
>>>>my intent to clean this up, but I won't get to it until next week.
>>>>
>>>>Thanks,
>>>>Dan
>>>>
>>>>    -----Original Message-----
>>>>    *From:* xen-devel-bounces@lists.xensource.com
>>>>    [mailto:xen-devel-bounces@lists.xensource.com]*On
>>>>        
>>>>
>>>Behalf Of *Dave
>>>      
>>>
>>>>    Winchell
>>>>    *Sent:* Friday, June 06, 2008 4:46 AM
>>>>    *To:* Keir Fraser; Ben Guthro; xen-devel
>>>>    *Cc:* dan.magenheimer@oracle.com; Dave Winchell
>>>>    *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>>>
>>>>    Keir,
>>>>
>>>>    I think the changes are required. We'll run some tests
>>>>        
>>>>
>>>today so
>>>      
>>>
>>>>    that we have some data to talk about.
>>>>
>>>>    -Dave
>>>>
>>>>
>>>>    -----Original Message-----
>>>>    From: xen-devel-bounces@lists.xensource.com on behalf
>>>>        
>>>>
>>>of Keir Fraser
>>>      
>>>
>>>>    Sent: Fri 6/6/2008 4:58 AM
>>>>    To: Ben Guthro; xen-devel
>>>>    Cc: dan.magenheimer@oracle.com
>>>>    Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>>>>
>>>>    Are these patches needed now the timers are built on 
>>>>        
>>>>
>>Xen system
>>    
>>
>>>>    time rather
>>>>    than host TSC? Dan has reported much better
>>>>        
>>>>
>>>time-keeping with his
>>>      
>>>
>>>>    patch
>>>>    checked in, and it's for sure a lot less invasive than
>>>>        
>>>>
>>>this patchset.
>>>      
>>>
>>>>     -- Keir
>>>>
>>>>    On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
>>>>
>>>>    >
>>>>    > 1. Introduction
>>>>    >
>>>>    > This patch improves the hpet based guest clock in
>>>>        
>>>>
>>>terms of drift and
>>>      
>>>
>>>>    > monotonicity.
>>>>    > Prior to this work the drift with hpet was greater
>>>>        
>>>>
>>>than 2%, far
>>>      
>>>
>>>>    above the .05%
>>>>    > limit
>>>>    > for ntp to synchronize. With this code, the drift 
>>>>        
>>>>
>>ranges from
>>    
>>
>>>>    .001% to .0033%
>>>>    > depending
>>>>    > on guest and physical platform.
>>>>    >
>>>>    > Using hpet allows guest operating systems to 
>>>>        
>>>>
>>provide monotonic
>>    
>>
>>>>    time to their
>>>>    > applications. Time sources other than hpet are not
>>>>        
>>>>
>>>monotonic because
>>>      
>>>
>>>>    > of their reliance on tsc, which is not synchronized
>>>>        
>>>>
>>>across physical
>>>      
>>>
>>>>    > processors.
>>>>    >
>>>>    > Windows 2k864 and many Linux guests are supported with two
>>>>    policies, one for
>>>>    > guests
>>>>    > that handle missed clock interrupts and the other for guests
>>>>    that require the
>>>>    > correct number of interrupts.
>>>>    >
>>>>    > Guests may use hpet for the timing source even if 
>>>>        
>>>>
>>the physical
>>    
>>
>>>>    platform has no
>>>>    > visible
>>>>    > hpet. Migration is supported between physical machines which
>>>>    differ in
>>>>    > physical
>>>>    > hpet visibility.
>>>>    >
>>>>    > Most of the changes are in hpet.c. Two general 
>>>>        
>>>>
>>facilities are
>>    
>>
>>>>    added to track
>>>>    > interrupt
>>>>    > progress. The ideas here and the facilities would 
>>>>        
>>>>
>>be useful in
>>    
>>
>>>>    vpt.c, for
>>>>    > other time
>>>>    > sources, though no attempt is made here to improve vpt.c.
>>>>    >
>>>>    > The following sections discuss hpet dependencies, interrupt
>>>>    delivery policies,
>>>>    > live migration,
>>>>    > test results, and relation to recent work with 
>>>>        
>>>>
>>monotonic time.
>>    
>>
>>>>    >
>>>>    >
>>>>    > 2. Virtual Hpet dependencies
>>>>    >
>>>>    > The virtual hpet depends on the ability to read the
>>>>        
>>>>
>>>physical or
>>>      
>>>
>>>>    simulated
>>>>    > (see discussion below) hpet.  For timekeeping, the
>>>>        
>>>>
>>>virtual hpet
>>>      
>>>
>>>>    also depends
>>>>    > on two new interrupt notification facilities to 
>>>>        
>>>>
>>implement its
>>    
>>
>>>>    policies for
>>>>    > interrupt delivery.
>>>>    >
>>>>    > 2.1. Two modes of low-level hpet main counter reads.
>>>>    >
>>>>    > In this implementation, the virtual hpet reads with
>>>>    read_64_main_counter(),
>>>>    > exported by
>>>>    > time.c, either the real physical hpet main counter register
>>>>    directly or a
>>>>    > "simulated"
>>>>    > hpet main counter.
>>>>    >
>>>>    > The simulated mode uses a monotonic version of get_s_time()
>>>>    (NOW()), where the
>>>>    > last
>>>>    > time value is returned whenever the current time 
>>>>        
>>>>
>>value is less
>>    
>>
>>>>    than the last
>>>>    > time
>>>>    > value. In simulated mode, since it is layered on s_time, the
>>>>    underlying
>>>>    > hardware
>>>>    > can be hpet or some other device. The frequency of the main
>>>>    counter in
>>>>    > simulated
>>>>    > mode is the same as the standard physical hpet frequency,
>>>>    allowing live
>>>>    > migration
>>>>    > between nodes that are configured differently.
>>>>    >
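
A minimal sketch of the "return the last value if time appears to go
backwards" rule described above. This is an illustration, not the code
from the patch: monotonic_read() and last_value are hypothetical names,
raw_now stands in for the value get_s_time() would return, and the
locking a real multi-cpu implementation would need is left out.

```c
#include <stdint.h>

/* Last value handed out to any caller (hypothetical; a real
 * implementation would have to protect this against concurrent readers). */
static uint64_t last_value;

/* Return a non-decreasing time value. raw_now is assumed to be the
 * current system time as read by the caller. */
static uint64_t monotonic_read(uint64_t raw_now)
{
    if (raw_now < last_value)
        return last_value;   /* time appeared to step back: repeat last value */

    last_value = raw_now;
    return raw_now;
}
```

With this rule a reader that sees 100 and then a raw value of 90 gets
100 both times, so the counter never decreases but can repeat.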
>>>>    > If the physical platform does not have an hpet
>>>>        
>>>>
>>>device, or if xen
>>>      
>>>
>>>>    is configured
>>>>    > not
>>>>    > to use the device, then the simulated method is 
>>>>        
>>>>
>>used. If there
>>    
>>
>>>>    is a physical
>>>>    > hpet device,
>>>>    > and xen has initialized it, then either simulated 
>>>>        
>>>>
>>or physical
>>    
>>
>>>>    mode can be
>>>>    > used.
>>>>    > This is governed by a boot time option, hpet-avoid.
>>>>        
>>>>
>>>Setting this
>>>      
>>>
>>>>    option to 1
>>>>    > gives the
>>>>    > simulated mode and 0 the physical mode. The default
>>>>        
>>>>
>>>is physical
>>>      
>>>
>>>>    mode.
>>>>    >
>>>>    > A disadvantage of the physical mode is that it may 
>>>>        
>>>>
>>take longer to
>>    
>>
>>>>    read the device
>>>>    > than in simulated mode. On some platforms the cost is
>>>>        
>>>>
>>>about the
>>>      
>>>
>>>>    same (less
>>>>    > than 250 nsec) for
>>>>    > physical and simulated modes, while on others 
>>>>        
>>>>
>>physical cost is
>>    
>>
>>>>    much higher
>>>>    > than simulated.
>>>>    > A disadvantage of the simulated mode is that it can 
>>>>        
>>>>
>>return the
>>    
>>
>>>>    same value
>>>>    > for the counter in consecutive calls.
>>>>    >
>>>>    > 2.2. Interrupt notification facilities.
>>>>    >
>>>>    > Two interrupt notification facilities are introduced, one is
>>>>    > hvm_isa_irq_assert_cb()
>>>>    > and the other hvm_register_intr_en_notif().
>>>>    >
>>>>    > The vhpet uses hvm_isa_irq_assert_cb to deliver 
>>>>        
>>>>
>>interrupts to
>>    
>>
>>>>    the vioapic.
>>>>    > hvm_isa_irq_assert_cb allows a callback to be 
>>>>        
>>>>
>>passed along to
>>    
>>
>>>>    > vioapic_deliver()
>>>>    > and this callback is called with a mask of the vcpus
>>>>        
>>>>
>>>which will
>>>      
>>>
>>>>    get the
>>>>    > interrupt. This callback is made before any vcpus receive an
>>>>    interrupt.
>>>>    >
>>>>    > Vhpet uses hvm_register_intr_en_notif() to register 
>>>>        
>>>>
>>a handler
>>    
>>
>>>>    for a particular
>>>>    > vector that will be called when that vector is injected in
>>>>    > [vmx,svm]_intr_assist()
>>>>    > and also when the guest finishes handling the 
>>>>        
>>>>
>>interrupt. Here
>>    
>>
>>>>    finished is
>>>>    > defined
>>>>    > as the point when the guest re-enables interrupts or
>>>>        
>>>>
>>>lowers the
>>>      
>>>
>>>>    tpr value.
>>>>    > EOI is not used as the end of interrupt as this is sometimes
>>>>    returned before
>>>>    > the interrupt handler has done its work. A flag is
>>>>        
>>>>
>>>passed to the
>>>      
>>>
>>>>    handler
>>>>    > indicating
>>>>    > whether this is the injection point (post = 1) or the
>>>>        
>>>>
>>>interrupt
>>>      
>>>
>>>>    finished (post
>>>>    > = 0) point.
>>>>    > The need for the finished point callback is discussed in the
>>>>    missed ticks
>>>>    > policy section.
>>>>    >
>>>>    > To prevent a possible early trigger of the finished 
>>>>        
>>>>
>>callback,
>>    
>>
>>>>    intr_en_notif
>>>>    > logic
>>>>    > has a two stage arm, the first at injection
>>>>    (hvm_intr_en_notif_arm()) and the
>>>>    > second when
>>>>    > interrupts are seen to be disabled
>>>>        
>>>>
>>>(hvm_intr_en_notif_disarm()).
>>>      
>>>
>>>>    Once fully
>>>>    > armed, re-enabling
>>>>    > interrupts will cause hvm_intr_en_notif_disarm() to
>>>>        
>>>>
>>>make the end
>>>      
>>>
>>>>    of interrupt
>>>>    > callback. hvm_intr_en_notif_arm() and
>>>>        
>>>>
>>>hvm_intr_en_notif_disarm()
>>>      
>>>
>>>>    are called by
>>>>    > [vmx,svm]_intr_assist().
>>>>    >
>>>>    > 3. Interrupt delivery policies
>>>>    >
>>>>    > The existing hpet interrupt delivery is preserved.
>>>>        
>>>>
>>>This includes
>>>      
>>>
>>>>    > vcpu round robin delivery used by Linux and 
>>>>        
>>>>
>>broadcast delivery
>>    
>>
>>>>    used by
>>>>    > Windows.
>>>>    >
>>>>    > There are two policies for interrupt delivery, one 
>>>>        
>>>>
>>for Windows
>>    
>>
>>>>    2k8-64 and the
>>>>    > other
>>>>    > for Linux. The Linux policy takes advantage of the
>>>>        
>>>>
>>>(guest) Linux
>>>      
>>>
>>>>    missed tick
>>>>    > and offset
>>>>    > calculations and does not attempt to deliver the
>>>>        
>>>>
>>>right number of
>>>      
>>>
>>>>    interrupts.
>>>>    > The Windows policy delivers the correct number of 
>>>>        
>>>>
>>interrupts,
>>    
>>
>>>>    even if
>>>>    > sometimes much
>>>>    > closer to each other than the period. The policies 
>>>>        
>>>>
>>are similar
>>    
>>
>>>>    to those in
>>>>    > vpt.c, though
>>>>    > there are some important differences.
>>>>    >
>>>>    > Policies are selected with an HVMOP_set_param
>>>>        
>>>>
>>>hypercall with index
>>>      
>>>
>>>>    > HVM_PARAM_TIMER_MODE.
>>>>    > Two new values are added,
>>>>        
>>>>
>>>HVM_HPET_guest_computes_missed_ticks and
>>>      
>>>
>>>>    > HVM_HPET_guest_does_not_compute_missed_ticks.  The 
>>>>        
>>>>
>>reason that
>>    
>>
>>>>    two new ones
>>>>    > are added is that
>>>>    > in some guests (32bit Linux) a no-missed policy is 
>>>>        
>>>>
>>needed for
>>    
>>
>>>>    clock sources
>>>>    > other than hpet
>>>>    > and a missed ticks policy for hpet. It was felt that
>>>>        
>>>>
>>>there would
>>>      
>>>
>>>>    be less
>>>>    > confusion by simply
>>>>    > introducing the two hpet policies.
>>>>    >
>>>>    > 3.1. The missed ticks policy
>>>>    >
>>>>    > The Linux clock interrupt handler for hpet calculates missed
>>>>    ticks and offset
>>>>    > using the hpet
>>>>    > main counter. The algorithm works well when the 
>>>>        
>>>>
>>time since the
>>    
>>
>>>>    last interrupt
>>>>    > is greater than
>>>>    > or equal to a period and poorly otherwise.
>>>>    >
>>>>    > The missed ticks policy ensures that no two clock
>>>>        
>>>>
>>>interrupts are
>>>      
>>>
>>>>    delivered to
>>>>    > the guest at
>>>>    > a time interval less than a period. A time stamp (hpet main
>>>>    counter value) is
>>>>    > recorded (by a
>>>>    > callback registered with hvm_register_intr_en_notif)
>>>>        
>>>>
>>>when Linux
>>>      
>>>
>>>>    finishes
>>>>    > handling the clock
>>>>    > interrupt. Then, ensuing interrupts are delivered to
>>>>        
>>>>
>>>the vioapic
>>>      
>>>
>>>>    only if the
>>>>    > current main
>>>>    > counter value is a period greater than when the 
>>>>        
>>>>
>>last interrupt
>>    
>>
>>>>    was handled.
>>>>    >
>>>>    > Tests showed a significant improvement in clock 
>>>>        
>>>>
>>drift with end
>>    
>>
>>>>    of interrupt
>>>>    > time stamps
>>>>    > versus beginning of interrupt[1]. It is believed that
>>>>        
>>>>
>>>the reason
>>>      
>>>
>>>>    for the
>>>>    > improvement
>>>>    > is that the clock interrupt handler goes for a
>>>>        
>>>>
>>>spinlock and can
>>>      
>>>
>>>>    be therefore
>>>>    > delayed in its
>>>>    > processing. Furthermore, the main counter is read 
>>>>        
>>>>
>>by the guest
>>    
>>
>>>>    under the lock.
>>>>    > The net
>>>>    > effect is that if we time stamp injection, we can get the
>>>>    difference in time
>>>>    > between successive interrupt handler lock acquisitions to be
>>>>    less than the
>>>>    > period.
>>>>    >
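
The delivery rule in 3.1 can be sketched roughly as follows. This is a
simplified model under stated assumptions, not the hpet.c code: struct
tick_gate and both functions are hypothetical names, mc is the hpet main
counter value passed in by the caller, and "a period greater" is read as
"at least one period".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state for the missed ticks policy. */
struct tick_gate {
    uint64_t period;        /* timer period, in main counter units */
    uint64_t last_handled;  /* main counter when the guest finished the last tick */
    bool     have_stamp;    /* false until the first tick has been handled */
};

/* Assumed to be called from the end-of-interrupt notification, i.e. when
 * the guest re-enables interrupts after handling the clock interrupt. */
static void tick_gate_finished(struct tick_gate *g, uint64_t mc)
{
    g->last_handled = mc;
    g->have_stamp = true;
}

/* True if a clock interrupt may be delivered to the vioapic now:
 * at least one full period has passed since the last one was handled. */
static bool tick_gate_may_deliver(const struct tick_gate *g, uint64_t mc)
{
    if (!g->have_stamp)
        return true;
    return mc - g->last_handled >= g->period;
}
```

Stamping in tick_gate_finished() rather than at injection is the
end-of-interrupt choice described above; stamping at injection would be
the same code called from the post = 1 callback instead.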
>>>>    > 3.2. The no-missed ticks policy
>>>>    >
>>>>    > Windows 2k864 keeps very poor time with the missed
>>>>        
>>>>
>>>ticks policy.
>>>      
>>>
>>>>    So the
>>>>    > no-missed ticks policy
>>>>    > was developed. In the no-missed ticks policy we deliver the
>>>>    correct number of
>>>>    > interrupts,
>>>>    > even if they are spaced less than a period apart
>>>>        
>>>>
>>>(when catching up).
>>>      
>>>
>>>>    >
>>>>    > Windows 2k864 uses a broadcast mode in the interrupt routing
>>>>    such that
>>>>    > all vcpus get the clock interrupt. The best Windows drift
>>>>    performance was
>>>>    > achieved when the
>>>>    > policy code ensured that all the previous interrupts (on the
>>>>    various vcpus)
>>>>    > had been injected
>>>>    > before injecting the next interrupt to the vioapic.
>>>>    >
>>>>    > The policy code works as follows. It uses the
>>>>    hvm_isa_irq_assert_cb() to
>>>>    > record
>>>>    > the vcpus to be interrupted in 
>>>>        
>>>>
>>h->hpet.pending_mask. Then, in
>>    
>>
>>>>    the callback
>>>>    > registered
>>>>    > with hvm_register_intr_en_notif() at post=1 time it 
>>>>        
>>>>
>>clears the
>>    
>>
>>>>    current vcpu in
>>>>    > the pending_mask.
>>>>    > When the pending_mask is clear it decrements
>>>>    hpet.intr_pending_nr and if
>>>>    > intr_pending_nr is still
>>>>    > non-zero posts another interrupt to the ioapic with
>>>>    hvm_isa_irq_assert_cb().
>>>>    > Intr_pending_nr is incremented in
>>>>    hpet_route_decision_not_missed_ticks().
>>>>    >
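
A rough sketch of the pending_mask / intr_pending_nr bookkeeping just
described. It is a model only: struct nm_state and the nm_* functions
are hypothetical stand-ins (nm_assert() for hvm_isa_irq_assert_cb(),
nm_timer_fired() for the increment in
hpet_route_decision_not_missed_ticks(), nm_injected() for the post = 1
callback). The rule that a new tick is asserted immediately only when
nothing is in flight is an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical state for the no-missed ticks policy. */
struct nm_state {
    uint64_t pending_mask;     /* vcpus that have not yet been injected */
    unsigned intr_pending_nr;  /* ticks still owed to the guest */
    unsigned asserts;          /* times the line was asserted (for illustration) */
};

/* Stand-in for asserting the interrupt: record which vcpus will get it. */
static void nm_assert(struct nm_state *s, uint64_t vcpu_mask)
{
    s->pending_mask = vcpu_mask;
    s->asserts++;
}

/* A timer period elapsed: one more tick is owed. Assumption: assert
 * right away only if no earlier tick is still being injected. */
static void nm_timer_fired(struct nm_state *s, uint64_t vcpu_mask)
{
    if (s->intr_pending_nr++ == 0)
        nm_assert(s, vcpu_mask);
}

/* Injection callback for one vcpu (vcpu < 64 assumed). */
static void nm_injected(struct nm_state *s, unsigned vcpu, uint64_t vcpu_mask)
{
    if (s->intr_pending_nr == 0)
        return;                        /* nothing in flight */

    s->pending_mask &= ~(UINT64_C(1) << vcpu);
    if (s->pending_mask != 0)
        return;                        /* other vcpus still waiting */

    if (--s->intr_pending_nr != 0)
        nm_assert(s, vcpu_mask);       /* deliver the next owed tick */
}
```

In this model three owed ticks broadcast to two vcpus produce exactly
three assertions, each one posted only after both vcpus have been
injected with the previous one.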
>>>>    > The missed ticks policy intr_en_notif callback also uses the
>>>>    pending_mask
>>>>    > method. So even though
>>>>    > Linux does not broadcast its interrupts, the code 
>>>>        
>>>>
>>could handle
>>    
>>
>>>>    it if it did.
>>>>    > In this case the end of interrupt time stamp is 
>>>>        
>>>>
>>made when the
>>    
>>
>>>>    pending_mask is
>>>>    > clear.
>>>>    >
>>>>    > 4. Live Migration
>>>>    >
>>>>    > Live migration with hpet preserves the current offset of the
>>>>    guest clock with
>>>>    > respect
>>>>    > to ntp. This is accomplished by migrating all of 
>>>>        
>>>>
>>the state in
>>    
>>
>>>>    the h->hpet data
>>>>    > structure
>>>>    > in the usual way. The hp->mc_offset is recalculated on the
>>>>    receiving node so
>>>>    > that the
>>>>    > guest sees a continuous hpet main counter.
>>>>    >
>>>>    > Code has been added to xc_domain_save.c to send a 
>>>>        
>>>>
>>small message
>>    
>>
>>>>    after the
>>>>    > domain context is sent. The contents of the message is the
>>>>    physical tsc
>>>>    > timestamp, last_tsc,
>>>>    > read just before the message is sent. When the
>>>>        
>>>>
>>>last_tsc message
>>>      
>>>
>>>>    is received in
>>>>    > xc_domain_restore.c,
>>>>    > another physical tsc timestamp, cur_tsc, is read. The two
>>>>    timestamps are
>>>>    > loaded into the domain
>>>>    > structure as last_tsc_sender and first_tsc_receiver with
>>>>    hypercalls. Then
>>>>    > xc_domain_hvm_setcontext
>>>>    > is called so that hpet_load has access to these time stamps.
>>>>    Hpet_load uses
>>>>    > the timestamps
>>>>    > to account for the time spent saving and loading the domain
>>>>    context. With this
>>>>    > technique,
>>>>    > the only neglected time is the time spent sending a small
>>>>    network message.
>>>>    >
>>>>    > 5. Test Results
>>>>    >
>>>>    > Some recent test results are:
>>>>    >
>>>>    > 5.1 Linux 4u664 and Windows 2k864 load test.
>>>>    >       Duration: 70 hours.
>>>>    >       Test date: 6/2/08
>>>>    >       Loads: usex -b48 on Linux; burn-in on Windows
>>>>    >       Guest vcpus: 8 for Linux; 2 for Windows
>>>>    >       Hardware: 8 physical cpu AMD
>>>>    >       Clock drift : Linux: .0012% Windows: .009%
>>>>    >
>>>>    > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 
>>>>        
>>>>
>>no-load test
>>    
>>
>>>>    >       Duration: 23 hours.
>>>>    >       Test date: 6/3/08
>>>>    >       Loads: none
>>>>    >       Guest vcpus: 8 for each Linux; 2 for Windows
>>>>    >       Hardware: 4 physical cpu AMD
>>>>    >       Clock drift : Linux: .033% Windows: .019%
>>>>    >
>>>>    > 6. Relation to recent work in xen-unstable
>>>>    >
>>>>    > There is a similarity between hvm_get_guest_time() in
>>>>    xen-unstable and
>>>>    > read_64_main_counter()
>>>>    > in this code. However, read_64_main_counter() is 
>>>>        
>>>>
>>more tuned to
>>    
>>
>>>>    the needs of
>>>>    > hpet.c. It has no
>>>>    > "set" operation, only the get. It isolates the mode,
>>>>        
>>>>
>>>physical or
>>>      
>>>
>>>>    simulated, in
>>>>    > read_64_main_counter()
>>>>    > itself. It uses no vcpu or domain state as it is a physical
>>>>    entity, in either
>>>>    > mode. And it provides a real
>>>>    > physical mode for every read for those applications
>>>>        
>>>>
>>>that desire
>>>      
>>>
>>>>    this.
>>>>    >
>>>>    > 7. Conclusion
>>>>    >
>>>>    > The virtual hpet is improved by this patch in terms
>>>>        
>>>>
>>>of accuracy and
>>>      
>>>
>>>>    > monotonicity.
>>>>    > Tests performed to date verify this and more testing
>>>>        
>>>>
>>>is under way.
>>>      
>>>
>>>>    >
>>>>    > 8. Future Work
>>>>    >
>>>>    > Testing with Windows Vista will be performed soon. 
>>>>        
>>>>
>>The reason
>>    
>>
>>>>    for accuracy
>>>>    > variations
>>>>    > on different platforms using the physical hpet 
>>>>        
>>>>
>>device will be
>>    
>>
>>>>    investigated.
>>>>    > Additional overhead measurements on simulated vs 
>>>>        
>>>>
>>physical hpet
>>    
>>
>>>>    mode will be
>>>>    > made.
>>>>    >
>>>>    > Footnotes:
>>>>    >
>>>>    > 1. I don't recall the accuracy improvement with end
>>>>        
>>>>
>>>of interrupt
>>>      
>>>
>>>>    stamping, but
>>>>    > it was
>>>>    > significant, perhaps better than two to one improvement. It
>>>>    would be a very
>>>>    > simple matter
>>>>    > to re-measure the improvement as the facility can 
>>>>        
>>>>
>>call back at
>>    
>>
>>>>    injection time
>>>>    > as well.
>>>>    >
>>>>    >
>>>>    > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
>>>>    > <mailto:dwinchell@virtualiron.com>
>>>>    > Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
>>>>    > <mailto:bguthro@virtualiron.com>
>>>>    >
>>>>    >
>>>>    > _______________________________________________
>>>>    > Xen-devel mailing list
>>>>    > Xen-devel@lists.xensource.com
>>>>    > http://lists.xensource.com/xen-devel
>>>>
>>>>
>>>>
>>>>        
>>>>
>>>      
>>>