From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Winchell <dwinchell@virtualiron.com>
Subject: Re: [PATCH] Fix hvm guest time to be more accurate
Date: Thu, 25 Oct 2007 10:45:06 -0400
Message-ID: <4720ABF2.3080505@virtualiron.com>
References: <471FB5FA.6060507@virtualiron.com>
	<10EA09EFD8728347A513008B6B0DA77A024823DF@pdsmsx411.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <10EA09EFD8728347A513008B6B0DA77A024823DF@pdsmsx411.ccr.corp.intel.com>
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Dong, Eddie" <eddie.dong@intel.com>
Cc: xen-devel <xen-devel@lists.xensource.com>, Ben Guthro <bguthro@virtualiron.com>
List-Id: xen-devel@lists.xenproject.org

Hi Dong,

Thanks for these comments.

Dong, Eddie wrote:

>>The vpt timer code in effect accumulates missed ticks
>>when a guest is running but has interrupts disabled
>>or when the platform timer is starved. For guests
>>    
>>
>
>This case, VMM will pick up the lost ticks into pending_intr_nr.
>The only issue is that if a guest is suspended or save/restored
>for long time such as several hours or days, we may see tons 
>of lost ticks, which is difficult to be injected back (cost minutes
>of times or even longer). So we give up those amount of 
>pending_intr_nr.  In all above case, guest need to re-sync its
>timer with others like network time for example. So it is 
>harmless.
>
>Similar situation happens when somebody is debugging a guest.
>  
>
The solution we provided removes the one second limit on missed ticks.
Our testing showed that this limit is often exceeded under some loads,
such as many guests each running a load. Setting missed ticks to 1 tick
when 1000 is exceeded is a source of timing error. In the code, where
it is set to one, there is a "TBD sync with guest" comment, but no
action.
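
To illustrate why the limit costs guest time, here is a minimal sketch
of the clamping policy as described above. It is not the xen code: the
macro and function names are hypothetical, and the 1000-tick threshold
is taken from the description in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; the mail describes a limit of 1000 missed ticks. */
#define MISSED_TICKS_LIMIT 1000u

/* Policy described above: once the backlog exceeds the limit,
 * collapse it to a single tick. */
static uint32_t clamp_missed_ticks(uint32_t missed)
{
    return (missed > MISSED_TICKS_LIMIT) ? 1u : missed;
}

/* Ticks of guest time that are dropped (never injected) by the clamp. */
static uint32_t ticks_lost_by_clamp(uint32_t missed)
{
    uint32_t kept = clamp_missed_ticks(missed);

    assert(kept <= missed);   /* the clamp never adds ticks */
    return missed - kept;
}
```

With a backlog at or below the limit nothing is lost; one tick over the
limit, all but one tick of the backlog is dropped, and nothing later
puts that time back.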

In terms of re-syncing with network time, our goal was to have the
timekeeping accurate enough so that the guest could run ntpd.
To do that, the underlying timekeeping needs to be accurate to 0.05%
or so. Our measurements show that with this patch the core timekeeping
is accurate to approximately 0.02%, even under loads where many guests
run loads. Without this patch, timekeeping is off by more than 10% and
ntpd cannot sync it.

>  
>
>>like 64 bit Linux which calculates missed ticks on each
>>clock interrupt based on the current tsc and the tsc
>>of the last interrupt and then adds missed ticks to jiffies
>>there is redundant accounting.
>>
>>This change subtracts off the hypervisor calculated missed
>>ticks while guest running for 64 bit guests using the pit.
>>Missed ticks when vcpu 0 is descheduled are unaffected.
>>
>>    
>>
>I think this one is not the right direction.
>
>The problem in time virtualization is that we don't how guest will use
>it.
>Latest 64 bit Linux can pick up the missed ticks from TSC like you
>mentioned, but it is not true for other 64 bits guest even linux 
>such as 2.6.16, nor for Windows.
>  
>
Ours is a specific solution.
Let me explain our logic.

We configure all our Linux guests with clock=pit.

The 32bit Linux guests we run don't calculate missed ticks and so
don't need cancellation. All the 64bit Linux guests that we run
calculate missed ticks and need cancellation.
I just checked 2.6.16 and it does calculate missed ticks in
arch/x86_64/kernel/time.c, main_timer_handler(), when using pit for
timekeeping.

The missed ticks cancellation code is activated in this patch when the
guest has configured the pit for timekeeping and the guest has four
level page tables (i.e. 64 bit).

The Windows guests we run use rtc for timekeeping and don't need
or get cancellation.

So the simplifying assumption here is that a 64bit guest using pit is
calculating missed ticks.
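
The activation rule can be sketched as a simple predicate. This is only
an illustration of the assumption stated above; the types, field names
and function name are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a guest; not a xen structure. */
enum guest_clock { GUEST_CLOCK_PIT, GUEST_CLOCK_RTC, GUEST_CLOCK_OTHER };

struct guest_info {
    enum guest_clock clock;  /* periodic timer the guest uses for timekeeping */
    int paging_levels;       /* 4 means four level page tables, i.e. 64 bit */
};

/* Cancel missed ticks only for a 64 bit guest that keeps time with the pit. */
static bool needs_missed_tick_cancellation(const struct guest_info *g)
{
    assert(g != NULL);
    return g->clock == GUEST_CLOCK_PIT && g->paging_levels == 4;
}
```

Under this rule a 32bit Linux guest on the pit and a Windows guest on
the rtc both get no cancellation, which matches the cases listed above.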

I would be in favor of a method where xen is told directly whether to do
missed ticks cancellation. Perhaps it could be part of the guest
configuration information.

>Besides PV timer approach which is not always ready, basically
>we have 3 HVM time virtualization approaches:
>
>1: Current one:
>	Freeze guest time when the guest is descheduled and
>thus sync all guest time resource together. This one
>precisely solve the guest time cross-reference issues, guest TSC
>precisely represent guest time and thus can be cross-referenced
> in guest to pick up lossed ticks if have. but the logic 
>is relatively complicated and is easy to see bugs :-(
>
>
>2: Pin guest time to host time.
>	This is simplest approach, guest TSC is always pinned to
>host TSC with a fixed offset no matter the vCPU is descheduled or
>not. In this case, other guest periodic IRQ driven time resource 
>are not synced to guest TSC.
>	Base on this, we have 2 deviations:
>	A: Accumulate pending_intr_nr like current #1 approach.
>	B: Give up accumulated pending_intr_nr. We only inject
>one IRQ for a periodic IRQ driven guest time such as PIT.
>
>	What you mentioned here is a special case of 2B.
>
>	Since we don't know how guest behaviors, what we are
>proposing recently is to implement all of above, and let administrate
>tools to choose the one to use base on knowledge of guest OS
>type. 
>
>thanks, eddie
>  
>
I agree with you on having various policies for timekeeping based on
the guest being run.

This patch addresses specifically the problem of pit users who
calculate missed ticks. Note that in the solution, de-scheduled missed
ticks are not canceled; they are still needed as the tsc is continuous
in the current methods. We are only canceling those pending_intr_nr
that accumulate while the guest is running. These are due to
inaccuracies in the xen timer expirations caused by interrupt loads or
long dom0 interrupt disable periods. They are also due to extended
periods where the guest has interrupts disabled. In these cases, as the
tsc has been running, the guest will calculate missed ticks at the time
of the first clock interrupt injection, and then xen will deliver
pending_intr_nr additional interrupts, resulting in jiffies moving by
2*pending_intr_nr instead of the desired pending_intr_nr.
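
The double accounting can be shown with a simplified model. This is a
sketch under the assumptions in the paragraph above, not guest or xen
code: the function name and the cancel flag are hypothetical, and the
guest is assumed to recover exactly pending_intr_nr ticks from the tsc.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Jiffies advance seen by a guest that computes missed ticks from the tsc.
 *   pending: ticks accumulated while the guest was running (pending_intr_nr)
 *   cancel:  nonzero if the hypervisor drops that backlog before injecting
 */
static uint64_t guest_jiffies_advance(uint32_t pending, int cancel)
{
    /* At the first injected interrupt the guest reads the tsc and
     * accounts for the elapsed ticks itself. */
    uint64_t from_tsc = pending;

    /* Backlog interrupts still delivered afterwards, one jiffy each. */
    uint64_t from_backlog = cancel ? 0 : pending;

    return from_tsc + from_backlog;
}
```

For a backlog of 5 ticks the model gives 10 jiffies without
cancellation and the desired 5 with it.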

regards,
Dave