All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: [Patch]  time resolution fix.
@ 2006-03-17 16:11 Dong, Eddie
  2006-03-19 14:28 ` Keir Fraser
  0 siblings, 1 reply; 9+ messages in thread
From: Dong, Eddie @ 2006-03-17 16:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel Devel, Nakajima, Jun

[-- Attachment #1: Type: text/plain, Size: 2676 bytes --]

Keir:
	Before this patch, we saw 2 issues:
	One is when a VM switch happens at  HVM guest within PIT
Interrupt Service Routine (ISR) time. There exist problem if we let
guest see time jump in TSC (TSC is used to adjust PIT). Previously
hypervisor try to minimize the jump seen by guest, but the resolution is
one PIT period that is not enough. (TSC_OFFSET=0-pending_intr_nr
*period). This situation is worse in IA32E.
	The second issue is that when HVM/SMP is enabled, APIC time
calibration want to see a TSC duration of about 100000000 cycles so that
ACPI timer frequency can be calibrated with IRQ disabled. This is
un-achievable in previously code. Because at that time the guest IRQ is
disabled and no PIT IRQ injection, thus guest time is frozen. Due to
that, the guest can never see 100000000 cycles passed (TSC is frozen)
and thus stuck there.
	Another benefit of this is that we can get much accurate guest
calibration result that is previously a known issue on multiple VM case,
and it is long time too.
	
	I have a much detail description in the attached slide, hope
this helpful.
	BTW, due to SMP support and more time resource support (RTC and
ACPI), we are planning to do some design modification to sync all those
different kind of time. This patch is mainly for bug fix that exist for
long time and block SMP effort.
thx,eddie

Keir Fraser wrote:
> On 17 Mar 2006, at 14:39, Dong, Eddie wrote:
> 
>> This patch fix HVM/VMX time resolution issue that cause IA32E
>> complain "loss tick" occationally and APIC time calibration issue.
>> 
>> not tested on SVM for slight common code change.
> 
> This patch looks scary. Can you give more info about the problem and
> how you solve it? It looks like you end up forcibly sync'ing the
> guest's TSC rate to the PIT rate? Would that even be necessary if the
> PIT emulation were moved into Xen, where it ought to be?
> 
> On a slightly unrelated note, I think TSC rate management will start
> to get exciting when we have HVM save/restore. What will happen if a
> guest is restored on a machine with quite different TSC rate to the
> machine it originally ran on? I was wondering whether the current
> TSC_OFFSET feature that VMX supports might be extended to allow
> control over TSC clock rate as well. For example, provide 'base' and
> 'scale' values and apply following when guest executes RDTSC:
>   guest_tsc = (host_tsc - base) * scale + offset
> 
> How do you guys see this working?
> 
>   -- Keir
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


[-- Attachment #2: Guest time.ppt --]
[-- Type: application/vnd.ms-powerpoint, Size: 26624 bytes --]

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 9+ messages in thread
* RE: [Patch]  time resolution fix.
@ 2006-03-20 23:55 Dong, Eddie
  0 siblings, 0 replies; 9+ messages in thread
From: Dong, Eddie @ 2006-03-20 23:55 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel Devel, Nakajima, Jun

Keir:
	Thanks!

Keir Fraser wrote:
> What effect will this have? Are you suggesting to always run guest
> time at 'virtual time' rather than real wallclock time?
> 
Ooo, the new proposal is not focusing on this issue :-) 

The basic issue we saw is 
1: how to jump guest time
    For example, when a SMP system has 2 VPs, each VP APIC time 
(VP0-VP1) has a scheduled fire time say at 4ms, and 6ms time. 
And the platform time say PIT is scheduled at 8ms time. 
when VP0 is descheduled, while VP1 is switched in, then probably
we can't inject APIC time IRQ to VP1 even hypervisor undergo 
6ms+ time. Because injecting APIC time IRQ means VP1 saw guest 
time jumped to 6ms later and same on TSC (platform). Otherwise when 
VP0 is switched in, guest TSC time on VP0 is in 6ms+ later time, but
the ACPI timer ISR is still assuming it is in 4ms time. This kind of
lossing 
synchronization means VP0 see backward time that may cause various
corner case like we saw previously in PIT and TSC. 
	Combining per processor time IRQ with platform time IRQ, the 
situation will become much complicated.
    

2: How to deliver guest time IRQ effeciently.
     Same with above situation, if the VP with next scheduled timer
resource 
is deactive, all other VP may be unable to get time IRQ. That is unfair
and may
cause no way to catchup in some difficult case :-)
      Also if platform time IRQ is pinned on certain VP, that is much
worse :-(

3: Make platform time code object orientation.
    That means, no matter RTC, ACPI or PIT time, for each HVM, the
configuration
can choice eithe of them and xen will provide dynamic register APIs. In
this way we 
are no longer pinned on PIT.


We have something in mind, but not fully completed yet. 
For simplicity, we may assume a) An guest OS only use one of the 
platform time as its ticking resource.  b) platform time IRQ is not
pinned on
certain VP.

thx,eddie

^ permalink raw reply	[flat|nested] 9+ messages in thread
* RE: [Patch]  time resolution fix.
@ 2006-03-20 16:08 Dong, Eddie
  2006-03-20 16:48 ` Keir Fraser
  0 siblings, 1 reply; 9+ messages in thread
From: Dong, Eddie @ 2006-03-20 16:08 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel Devel, Nakajima, Jun

Keir:
	Yes, for future multiple platform time resource 
support (RTC, PIT, ACPI), I agree eventually they should be in HV. 

> The freezing is I guess why you have the new hook schedule_out(),
> which I'm also not madly keen on. Especially since this must surely
> be a short-term workaround (you don't intend to TSC freeze as
> long-term solution, right?).

It is true that I don't like to freeze the guest time at deschedule
time, but 
so far we have to stay here before we find better solution :-(

The reason is in legacy guest PIT ISR (interrupt service routine). The
ISR code 
reads TSC and computing the elapsed TSC (compare with saved old TSC)
from last PIT IRQ fire time. If xen doesn't present exactly expected
difference
in TSC, guest may get accumulated difference in PIT ISR and do some
fixup
that is a messy of guest jiffies and complain "lossed tick". Eventually
that 
will force guest to give up using TSC as time resource (roll back to
pure PIT).

With this patch, we get very accurate guest time in our local test:-)

Keir Fraser wrote:
> Actually, I now recall we were going to use this approach long term to
> ensure the guest calibrates TSC rate correctly during boot. But then
> we are going to turn it off the first time the guest reads wall-clock
> time (via RTC, for example). But that means we will need the
> schedule_out() hook long term, and that makes your patch less
> unattractive. I'll take another look and reconsider it.

Yes, this meets with what Ian and Asit talked in xensummit too. And it
can solve
 the TSC calibration issue as wall-clock (RTC) read is some time later
after 
TSC calibration. But it has problem in APIC time calibration side, as 
it is done very later in Linux (not sure for other OSes), it is even
later than init 
thread creation that is hard to determine in xen.

Freezing TSC has similar function with this suggestion. The difference
in freezing
TSC approach is that we need to assume the guest calibration is a
one-time task. 
Otherwise the guest may see time backward in runtime.

A better solution to remove this assumption is that we implement a 
mechanism like PIT IRQ output line that will discard accumulated IRQs
during
guest IRQ disable time. I.e. if guest IRQ is disabled,
pickup_deactive_ticks 
should ignore the elapsed ticks (only add one more pending IRQ). In this
way
the guest behavior will be exactly same with native. We should put this
in 
our TODO list :-) 

> In summary, I'm not sure about this patch. I feel that if I take it
> I'm encouraging 'onward and upward' development without spending the
> time to make sure fundamental abstractions like time are designed and
> implemented soundly.

Thanks! We have plan to come out a much complicate time virtualization
design
soon to support multiple platform time resources and SMP better. We saw
several
issues for SMP support in guest time forwarding. We will send the design
out 
as soon as possible and collect feedback from you and all others:-)

thx,eddie
	

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [Patch]  time resolution fix.
@ 2006-03-17 14:39 Dong, Eddie
  2006-03-17 15:37 ` Keir Fraser
  0 siblings, 1 reply; 9+ messages in thread
From: Dong, Eddie @ 2006-03-17 14:39 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 296 bytes --]

This patch fix HVM/VMX time resolution issue that cause IA32E complain
"loss tick" occationally and APIC time calibration issue.

not tested on SVM for slight common code change.
Eddie

Signed-off-by: Xiaowei Yang <xiaowei.yang@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

[-- Attachment #2: time6.patch --]
[-- Type: application/octet-stream, Size: 10226 bytes --]

diff -r 3983e4f1b054 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Sun Mar 12 09:03:33 2006
+++ b/xen/arch/x86/domain.c	Fri Mar 17 22:30:12 2006
@@ -791,6 +791,7 @@
 
     /* Allow at most one CPU at a time to be dirty. */
     ASSERT(cpus_weight(dirty_mask) <= 1);
+    schedule_out(prev);
     if ( unlikely(!cpu_isset(cpu, dirty_mask) && !cpus_empty(dirty_mask)) )
     {
         /* Other cpus call __sync_lazy_execstate from flush ipi handler. */
diff -r 3983e4f1b054 xen/arch/x86/hvm/intercept.c
--- a/xen/arch/x86/hvm/intercept.c	Sun Mar 12 09:03:33 2006
+++ b/xen/arch/x86/hvm/intercept.c	Fri Mar 17 22:30:12 2006
@@ -338,10 +338,10 @@
 
 static __inline__ void missed_ticks(struct hvm_virpit*vpit)
 {
-    int        missed_ticks;
+    int missed_ticks;
 
     missed_ticks = (NOW() - vpit->scheduled)/(s_time_t) vpit->period;
-    if ( missed_ticks > 0 ) {
+    if ( missed_ticks++ >= 0 ) {
         vpit->pending_intr_nr += missed_ticks;
         vpit->scheduled += missed_ticks * vpit->period;
     }
@@ -355,22 +355,16 @@
 
     /* pick up missed timer tick */
     missed_ticks(vpit);
-
-    vpit->pending_intr_nr++;
     if ( test_bit(_VCPUF_running, &v->vcpu_flags) ) {
-        vpit->scheduled += vpit->period;
         set_timer(&vpit->pit_timer, vpit->scheduled);
     }
 }
 
+/* pick up missed timer ticks at deactive time */
 void pickup_deactive_ticks(struct hvm_virpit *vpit)
 {
-
     if ( !active_timer(&(vpit->pit_timer)) ) {
-        /* pick up missed timer tick */
         missed_ticks(vpit);
-    
-        vpit->scheduled += vpit->period;
         set_timer(&vpit->pit_timer, vpit->scheduled);
     }
 }
diff -r 3983e4f1b054 xen/arch/x86/hvm/vmx/io.c
--- a/xen/arch/x86/hvm/vmx/io.c	Sun Mar 12 09:03:33 2006
+++ b/xen/arch/x86/hvm/vmx/io.c	Fri Mar 17 22:30:12 2006
@@ -40,20 +40,33 @@
 
 #define BSP_CPU(v)    (!(v->vcpu_id))
 
-void vmx_set_tsc_shift(struct vcpu *v, struct hvm_virpit *vpit)
-{
-    u64   drift;
-
-    if ( vpit->first_injected )
-        drift = vpit->period_cycles * vpit->pending_intr_nr;
-    else 
-        drift = 0;
-    vpit->shift = v->arch.hvm_vmx.tsc_offset - drift;
-    __vmwrite(TSC_OFFSET, vpit->shift);
-
+static inline 
+void __set_tsc_offset(u64  offset)
+{
+    __vmwrite(TSC_OFFSET, offset);
 #if defined (__i386__)
-    __vmwrite(TSC_OFFSET_HIGH, ((vpit->shift)>> 32));
+    __vmwrite(TSC_OFFSET_HIGH, offset >> 32);
 #endif
+}
+
+u64 get_guest_time(struct vcpu *v)
+{
+    struct hvm_virpit *vpit = &(v->domain->arch.hvm_domain.vpit);
+    u64    host_tsc;
+    
+    rdtscll(host_tsc);
+    return host_tsc + vpit->cache_tsc_offset;
+}
+
+void set_guest_time(struct vcpu *v, u64 gtime)
+{
+    struct hvm_virpit *vpit = &(v->domain->arch.hvm_domain.vpit);
+    u64    host_tsc;
+   
+    rdtscll(host_tsc);
+    
+    vpit->cache_tsc_offset = gtime - host_tsc;
+    __set_tsc_offset(vpit->cache_tsc_offset);
 }
 
 static inline void
@@ -64,6 +77,7 @@
     if ( is_pit_irq(v, vector, type) ) {
         if ( !vpit->first_injected ) {
             vpit->pending_intr_nr = 0;
+            vpit->last_pit_gtime = get_guest_time(v);
             vpit->scheduled = NOW() + vpit->period;
             set_timer(&vpit->pit_timer, vpit->scheduled);
             vpit->first_injected = 1;
@@ -71,7 +85,9 @@
             vpit->pending_intr_nr--;
         }
         vpit->inject_point = NOW();
-        vmx_set_tsc_shift (v, vpit);
+
+        vpit->last_pit_gtime += vpit->period;
+        set_guest_time(v, vpit->last_pit_gtime);
     }
 
     switch(type)
@@ -189,14 +205,15 @@
 
     vmx_stts();
 
+    /* pick up the elapsed PIT ticks and re-enable pit_timer */
+    if ( vpit->first_injected) {
+        set_guest_time(v, v->domain->arch.hvm_domain.guest_time);
+        pickup_deactive_ticks(vpit);
+    }
+
     if ( test_bit(iopacket_port(v), &d->shared_info->evtchn_pending[0]) ||
          test_bit(ARCH_HVM_IO_WAIT, &v->arch.hvm_vcpu.ioflags) )
         hvm_wait_io();
-
-    /* pick up the elapsed PIT ticks and re-enable pit_timer */
-    if ( vpit->first_injected )
-        pickup_deactive_ticks(vpit);
-    vmx_set_tsc_shift(v, vpit);
 
     /* We can't resume the guest if we're waiting on I/O */
     ASSERT(!test_bit(ARCH_HVM_IO_WAIT, &v->arch.hvm_vcpu.ioflags));
diff -r 3983e4f1b054 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Sun Mar 12 09:03:33 2006
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Fri Mar 17 22:30:12 2006
@@ -195,7 +195,6 @@
 /* Update CR3, GDT, LDT, TR */
     unsigned int  error = 0;
     unsigned long cr0, cr4;
-    u64     host_tsc;
 
     if (v->vcpu_id == 0)
         hvm_setup_platform(v->domain);
@@ -246,13 +245,12 @@
     __vmwrite(GUEST_CR3, pagetable_get_paddr(v->domain->arch.phys_table));
     __vmwrite(HOST_CR3, pagetable_get_paddr(v->arch.monitor_table));
 
+    v->arch.schedule_out = arch_vmx_do_switchout;
     v->arch.schedule_tail = arch_vmx_do_resume;
     v->arch.hvm_vmx.launch_cpu = smp_processor_id();
 
     /* init guest tsc to start from 0 */
-    rdtscll(host_tsc);
-    v->arch.hvm_vmx.tsc_offset = 0 - host_tsc;
-    vmx_set_tsc_shift(v, &v->domain->arch.hvm_domain.vpit);
+    set_guest_time(v, 0);
 }
 
 /*
@@ -571,6 +569,15 @@
     reset_stack_and_jump(vmx_asm_do_launch);
 }
 
+void arch_vmx_do_switchout(struct vcpu *v)
+{
+    struct hvm_virpit *vpit = &(v->domain->arch.hvm_domain.vpit);
+    
+    v->domain->arch.hvm_domain.guest_time = get_guest_time(v);
+    if ( vpit->first_injected )
+        stop_timer(&(vpit->pit_timer));
+}
+
 /*
  * Local variables:
  * mode: C
diff -r 3983e4f1b054 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Sun Mar 12 09:03:33 2006
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Fri Mar 17 22:30:12 2006
@@ -1675,7 +1675,7 @@
 
         rdtscll(msr_content);
         vpit = &(v->domain->arch.hvm_domain.vpit);
-        msr_content += vpit->shift;
+        msr_content += vpit->cache_tsc_offset;
         break;
     }
     case MSR_IA32_SYSENTER_CS:
@@ -1719,22 +1719,8 @@
 
     switch (regs->ecx) {
     case MSR_IA32_TIME_STAMP_COUNTER:
-    {
-        struct hvm_virpit *vpit;
-        u64 host_tsc, drift;
-
-        rdtscll(host_tsc);
-        vpit = &(v->domain->arch.hvm_domain.vpit);
-        drift = v->arch.hvm_vmx.tsc_offset - vpit->shift;
-        vpit->shift = msr_content - host_tsc;
-	v->arch.hvm_vmx.tsc_offset = vpit->shift + drift;
-        __vmwrite(TSC_OFFSET, vpit->shift);
-
-#if defined (__i386__)
-        __vmwrite(TSC_OFFSET_HIGH, ((vpit->shift)>>32));
-#endif
-        break;
-    }
+        set_guest_time(v, msr_content);
+        break;
     case MSR_IA32_SYSENTER_CS:
         __vmwrite(GUEST_SYSENTER_CS, msr_content);
         break;
diff -r 3983e4f1b054 xen/include/asm-x86/current.h
--- a/xen/include/asm-x86/current.h	Sun Mar 12 09:03:33 2006
+++ b/xen/include/asm-x86/current.h	Fri Mar 17 22:30:12 2006
@@ -51,6 +51,7 @@
         "mov %0,%%"__OP"sp; jmp "STR(__fn)      \
         : : "r" (guest_cpu_user_regs()) : "memory" )
 
+#define schedule_out(_ed) do { if ((_ed)->arch.schedule_out) ((_ed)->arch.schedule_out)(_ed); } while(0)
 #define schedule_tail(_ed) (((_ed)->arch.schedule_tail)(_ed))
 
 #endif /* __X86_CURRENT_H__ */
diff -r 3983e4f1b054 xen/include/asm-x86/domain.h
--- a/xen/include/asm-x86/domain.h	Sun Mar 12 09:03:33 2006
+++ b/xen/include/asm-x86/domain.h	Fri Mar 17 22:30:12 2006
@@ -122,6 +122,7 @@
 
     unsigned long      flags; /* TF_ */
 
+    void (*schedule_out) (struct vcpu *);
     void (*schedule_tail) (struct vcpu *);
 
     /* Bounce information for propagating an exception to guest OS. */
diff -r 3983e4f1b054 xen/include/asm-x86/hvm/domain.h
--- a/xen/include/asm-x86/hvm/domain.h	Sun Mar 12 09:03:33 2006
+++ b/xen/include/asm-x86/hvm/domain.h	Fri Mar 17 22:30:12 2006
@@ -37,6 +37,7 @@
     unsigned int           pae_enabled;
 
     struct hvm_virpit      vpit;
+    u64                    guest_time;
     struct hvm_virpic      vpic;
     struct hvm_vioapic     vioapic;
     struct hvm_io_handler  io_handler;
diff -r 3983e4f1b054 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Sun Mar 12 09:03:33 2006
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Fri Mar 17 22:30:12 2006
@@ -79,7 +79,6 @@
     unsigned long           cpu_based_exec_control;
     struct vmx_msr_state    msr_content;
     void                    *io_bitmap_a, *io_bitmap_b;
-    u64                     tsc_offset;
     struct timer            hlt_timer;  /* hlt ins emulation wakeup timer */
 };
 
diff -r 3983e4f1b054 xen/include/asm-x86/hvm/vmx/vmx.h
--- a/xen/include/asm-x86/hvm/vmx/vmx.h	Sun Mar 12 09:03:33 2006
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h	Fri Mar 17 22:30:12 2006
@@ -30,10 +30,12 @@
 extern void vmx_asm_do_resume(void);
 extern void vmx_asm_do_launch(void);
 extern void vmx_intr_assist(void);
-extern void vmx_set_tsc_shift(struct vcpu *, struct hvm_virpit *);
 extern void vmx_migrate_timers(struct vcpu *v);
 extern void arch_vmx_do_launch(struct vcpu *);
 extern void arch_vmx_do_resume(struct vcpu *);
+extern void arch_vmx_do_switchout(struct vcpu *);
+extern void set_guest_time(struct vcpu *v, u64 gtime);
+extern u64  get_guest_time(struct vcpu *v);
 
 extern unsigned int cpu_rev;
 
diff -r 3983e4f1b054 xen/include/asm-x86/hvm/vpit.h
--- a/xen/include/asm-x86/hvm/vpit.h	Sun Mar 12 09:03:33 2006
+++ b/xen/include/asm-x86/hvm/vpit.h	Fri Mar 17 22:30:12 2006
@@ -38,7 +38,6 @@
 struct hvm_virpit {
     /* for simulation of counter 0 in mode 2 */
     u64 period_cycles;          /* pit frequency in cpu cycles */
-    u64 shift;                  /* save the value of offset - drift */
     s_time_t inject_point;      /* the time inject virt intr */
     s_time_t scheduled;         /* scheduled timer interrupt */
     struct timer pit_timer;     /* periodic timer for mode 2*/
@@ -46,6 +45,8 @@
     unsigned int pending_intr_nr; /* the couner for pending timer interrupts */
     u32 period;                 /* pit frequency in ns */
     int first_injected;         /* flag to prevent shadow window */
+    s64 cache_tsc_offset;       /* cache of VMCS TSC_OFFSET offset */
+    u64 last_pit_gtime;         /* guest time when last pit is injected */
 
     /* virtual PIT state for handle related I/O */
     int read_state;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-03-20 23:55 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-17 16:11 [Patch] time resolution fix Dong, Eddie
2006-03-19 14:28 ` Keir Fraser
2006-03-19 14:58   ` Keir Fraser
2006-03-20 15:55   ` Ronald G Minnich
  -- strict thread matches above, loose matches on Subject: below --
2006-03-20 23:55 Dong, Eddie
2006-03-20 16:08 Dong, Eddie
2006-03-20 16:48 ` Keir Fraser
2006-03-17 14:39 Dong, Eddie
2006-03-17 15:37 ` Keir Fraser

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.