Re: [RFC] fix VMX TSC synchronicity

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC] fix VMX TSC synchronicity
@ 2008-01-12  3:11 Will Trives
  2008-01-12 12:28 ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Will Trives @ 2008-01-12  3:11 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Hello,

Just thought i'd mention that this patch caused disk corruption in my
virtual machines, particularly when using scsi emulation.

I got few errors like this as well:

lsi_scsi: error: ORDERED queue not implemented
lsi_scsi: error: IO with unknown tag 65577

Regards,

Will


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-12  3:11 [RFC] fix VMX TSC synchronicity Will Trives
@ 2008-01-12 12:28 ` Marcelo Tosatti
  2008-01-12 13:48   ` Will Trives
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2008-01-12 12:28 UTC (permalink / raw)
  To: Will Trives; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sat, Jan 12, 2008 at 02:11:30PM +1100, Will Trives wrote:
> Hello,
> 
> Just thought i'd mention that this patch caused disk corruption in my
> virtual machines, particularly when using scsi emulation.
> 
> I got few errors like this as well:
> 
> lsi_scsi: error: ORDERED queue not implemented
> lsi_scsi: error: IO with unknown tag 65577

Hi Will,

Can you please be more descriptive of your environment? What guest OS  ?
How many CPU's do you have assigned to it                              ?

Are you sure you're talking about the "VMX TSC" patch in the Subject:
line?

Thanks


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-12 12:28 ` Marcelo Tosatti
@ 2008-01-12 13:48   ` Will Trives
  2008-01-12 20:51     ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Will Trives @ 2008-01-12 13:48 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Hello,

64 bit environment Intel Q6600 processor.

I did hand apply the patch so I might have got something wrong.

Guest was a 32 bit kubuntu 7.10.

Basically I just mounted a few disk images with " -drive
file=/whatever,if=scsi " and used gparted to copy a partition from one
to the other.

It's probably something I have done wrong, i'll test it again in a few
more days.

Regards,

Will

On Sat, 2008-01-12 at 10:28 -0200, Marcelo Tosatti wrote:
> On Sat, Jan 12, 2008 at 02:11:30PM +1100, Will Trives wrote:
> > Hello,
> > 
> > Just thought i'd mention that this patch caused disk corruption in my
> > virtual machines, particularly when using scsi emulation.
> > 
> > I got few errors like this as well:
> > 
> > lsi_scsi: error: ORDERED queue not implemented
> > lsi_scsi: error: IO with unknown tag 65577
> 
> Hi Will,
> 
> Can you please be more descriptive of your environment? What guest OS  ?
> How many CPU's do you have assigned to it                              ?
> 
> Are you sure you're talking about the "VMX TSC" patch in the Subject:
> line?
> 
> Thanks


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-12 13:48   ` Will Trives
@ 2008-01-12 20:51     ` Avi Kivity
       [not found]       ` <4789284E.7000907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-12 20:51 UTC (permalink / raw)
  To: Will Trives; +Cc: Marcelo Tosatti, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Will Trives wrote:
> Hello,
>
> 64 bit environment Intel Q6600 processor.
>
> I did hand apply the patch so I might have got something wrong.
>
> Guest was a 32 bit kubuntu 7.10.
>
> Basically I just mounted a few disk images with " -drive
> file=/whatever,if=scsi " and used gparted to copy a partition from one
> to the other.
>
> It's probably something I have done wrong, i'll test it again in a few
> more days.
>
>   

I've trashed more than one image by running multiple guests with the 
same image, can this be the case here?

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <4789284E.7000907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [PATCH] kvm: qemu: remove SIGUSR2 from io_sigset
       [not found]       ` <4789284E.7000907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-19 18:58         ` Will Trives
  2008-01-20  9:34           ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Will Trives @ 2008-01-19 18:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Hello,

This patch :

kvm: qemu: remove SIGUSR2 from io_sigset

Applied 15th Jan 2008 looks to kill I/O performance for me. As soon as I
revert it my VMs are much faster. I don't really use SMP for my guests,
perhaps this patch made uniprocessor VMs slower ?


Regards,

Will Trives


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] kvm: qemu: remove SIGUSR2 from io_sigset
  2008-01-19 18:58         ` [PATCH] kvm: qemu: remove SIGUSR2 from io_sigset Will Trives
@ 2008-01-20  9:34           ` Avi Kivity
       [not found]             ` <4793158A.2080509-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-20  9:34 UTC (permalink / raw)
  To: Will Trives; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Will Trives wrote:
> Hello,
>
> This patch :
>
> kvm: qemu: remove SIGUSR2 from io_sigset
>
> Applied 15th Jan 2008 looks to kill I/O performance for me. As soon as I
> revert it my VMs are much faster. I don't really use SMP for my guests,
> perhaps this patch made uniprocessor VMs slower ?
>
>
>   

How are you measuring performance?

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <4793158A.2080509-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [PATCH] kvm: qemu: remove SIGUSR2 from io_sigset
       [not found]             ` <4793158A.2080509-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-20 10:45               ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2008-01-20 10:45 UTC (permalink / raw)
  To: Will Trives; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Avi Kivity wrote:
> Will Trives wrote:
>> Hello,
>>
>> This patch :
>>
>> kvm: qemu: remove SIGUSR2 from io_sigset
>>
>> Applied 15th Jan 2008 looks to kill I/O performance for me. As soon as I
>> revert it my VMs are much faster. I don't really use SMP for my guests,
>> perhaps this patch made uniprocessor VMs slower ?
>>
>>
>>   
>
> How are you measuring performance?
>

Never mind, fio shows it.

Will investigate.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC] fix VMX TSC synchronicity
@ 2008-01-11 20:49 Marcelo Tosatti
  2008-01-13 12:19 ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2008-01-11 20:49 UTC (permalink / raw)
  To: kvm-devel


The boot TSC sync check is failing on recent Linux SMP guests on TSC
stable hosts.

Following patch attempts to address the problems, which are:

1) APIC_DM_STARTUP, which is only called for vcpu's other than vcpu0,
will trigger ->vcpu_reset(), setting the TSC to 0. Fix that by moving
the guest_write_tsc(0) to vcpu_load.

2) vcpu's are initialized at different times by QEMU (vcpu0 init happens
way earlier than the rest). Fix that by initializing the TSC of vcpu's >
0 with reference to vcpu0 init tsc value. This way TSC synchronization
is kept (apparently Xen does something similar).

3) The code which adjusts the TSC of a VCPU on physical CPU switch is
meant to guarantee that the guest sees a monotonically increasing value.
However there is a large gap, in terms of clocks, between the time which
the source CPU TSC is read and the time the destination CPU TSC is read.
So move those two reads to happen at vcpu_clear.

I believe that 3) is the reason for the -no-kvm-irqchip problems
reported by Avi on RHEL5, with kernels < 2.6.21 (where vcpu->cpu pinning
would fix the problem). Unfortunately I could reproduce that problem.

4-way guest with constant tick at 250Hz now reliably sees the TSC's
synchronized, and idle guest CPU consumption is reduced by 50% (3-4%
instead of 8%, the latter with acpi_pm_good boot parameter).


diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4741806..e1c8cf4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -47,6 +47,8 @@ struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	int                   launched;
 	u8                    fail;
+	u64		      first_tsc;
+	u64		      tsc_this;
 	u32                   idt_vectoring_info;
 	struct kvm_msr_entry *guest_msrs;
 	struct kvm_msr_entry *host_msrs;
@@ -254,6 +256,7 @@ static void vcpu_clear(struct vcpu_vmx *vmx)
 	if (vmx->vcpu.cpu == -1)
 		return;
 	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
+	rdtscll(vmx->tsc_this);
 	vmx->launched = 0;
 }
 
@@ -480,6 +483,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
 	reload_host_efer(vmx);
 }
 
+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -488,7 +493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 phys_addr = __pa(vmx->vmcs);
-	u64 tsc_this, delta;
+	u64 delta;
 
 	if (vcpu->cpu != cpu) {
 		vcpu_clear(vmx);
@@ -511,6 +516,19 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		struct descriptor_table dt;
 		unsigned long sysenter_esp;
 
+		if (unlikely(vcpu->cpu == -1)) {
+			rdtscll(vcpu->arch.host_tsc);
+			vmx->tsc_this = vcpu->arch.host_tsc;
+			if (vcpu->vcpu_id == 0) {
+				guest_write_tsc(vcpu->arch.host_tsc, 0);
+				vmx->first_tsc = vcpu->arch.host_tsc;
+			} else {
+				struct vcpu_vmx *cpu0;
+				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
+				guest_write_tsc(cpu0->first_tsc, 0);
+			}
+		}
+
 		vcpu->cpu = cpu;
 		/*
 		 * Linux uses per-cpu TSS and GDT, so set these when switching
@@ -526,8 +544,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		/*
 		 * Make sure the time stamp counter is monotonous.
 		 */
-		rdtscll(tsc_this);
-		delta = vcpu->arch.host_tsc - tsc_this;
+		delta = vcpu->arch.host_tsc - vmx->tsc_this;
 		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
 	}
 }
@@ -690,11 +707,8 @@ static u64 guest_read_tsc(void)
  * writes 'guest_tsc' into guest's timestamp counter "register"
  * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
  */
-static void guest_write_tsc(u64 guest_tsc)
+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc)
 {
-	u64 host_tsc;
-
-	rdtscll(host_tsc);
 	vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
 }
 
@@ -758,6 +772,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_msr_entry *msr;
+	u64 host_tsc;
 	int ret = 0;
 
 	switch (msr_index) {
@@ -786,7 +801,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 		vmcs_writel(GUEST_SYSENTER_ESP, data);
 		break;
 	case MSR_IA32_TIME_STAMP_COUNTER:
-		guest_write_tsc(data);
+		rdtscll(host_tsc);
+		guest_write_tsc(host_tsc, data);
 		break;
 	default:
 		msr = find_msr_entry(vmx, msr_index);
@@ -1684,7 +1700,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
 
-	guest_write_tsc(0);
 
 	/* Special registers */
 	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-11 20:49 [RFC] fix VMX TSC synchronicity Marcelo Tosatti
@ 2008-01-13 12:19 ` Avi Kivity
       [not found]   ` <478A01D1.7000402-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-13 12:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> The boot TSC sync check is failing on recent Linux SMP guests on TSC
> stable hosts.
>
>   

What about tsc unstable hosts?  If your patch convinces the guest its 
tsc is table, while the host tsc is not, then it may cause confusion 
later on.

> Following patch attempts to address the problems, which are:
>
> 1) APIC_DM_STARTUP, which is only called for vcpu's other than vcpu0,
> will trigger ->vcpu_reset(), setting the TSC to 0. Fix that by moving
> the guest_write_tsc(0) to vcpu_load.
>
> 2) vcpu's are initialized at different times by QEMU (vcpu0 init happens
> way earlier than the rest). Fix that by initializing the TSC of vcpu's >
> 0 with reference to vcpu0 init tsc value. This way TSC synchronization
> is kept (apparently Xen does something similar).
>
> 3) The code which adjusts the TSC of a VCPU on physical CPU switch is
> meant to guarantee that the guest sees a monotonically increasing value.
> However there is a large gap, in terms of clocks, between the time which
> the source CPU TSC is read and the time the destination CPU TSC is read.
> So move those two reads to happen at vcpu_clear.
>
> I believe that 3) is the reason for the -no-kvm-irqchip problems
> reported by Avi on RHEL5, with kernels < 2.6.21 (where vcpu->cpu pinning
> would fix the problem). Unfortunately I could reproduce that problem.
>
> 4-way guest with constant tick at 250Hz now reliably sees the TSC's
> synchronized, and idle guest CPU consumption is reduced by 50% (3-4%
> instead of 8%, the latter with acpi_pm_good boot parameter).
>
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4741806..e1c8cf4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -47,6 +47,8 @@ struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
>  	int                   launched;
>  	u8                    fail;
> +	u64		      first_tsc;
> +	u64		      tsc_this;
>  	u32                   idt_vectoring_info;
>  	struct kvm_msr_entry *guest_msrs;
>  	struct kvm_msr_entry *host_msrs;
> @@ -254,6 +256,7 @@ static void vcpu_clear(struct vcpu_vmx *vmx)
>  	if (vmx->vcpu.cpu == -1)
>  		return;
>  	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
> +	rdtscll(vmx->tsc_this);
>  	vmx->launched = 0;
>  }
>  
> @@ -480,6 +483,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
>  	reload_host_efer(vmx);
>  }
>  
> +static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
> +
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>   * vcpu mutex is already taken.
> @@ -488,7 +493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u64 phys_addr = __pa(vmx->vmcs);
> -	u64 tsc_this, delta;
> +	u64 delta;
>  
>  	if (vcpu->cpu != cpu) {
>  		vcpu_clear(vmx);
> @@ -511,6 +516,19 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		struct descriptor_table dt;
>  		unsigned long sysenter_esp;
>  
> +		if (unlikely(vcpu->cpu == -1)) {
>   

This can happen after migration, I believe.

> +			rdtscll(vcpu->arch.host_tsc);
> +			vmx->tsc_this = vcpu->arch.host_tsc;
> +			if (vcpu->vcpu_id == 0) {
> +				guest_write_tsc(vcpu->arch.host_tsc, 0);
> +				vmx->first_tsc = vcpu->arch.host_tsc;
> +			} else {
> +				struct vcpu_vmx *cpu0;
> +				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
> +				guest_write_tsc(cpu0->first_tsc, 0);
> +			}
> +		}
> +
>   

Depending on vcpu_id == 0 can cause problems (for example, if vcpu 0 is 
somehow not the first to run).

We might initialize the tsc base on vm creation, and if the host tsc is 
synced, then the guest tsc should also be.

>  		vcpu->cpu = cpu;
>  		/*
>  		 * Linux uses per-cpu TSS and GDT, so set these when switching
> @@ -526,8 +544,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		/*
>  		 * Make sure the time stamp counter is monotonous.
>  		 */
> -		rdtscll(tsc_this);
> -		delta = vcpu->arch.host_tsc - tsc_this;
> +		delta = vcpu->arch.host_tsc - vmx->tsc_this;
>  		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
>   

This is a little roundabout, how about moving the delta calculation 
immediately after the call to vcpu_clear()?

I don't think this is the cause of the problem, it can't account for 
more than a few hundred cycles, compared to the much greater vmentry cost.

Anyway it should be in a separate patch.



-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478A01D1.7000402-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]   ` <478A01D1.7000402-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-14 16:06     ` Marcelo Tosatti
  2008-01-14 20:46       ` Marcelo Tosatti
  2008-01-15 14:33       ` Avi Kivity
  0 siblings, 2 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2008-01-14 16:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel

Hi Avi,

On Sun, Jan 13, 2008 at 02:19:29PM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >The boot TSC sync check is failing on recent Linux SMP guests on TSC
> >stable hosts.
> >
> >  
> 
> What about tsc unstable hosts?  If your patch convinces the guest its 
> tsc is table, while the host tsc is not, then it may cause confusion 
> later on.

The adjustment to zero won't fool the guest because it assumes that the
TSC's are synchronized. It will simply set the guest TSC to zero on all
VCPUs based on the time VCPU0 was initialized.

That is, setting -(vcpu[0].first_tsc) on all VCPU's will not correct any
synchronization problem.

> >Following patch attempts to address the problems, which are:
> >
> >1) APIC_DM_STARTUP, which is only called for vcpu's other than vcpu0,
> >will trigger ->vcpu_reset(), setting the TSC to 0. Fix that by moving
> >the guest_write_tsc(0) to vcpu_load.
> >
> >2) vcpu's are initialized at different times by QEMU (vcpu0 init happens
> >way earlier than the rest). Fix that by initializing the TSC of vcpu's >
> >0 with reference to vcpu0 init tsc value. This way TSC synchronization
> >is kept (apparently Xen does something similar).
> >
> >3) The code which adjusts the TSC of a VCPU on physical CPU switch is
> >meant to guarantee that the guest sees a monotonically increasing value.
> >However there is a large gap, in terms of clocks, between the time which
> >the source CPU TSC is read and the time the destination CPU TSC is read.
> >So move those two reads to happen at vcpu_clear.
> >
> >I believe that 3) is the reason for the -no-kvm-irqchip problems
> >reported by Avi on RHEL5, with kernels < 2.6.21 (where vcpu->cpu pinning
> >would fix the problem). Unfortunately I could reproduce that problem.
> >
> >4-way guest with constant tick at 250Hz now reliably sees the TSC's
> >synchronized, and idle guest CPU consumption is reduced by 50% (3-4%
> >instead of 8%, the latter with acpi_pm_good boot parameter).
> >
> >
> >diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >index 4741806..e1c8cf4 100644
> >--- a/arch/x86/kvm/vmx.c
> >+++ b/arch/x86/kvm/vmx.c
> >@@ -47,6 +47,8 @@ struct vcpu_vmx {
> > 	struct kvm_vcpu       vcpu;
> > 	int                   launched;
> > 	u8                    fail;
> >+	u64		      first_tsc;
> >+	u64		      tsc_this;
> > 	u32                   idt_vectoring_info;
> > 	struct kvm_msr_entry *guest_msrs;
> > 	struct kvm_msr_entry *host_msrs;
> >@@ -254,6 +256,7 @@ static void vcpu_clear(struct vcpu_vmx *vmx)
> > 	if (vmx->vcpu.cpu == -1)
> > 		return;
> > 	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
> >+	rdtscll(vmx->tsc_this);
> > 	vmx->launched = 0;
> > }
> > 
> >@@ -480,6 +483,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
> > 	reload_host_efer(vmx);
> > }
> > 
> >+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
> >+
> > /*
> >  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
> >  * vcpu mutex is already taken.
> >@@ -488,7 +493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> >cpu)
> > {
> > 	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > 	u64 phys_addr = __pa(vmx->vmcs);
> >-	u64 tsc_this, delta;
> >+	u64 delta;
> > 
> > 	if (vcpu->cpu != cpu) {
> > 		vcpu_clear(vmx);
> >@@ -511,6 +516,19 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> >cpu)
> > 		struct descriptor_table dt;
> > 		unsigned long sysenter_esp;
> > 
> >+		if (unlikely(vcpu->cpu == -1)) {
> >  
> 
> This can happen after migration, I believe.

Right now the destination host will do ->vcpu_reset() on all VCPU's on
startup... so its already broken. This patch is not introducing any
issues, just changing where it happens :)

Perhaps fixing migration should be the subject of a separate patch?

> >+			rdtscll(vcpu->arch.host_tsc);
> >+			vmx->tsc_this = vcpu->arch.host_tsc;
> >+			if (vcpu->vcpu_id == 0) {
> >+				guest_write_tsc(vcpu->arch.host_tsc, 0);
> >+				vmx->first_tsc = vcpu->arch.host_tsc;
> >+			} else {
> >+				struct vcpu_vmx *cpu0;
> >+				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
> >+				guest_write_tsc(cpu0->first_tsc, 0);
> >+			}
> >+		}
> >+
> >  
> 
> Depending on vcpu_id == 0 can cause problems (for example, if vcpu 0 is 
> somehow not the first to run).

But QEMU will always run kvm_create() first (which does
kvm_create_vcpu(0)), then start the other threads later. So I don't see
how that can happen.

> We might initialize the tsc base on vm creation, and if the host tsc is 
> synced, then the guest tsc should also be.
> 
> > 		vcpu->cpu = cpu;
> > 		/*
> > 		 * Linux uses per-cpu TSS and GDT, so set these when 
> > 		 switching
> >@@ -526,8 +544,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> >cpu)
> > 		/*
> > 		 * Make sure the time stamp counter is monotonous.
> > 		 */
> >-		rdtscll(tsc_this);
> >-		delta = vcpu->arch.host_tsc - tsc_this;
> >+		delta = vcpu->arch.host_tsc - vmx->tsc_this;
> > 		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
> >  
> 
> This is a little roundabout, how about moving the delta calculation 
> immediately after the call to vcpu_clear()?
> 
> I don't think this is the cause of the problem, it can't account for 
> more than a few hundred cycles, compared to the much greater vmentry cost.

Actually it accounts for several thousand cycles warp by the time the
kernel checks the TSC synchronization between CPU's, which happens very
early on boot.

You saw the hang on userspace init, by then there could be many
VCPU->CPU switchs.

> 
> Anyway it should be in a separate patch.

How does the following look (minus proper changelog):


Index: kvm.quilt/arch/x86/kvm/vmx.c
===================================================================
--- kvm.quilt.orig/arch/x86/kvm/vmx.c
+++ kvm.quilt/arch/x86/kvm/vmx.c
@@ -47,6 +47,7 @@ struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	int                   launched;
 	u8                    fail;
+	u64		      first_tsc;
 	u32                   idt_vectoring_info;
 	struct kvm_msr_entry *guest_msrs;
 	struct kvm_msr_entry *host_msrs;
@@ -480,6 +481,7 @@ static void vmx_load_host_state(struct v
 	reload_host_efer(vmx);
 }
 
+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -511,6 +513,18 @@ static void vmx_vcpu_load(struct kvm_vcp
 		struct descriptor_table dt;
 		unsigned long sysenter_esp;
 
+		if (unlikely(vcpu->cpu == -1)) {
+			rdtscll(vcpu->arch.host_tsc);
+			if (vcpu->vcpu_id == 0) {
+				guest_write_tsc(vcpu->arch.host_tsc, 0);
+				vmx->first_tsc = vcpu->arch.host_tsc;
+			} else {
+				struct vcpu_vmx *cpu0;
+				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
+				guest_write_tsc(cpu0->first_tsc, 0);
+			}
+		}
+
 		vcpu->cpu = cpu;
 		/*
 		 * Linux uses per-cpu TSS and GDT, so set these when switching
@@ -690,11 +704,8 @@ static u64 guest_read_tsc(void)
  * writes 'guest_tsc' into guest's timestamp counter "register"
  * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
  */
-static void guest_write_tsc(u64 guest_tsc)
+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc)
 {
-	u64 host_tsc;
-
-	rdtscll(host_tsc);
 	vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
 }
 
@@ -758,6 +769,7 @@ static int vmx_set_msr(struct kvm_vcpu *
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_msr_entry *msr;
+	u64 host_tsc;
 	int ret = 0;
 
 	switch (msr_index) {
@@ -786,7 +798,8 @@ static int vmx_set_msr(struct kvm_vcpu *
 		vmcs_writel(GUEST_SYSENTER_ESP, data);
 		break;
 	case MSR_IA32_TIME_STAMP_COUNTER:
-		guest_write_tsc(data);
+		rdtscll(host_tsc);
+		guest_write_tsc(host_tsc, data);
 		break;
 	default:
 		msr = find_msr_entry(vmx, msr_index);
@@ -1684,8 +1697,6 @@ static int vmx_vcpu_reset(struct kvm_vcp
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
 
-	guest_write_tsc(0);
-
 	/* Special registers */
 	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
 


Index: kvm.quilt/arch/x86/kvm/vmx.c
===================================================================
--- kvm.quilt.orig/arch/x86/kvm/vmx.c
+++ kvm.quilt/arch/x86/kvm/vmx.c
@@ -238,26 +238,6 @@ static void vmcs_clear(struct vmcs *vmcs
 		       vmcs, phys_addr);
 }
 
-static void __vcpu_clear(void *arg)
-{
-	struct vcpu_vmx *vmx = arg;
-	int cpu = raw_smp_processor_id();
-
-	if (vmx->vcpu.cpu == cpu)
-		vmcs_clear(vmx->vmcs);
-	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
-		per_cpu(current_vmcs, cpu) = NULL;
-	rdtscll(vmx->vcpu.arch.host_tsc);
-}
-
-static void vcpu_clear(struct vcpu_vmx *vmx)
-{
-	if (vmx->vcpu.cpu == -1)
-		return;
-	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
-	vmx->launched = 0;
-}
-
 static unsigned long vmcs_readl(unsigned long field)
 {
 	unsigned long value;
@@ -348,6 +328,34 @@ static void update_exception_bitmap(stru
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
+static void __vcpu_clear(void *arg)
+{
+	struct vcpu_vmx *vmx = arg;
+	int cpu = raw_smp_processor_id();
+
+	if (vmx->vcpu.cpu == cpu)
+		vmcs_clear(vmx->vmcs);
+	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
+		per_cpu(current_vmcs, cpu) = NULL;
+	rdtscll(vmx->vcpu.arch.host_tsc);
+}
+
+static void vcpu_clear(struct vcpu_vmx *vmx)
+{
+	u64 tsc_this, delta;
+
+	if (vmx->vcpu.cpu == -1)
+		return;
+	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
+	/*
+	 * Make sure the time stamp counter is monotonous.
+	 */
+	rdtscll(tsc_this);
+	delta = vmx->vcpu.arch.host_tsc - tsc_this;
+	vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
+	vmx->launched = 0;
+}
+
 static void reload_tss(void)
 {
 #ifndef CONFIG_X86_64
@@ -490,7 +498,6 @@ static void vmx_vcpu_load(struct kvm_vcp
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 phys_addr = __pa(vmx->vmcs);
-	u64 tsc_this, delta;
 
 	if (vcpu->cpu != cpu) {
 		vcpu_clear(vmx);
@@ -536,13 +543,6 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
-
-		/*
-		 * Make sure the time stamp counter is monotonous.
-		 */
-		rdtscll(tsc_this);
-		delta = vcpu->arch.host_tsc - tsc_this;
-		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
 	}
 }
 

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-14 16:06     ` Marcelo Tosatti
@ 2008-01-14 20:46       ` Marcelo Tosatti
  2008-01-15 14:59         ` Avi Kivity
  2008-01-15 14:33       ` Avi Kivity
  1 sibling, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2008-01-14 20:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

On Mon, Jan 14, 2008 at 02:06:47PM -0200, Marcelo Tosatti wrote:
> Hi Avi,
> 
> On Sun, Jan 13, 2008 at 02:19:29PM +0200, Avi Kivity wrote:
> > Marcelo Tosatti wrote:
> > >The boot TSC sync check is failing on recent Linux SMP guests on TSC
> > >stable hosts.
> > >
> > >  
> > 
> > What about tsc unstable hosts?  If your patch convinces the guest its 
> > tsc is table, while the host tsc is not, then it may cause confusion 
> > later on.
> 
> The adjustment to zero won't fool the guest because it assumes that the
> TSC's are synchronized. It will simply set the guest TSC to zero on all
> VCPUs based on the time VCPU0 was initialized.
> 
> That is, setting -(vcpu[0].first_tsc) on all VCPU's will not correct any
> synchronization problem.
> 
> > >Following patch attempts to address the problems, which are:
> > >
> > >1) APIC_DM_STARTUP, which is only called for vcpu's other than vcpu0,
> > >will trigger ->vcpu_reset(), setting the TSC to 0. Fix that by moving
> > >the guest_write_tsc(0) to vcpu_load.
> > >
> > >2) vcpu's are initialized at different times by QEMU (vcpu0 init happens
> > >way earlier than the rest). Fix that by initializing the TSC of vcpu's >
> > >0 with reference to vcpu0 init tsc value. This way TSC synchronization
> > >is kept (apparently Xen does something similar).
> > >
> > >3) The code which adjusts the TSC of a VCPU on physical CPU switch is
> > >meant to guarantee that the guest sees a monotonically increasing value.
> > >However there is a large gap, in terms of clocks, between the time which
> > >the source CPU TSC is read and the time the destination CPU TSC is read.
> > >So move those two reads to happen at vcpu_clear.
> > >
> > >I believe that 3) is the reason for the -no-kvm-irqchip problems
> > >reported by Avi on RHEL5, with kernels < 2.6.21 (where vcpu->cpu pinning
> > >would fix the problem). Unfortunately I could reproduce that problem.
> > >
> > >4-way guest with constant tick at 250Hz now reliably sees the TSC's
> > >synchronized, and idle guest CPU consumption is reduced by 50% (3-4%
> > >instead of 8%, the latter with acpi_pm_good boot parameter).
> > >
> > >
> > >diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > >index 4741806..e1c8cf4 100644
> > >--- a/arch/x86/kvm/vmx.c
> > >+++ b/arch/x86/kvm/vmx.c
> > >@@ -47,6 +47,8 @@ struct vcpu_vmx {
> > > 	struct kvm_vcpu       vcpu;
> > > 	int                   launched;
> > > 	u8                    fail;
> > >+	u64		      first_tsc;
> > >+	u64		      tsc_this;
> > > 	u32                   idt_vectoring_info;
> > > 	struct kvm_msr_entry *guest_msrs;
> > > 	struct kvm_msr_entry *host_msrs;
> > >@@ -254,6 +256,7 @@ static void vcpu_clear(struct vcpu_vmx *vmx)
> > > 	if (vmx->vcpu.cpu == -1)
> > > 		return;
> > > 	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
> > >+	rdtscll(vmx->tsc_this);
> > > 	vmx->launched = 0;
> > > }
> > > 
> > >@@ -480,6 +483,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
> > > 	reload_host_efer(vmx);
> > > }
> > > 
> > >+static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
> > >+
> > > /*
> > >  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
> > >  * vcpu mutex is already taken.
> > >@@ -488,7 +493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> > >cpu)
> > > {
> > > 	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > 	u64 phys_addr = __pa(vmx->vmcs);
> > >-	u64 tsc_this, delta;
> > >+	u64 delta;
> > > 
> > > 	if (vcpu->cpu != cpu) {
> > > 		vcpu_clear(vmx);
> > >@@ -511,6 +516,19 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> > >cpu)
> > > 		struct descriptor_table dt;
> > > 		unsigned long sysenter_esp;
> > > 
> > >+		if (unlikely(vcpu->cpu == -1)) {
> > >  
> > 
> > This can happen after migration, I believe.
> 
> Right now the destination host will do ->vcpu_reset() on all VCPU's on
> startup... so its already broken. This patch is not introducing any
> issues, just changing where it happens :)
> 
> Perhaps fixing migration should be the subject of a separate patch?
> 
> > >+			rdtscll(vcpu->arch.host_tsc);
> > >+			vmx->tsc_this = vcpu->arch.host_tsc;
> > >+			if (vcpu->vcpu_id == 0) {
> > >+				guest_write_tsc(vcpu->arch.host_tsc, 0);
> > >+				vmx->first_tsc = vcpu->arch.host_tsc;
> > >+			} else {
> > >+				struct vcpu_vmx *cpu0;
> > >+				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
> > >+				guest_write_tsc(cpu0->first_tsc, 0);
> > >+			}
> > >+		}
> > >+
> > >  
> > 
> > Depending on vcpu_id == 0 can cause problems (for example, if vcpu 0 is 
> > somehow not the first to run).
> 
> But QEMU will always run kvm_create() first (which does
> kvm_create_vcpu(0)), then start the other threads later. So I don't see
> how that can happen.
> 
> > We might initialize the tsc base on vm creation, and if the host tsc is 
> > synced, then the guest tsc should also be.
> > 
> > > 		vcpu->cpu = cpu;
> > > 		/*
> > > 		 * Linux uses per-cpu TSS and GDT, so set these when 
> > > 		 switching
> > >@@ -526,8 +544,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
> > >cpu)
> > > 		/*
> > > 		 * Make sure the time stamp counter is monotonous.
> > > 		 */
> > >-		rdtscll(tsc_this);
> > >-		delta = vcpu->arch.host_tsc - tsc_this;
> > >+		delta = vcpu->arch.host_tsc - vmx->tsc_this;
> > > 		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
> > >  
> > 
> > This is a little roundabout, how about moving the delta calculation 
> > immediately after the call to vcpu_clear()?
> > 
> > I don't think this is the cause of the problem, it can't account for 
> > more than a few hundred cycles, compared to the much greater vmentry cost.
> 
> Actually it accounts for several thousand cycles warp by the time the
> kernel checks the TSC synchronization between CPU's, which happens very
> early on boot.
> 
> You saw the hang on userspace init, by then there could be many
> VCPU->CPU switchs.
> 
> > 
> > Anyway it should be in a separate patch.
> 
> How does the following look (minus proper changelog):

Actually setting the TSC_OFFSET at vcpu_clear() seems a bit early:

vmwrite error: reg 2010 value 82470677 (err -2109274505)
Pid: 5590, comm: qemu-system-x86 Not tainted 2.6.24-rc6-ge7706133 #26

Call Trace:
 [<ffffffff880a5983>] :kvm_intel:vmwrite_error+0x22/0x28
 [<ffffffff880a5a99>] :kvm_intel:vcpu_clear+0x54/0x60
 [<ffffffff880a5ad0>] :kvm_intel:vmx_vcpu_load+0x29/0x12b
 [<ffffffff8024d803>] get_futex_value_locked+0x1d/0x38
 [<ffffffff8808fcf1>] :kvm:kvm_arch_vcpu_ioctl_run+0x13/0x4b9
 [<ffffffff8808c3df>] :kvm:kvm_vcpu_ioctl+0xda/0x2dd

So the first patch seemed alright (taking into account the comments from
my previous email, that migration should probably be fixed in a separate
patch since its broken already and that its guaranteed that vcpu0 is the
first to hit vcpu_load).

Avi?

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-14 20:46       ` Marcelo Tosatti
@ 2008-01-15 14:59         ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2008-01-15 14:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> So the first patch seemed alright (taking into account the comments from
> my previous email, that migration should probably be fixed in a separate
> patch since its broken already and that its guaranteed that vcpu0 is the
> first to hit vcpu_load).
>   

I really would like not to have this dependency.  It's easy to write 
userspace that doesn't do this and there's no reason not to.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-14 16:06     ` Marcelo Tosatti
  2008-01-14 20:46       ` Marcelo Tosatti
@ 2008-01-15 14:33       ` Avi Kivity
       [not found]         ` <478CC448.1030901-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-15 14:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> Hi Avi,
>
> On Sun, Jan 13, 2008 at 02:19:29PM +0200, Avi Kivity wrote:
>   
>> Marcelo Tosatti wrote:
>>     
>>> The boot TSC sync check is failing on recent Linux SMP guests on TSC
>>> stable hosts.
>>>
>>>  
>>>       
>> What about tsc unstable hosts?  If your patch convinces the guest its 
>> tsc is table, while the host tsc is not, then it may cause confusion 
>> later on.
>>     
>
> The adjustment to zero won't fool the guest because it assumes that the
> TSC's are synchronized. It will simply set the guest TSC to zero on all
> VCPUs based on the time VCPU0 was initialized.
>
> That is, setting -(vcpu[0].first_tsc) on all VCPU's will not correct any
> synchronization problem.
>
>   

What I mean is, right now we present really broken tscs to the guest.  
After your patch, we present less-broken tscs (at boot, they will 
closely resemble stable tscs).  But after the machine idles a bit and 
cpufreq takes over, the tscs won't be stable any more.

>>> +static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
>>> +
>>> /*
>>>  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>>>  * vcpu mutex is already taken.
>>> @@ -488,7 +493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
>>> cpu)
>>> {
>>> 	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> 	u64 phys_addr = __pa(vmx->vmcs);
>>> -	u64 tsc_this, delta;
>>> +	u64 delta;
>>>
>>> 	if (vcpu->cpu != cpu) {
>>> 		vcpu_clear(vmx);
>>> @@ -511,6 +516,19 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
>>> cpu)
>>> 		struct descriptor_table dt;
>>> 		unsigned long sysenter_esp;
>>>
>>> +		if (unlikely(vcpu->cpu == -1)) {
>>>  
>>>       
>> This can happen after migration, I believe.
>>     
>
> Right now the destination host will do ->vcpu_reset() on all VCPU's on
> startup... so its already broken. This patch is not introducing any
> issues, just changing where it happens :)
>   

No, after the reset qemu will set the guest msrs including the tsc msr, 
overriding the reset value.

However, by that time vcpu->cpu will no longer be -1, so there's no 
issue there.

On the other hand, it may happen during cpu hotunplug (see 
decache_vcpus_on_vcpu).

>
>   
>>> +			rdtscll(vcpu->arch.host_tsc);
>>> +			vmx->tsc_this = vcpu->arch.host_tsc;
>>> +			if (vcpu->vcpu_id == 0) {
>>> +				guest_write_tsc(vcpu->arch.host_tsc, 0);
>>> +				vmx->first_tsc = vcpu->arch.host_tsc;
>>> +			} else {
>>> +				struct vcpu_vmx *cpu0;
>>> +				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
>>> +				guest_write_tsc(cpu0->first_tsc, 0);
>>> +			}
>>> +		}
>>> +
>>>  
>>>       
>> Depending on vcpu_id == 0 can cause problems (for example, if vcpu 0 is 
>> somehow not the first to run).
>>     
>
> But QEMU will always run kvm_create() first (which does
> kvm_create_vcpu(0)), then start the other threads later. So I don't see
> how that can happen.
>
>   

We're not developing purely for qemu (there's xenner, for example) and I 
don't want to embed hidden assumptions into the API.


>>> 		vcpu->cpu = cpu;
>>> 		/*
>>> 		 * Linux uses per-cpu TSS and GDT, so set these when 
>>> 		 switching
>>> @@ -526,8 +544,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
>>> cpu)
>>> 		/*
>>> 		 * Make sure the time stamp counter is monotonous.
>>> 		 */
>>> -		rdtscll(tsc_this);
>>> -		delta = vcpu->arch.host_tsc - tsc_this;
>>> +		delta = vcpu->arch.host_tsc - vmx->tsc_this;
>>> 		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
>>>  
>>>       
>> This is a little roundabout, how about moving the delta calculation 
>> immediately after the call to vcpu_clear()?
>>
>> I don't think this is the cause of the problem, it can't account for 
>> more than a few hundred cycles, compared to the much greater vmentry cost.
>>     
>
> Actually it accounts for several thousand cycles warp by the time the
> kernel checks the TSC synchronization between CPU's, which happens very
> early on boot.
>
> You saw the hang on userspace init, by then there could be many
> VCPU->CPU switchs.
>
>   

Okay.

>> Anyway it should be in a separate patch.
>>     
>
> How does the following look (minus proper changelog):
>
>
> Index: kvm.quilt/arch/x86/kvm/vmx.c
> ===================================================================
> --- kvm.quilt.orig/arch/x86/kvm/vmx.c
> +++ kvm.quilt/arch/x86/kvm/vmx.c
> @@ -47,6 +47,7 @@ struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
>  	int                   launched;
>  	u8                    fail;
> +	u64		      first_tsc;
>  	u32                   idt_vectoring_info;
>  	struct kvm_msr_entry *guest_msrs;
>  	struct kvm_msr_entry *host_msrs;
> @@ -480,6 +481,7 @@ static void vmx_load_host_state(struct v
>  	reload_host_efer(vmx);
>  }
>  
> +static void guest_write_tsc(u64 host_tsc, u64 guest_tsc);
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>   * vcpu mutex is already taken.
> @@ -511,6 +513,18 @@ static void vmx_vcpu_load(struct kvm_vcp
>  		struct descriptor_table dt;
>  		unsigned long sysenter_esp;
>  
> +		if (unlikely(vcpu->cpu == -1)) {
>   
> +			rdtscll(vcpu->arch.host_tsc);
> +			if (vcpu->vcpu_id == 0) {
> +				guest_write_tsc(vcpu->arch.host_tsc, 0);
> +				vmx->first_tsc = vcpu->arch.host_tsc;
> +			} else {
> +				struct vcpu_vmx *cpu0;
> +				cpu0 = to_vmx(vcpu->kvm->vcpus[0]);
> +				guest_write_tsc(cpu0->first_tsc, 0);
> +			}
> +		}
> +
>   

I don't like the dependency on vcpu 0.  Also see comments above.

>  
> +static void __vcpu_clear(void *arg)
> +{
> +	struct vcpu_vmx *vmx = arg;
> +	int cpu = raw_smp_processor_id();
> +
> +	if (vmx->vcpu.cpu == cpu)
> +		vmcs_clear(vmx->vmcs);
> +	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> +		per_cpu(current_vmcs, cpu) = NULL;
> +	rdtscll(vmx->vcpu.arch.host_tsc);
> +}
> +
> +static void vcpu_clear(struct vcpu_vmx *vmx)
> +{
> +	u64 tsc_this, delta;
> +
> +	if (vmx->vcpu.cpu == -1)
> +		return;
> +	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 0, 1);
> +	/*
> +	 * Make sure the time stamp counter is monotonous.
> +	 */
> +	rdtscll(tsc_this);
> +	delta = vmx->vcpu.arch.host_tsc - tsc_this;
> +	vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
> +	vmx->launched = 0;
> +}
>   

This part is fine.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478CC448.1030901-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]         ` <478CC448.1030901-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-15 14:50           ` Alexander Graf
       [not found]             ` <478CC819.3040106-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2008-01-15 14:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel

Avi Kivity wrote:
> Marcelo Tosatti wrote:
>   
>> Hi Avi,
>>
>> On Sun, Jan 13, 2008 at 02:19:29PM +0200, Avi Kivity wrote:
>>   
>>     
>>> Marcelo Tosatti wrote:
>>>     
>>>       
>>>> The boot TSC sync check is failing on recent Linux SMP guests on TSC
>>>> stable hosts.
>>>>
>>>>  
>>>>       
>>>>         
>>> What about tsc unstable hosts?  If your patch convinces the guest its 
>>> tsc is table, while the host tsc is not, then it may cause confusion 
>>> later on.
>>>     
>>>       
>> The adjustment to zero won't fool the guest because it assumes that the
>> TSC's are synchronized. It will simply set the guest TSC to zero on all
>> VCPUs based on the time VCPU0 was initialized.
>>
>> That is, setting -(vcpu[0].first_tsc) on all VCPU's will not correct any
>> synchronization problem.
>>
>>   
>>     
>
> What I mean is, right now we present really broken tscs to the guest.  
> After your patch, we present less-broken tscs (at boot, they will 
> closely resemble stable tscs).  But after the machine idles a bit and 
> cpufreq takes over, the tscs won't be stable any more.
>
>   

Why would the TSC break due to cpufreq? This patchset was against VMX,
which is only available on current Intel CPUs. All of those guarantee a
constant TSC increase at the maximum frequency.

Also see the Intel Documentation:

Vol. 3 18-37

For Pentium 4 processors, Intel Xeon processors (family [0FH], models
[03H and higher]); for Intel Core Solo and Intel Core Duo processors
(family [06H], model [0EH]); for the Intel Xeon processor 5100 series
and Intel Core 2 Duo processors (family [06H], model [0FH]): the
time-stamp counter increments at a constant rate. That rate may be set
by the maximum core-clock to bus-clock ratio of the processor or may be
set by the maximum resolved frequency at which the processor is booted.
The maximum resolved frequency may differ from the maximum qualified
frequency of the processor, see Section 18.17.5 for more
detail.
The specific processor configuration determines the behavior. Constant
TSC behavior ensures that the duration of each clock tick is uniform and
supports the use of the TSC as a wall clock timer even if the processor
core changes frequency.
This is the architectural behavior moving forward.

Alex

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478CC819.3040106-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]             ` <478CC819.3040106-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>
@ 2008-01-15 15:09               ` Avi Kivity
       [not found]                 ` <478CCCA9.2080300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-15 15:09 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Marcelo Tosatti, kvm-devel

Alexander Graf wrote:
>>>   
>>>     
>>>       
>> What I mean is, right now we present really broken tscs to the guest.  
>> After your patch, we present less-broken tscs (at boot, they will 
>> closely resemble stable tscs).  But after the machine idles a bit and 
>> cpufreq takes over, the tscs won't be stable any more.
>>
>>   
>>     
>
> Why would the TSC break due to cpufreq? This patchset was against VMX,
> which is only available on current Intel CPUs. All of those guarantee a
> constant TSC increase at the maximum frequency.
>
> Also see the Intel Documentation:
>
> Vol. 3 18-37
>
> For Pentium 4 processors, Intel Xeon processors (family [0FH], models
> [03H and higher]); for Intel Core Solo and Intel Core Duo processors
> (family [06H], model [0EH]); for the Intel Xeon processor 5100 series
> and Intel Core 2 Duo processors (family [06H], model [0FH]): the
> time-stamp counter increments at a constant rate. That rate may be set
> by the maximum core-clock to bus-clock ratio of the processor or may be
> set by the maximum resolved frequency at which the processor is booted.
> The maximum resolved frequency may differ from the maximum qualified
> frequency of the processor, see Section 18.17.5 for more
> detail.
> The specific processor configuration determines the behavior. Constant
> TSC behavior ensures that the duration of each clock tick is uniform and
> supports the use of the TSC as a wall clock timer even if the processor
> core changes frequency.
> This is the architectural behavior moving forward.

Thanks; that's reassuring to know that it will work (at least on Intel).

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478CCCA9.2080300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                 ` <478CCCA9.2080300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-15 16:42                   ` Alexander Graf
       [not found]                     ` <478CE277.9010109-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>
  2008-01-16  5:51                   ` Andi Kleen
  1 sibling, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2008-01-15 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Alexander Graf, kvm-devel

Avi Kivity wrote:
> Alexander Graf wrote:
>   
>>>>   
>>>>     
>>>>       
>>>>         
>>> What I mean is, right now we present really broken tscs to the guest.  
>>> After your patch, we present less-broken tscs (at boot, they will 
>>> closely resemble stable tscs).  But after the machine idles a bit and 
>>> cpufreq takes over, the tscs won't be stable any more.
>>>
>>>   
>>>     
>>>       
>> Why would the TSC break due to cpufreq? This patchset was against VMX,
>> which is only available on current Intel CPUs. All of those guarantee a
>> constant TSC increase at the maximum frequency.
>>
>> Also see the Intel Documentation:
>>
>> Vol. 3 18-37
>>
>> For Pentium 4 processors, Intel Xeon processors (family [0FH], models
>> [03H and higher]); for Intel Core Solo and Intel Core Duo processors
>> (family [06H], model [0EH]); for the Intel Xeon processor 5100 series
>> and Intel Core 2 Duo processors (family [06H], model [0FH]): the
>> time-stamp counter increments at a constant rate. That rate may be set
>> by the maximum core-clock to bus-clock ratio of the processor or may be
>> set by the maximum resolved frequency at which the processor is booted.
>> The maximum resolved frequency may differ from the maximum qualified
>> frequency of the processor, see Section 18.17.5 for more
>> detail.
>> The specific processor configuration determines the behavior. Constant
>> TSC behavior ensures that the duration of each clock tick is uniform and
>> supports the use of the TSC as a wall clock timer even if the processor
>> core changes frequency.
>> This is the architectural behavior moving forward.
>>     
>
> Thanks; that's reassuring to know that it will work (at least on Intel).
>
>   
If I remember correctly, there was a statement on LKML by AMD which said
that the TSC is completely broken on AMD systems, so you should not use
it there anyway.


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478CE277.9010109-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                     ` <478CE277.9010109-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>
@ 2008-01-15 17:46                       ` Avi Kivity
       [not found]                         ` <478CF186.5030304-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-15 17:46 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Marcelo Tosatti, kvm-devel

Alexander Graf wrote:

  

> If I remember correctly, there was a statement on LKML by AMD which said
> that the TSC is completely broken on AMD systems, so you should not use
> it there anyway.
>   

Well, that's up to the guest to decide.  Looks like older Linux guests 
make the wrong decision.


-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478CF186.5030304-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                         ` <478CF186.5030304-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-15 18:29                           ` Amit Shah
  0 siblings, 0 replies; 28+ messages in thread
From: Amit Shah @ 2008-01-15 18:29 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
  Cc: Marcelo Tosatti, Alexander Graf, Avi Kivity

On Tuesday 15 January 2008 23:16:46 Avi Kivity wrote:
> Alexander Graf wrote:
> > If I remember correctly, there was a statement on LKML by AMD which said
> > that the TSC is completely broken on AMD systems, so you should not use
> > it there anyway.
>
> Well, that's up to the guest to decide.  Looks like older Linux guests
> make the wrong decision.

http://lkml.org/lkml/2005/11/4/173

That talks about AMD's TSC plans (dated 2005), but it's possible to find out 
whether it's a monotonically-increasing source. Basically some revisions of 
AMD family F onwards should support this feature:

Future TSC Directions and Solutions
===================================
Future AMD processors will provide a TSC that is P-state and
C-State invariant and unaffected by STPCLK-throttling.  This
will make  the TSC immune  to drift.  Because using  the TSC
for  fast  timer APIs  is  a  desirable  feature that  helps
performance,  AMD  has  defined  a CPUID  feature  bit  that
software   can   test   to   determine   if   the   TSC   is
invariant. Issuing a CPUID instruction with an %eax register
value of  0x8000_0007, on a  processor whose base  family is
0xF, returns "Advanced  Power Management Information" in the
%eax, %ebx, %ecx,  and %edx registers.  Bit 8  of the return
%edx is  the "TscInvariant" feature  flag which is  set when
TSC is P-state, C-state, and STPCLK-throttling invariant; it
is clear otherwise.

The  rate of the  invariant TSC  is implementation-dependent
and  will likely  *not* be  the frequency  of  the processor
core; however,  its period should be short  enough such that
it is  not possible for two  back-to-back rdtsc instructions
to  return the  same  value.  Software  which  is trying  to
measure  actual  processor  frequency  or  cycle-performance
should  use Performance  Event 76h,  CPU Clocks  not Halted,
rather than the TSC to count CPU cycles.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                 ` <478CCCA9.2080300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2008-01-15 16:42                   ` Alexander Graf
@ 2008-01-16  5:51                   ` Andi Kleen
       [not found]                     ` <p73ir1ul3ls.fsf-KvMlXPVkKihbpigZmTR7Iw@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-01-16  5:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Alexander Graf, kvm-devel



Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> writes:

> Thanks; that's reassuring to know that it will work (at least on Intel).

Actually there are modern Intel systems which still have instable TSCs;
e.g. IBM Summit multi node systems and some others. So you should
still handle that case.

-Andi


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <p73ir1ul3ls.fsf-KvMlXPVkKihbpigZmTR7Iw@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                     ` <p73ir1ul3ls.fsf-KvMlXPVkKihbpigZmTR7Iw@public.gmane.org>
@ 2008-01-16  8:46                       ` Avi Kivity
       [not found]                         ` <478DC453.1000404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-01-16  8:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Marcelo Tosatti, Alexander Graf, kvm-devel

[fixing gmane emails, urgfhsz]

Andi Kleen wrote:
> Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> writes:
>
>   
>> Thanks; that's reassuring to know that it will work (at least on Intel).
>>     
>
> Actually there are modern Intel systems which still have instable TSCs;
> e.g. IBM Summit multi node systems and some others. So you should
> still handle that case.
>   

I really don't see any way we could.  If the guest assumes tscs are 
synchronous, and they really are not, there's nothing we can do.

[well, we could trap and emulate rdtsc, but performance would tank]

You might taskset guests into a single node on such systems, which is a 
good idea anyway.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478DC453.1000404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                         ` <478DC453.1000404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-16 13:34                           ` Andi Kleen
       [not found]                             ` <478E08E5.2030507@qumranet.com>
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-01-16 13:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Alexander Graf, Andi Kleen, kvm-devel

On Wed, Jan 16, 2008 at 10:46:11AM +0200, Avi Kivity wrote:
> [fixing gmane emails, urgfhsz]
> 
> Andi Kleen wrote:
> >Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> writes:
> >
> >  
> >>Thanks; that's reassuring to know that it will work (at least on Intel).
> >>    
> >
> >Actually there are modern Intel systems which still have instable TSCs;
> >e.g. IBM Summit multi node systems and some others. So you should
> >still handle that case.
> >  
> 
> I really don't see any way we could.  If the guest assumes tscs are 
> synchronous, and they really are not, there's nothing we can do.

Linux checks a couple of things: e.g. if there are no deep C states
and if there are no clustered nodes in the APIC etc.

It might be reasonable to check the clock source of the kernel
and if it's not TSC force one of these in the emulated firmware
environment

> You might taskset guests into a single node on such systems, which is a
> good idea anyway.

Ah pushing the problem to the user. An easy, but typically wrong, solution.

-Andi

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <478E08E5.2030507@qumranet.com>]

[parent not found: <478E08E5.2030507-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                               ` <478E08E5.2030507-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-01-16 13:54                                 ` Andi Kleen
       [not found]                                   ` <20080116135415.GA14664-qrUzlfsMFqo/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-01-16 13:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Alexander Graf, Andi Kleen, kvm-devel

On Wed, Jan 16, 2008 at 03:38:45PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >Linux checks a couple of things: e.g. if there are no deep C states
> >and if there are no clustered nodes in the APIC etc.
> >
> >It might be reasonable to check the clock source of the kernel
> >and if it's not TSC force one of these in the emulated firmware
> >environment
> >
> >  
> 
> The problems are with older guests which assume the tsc is okay.  Newer 
> guests check the tsc and conclude that it isn't usable.

If the guest would get it wrong running natively on the host
I guess it would be reasonable to require an option that forces
TSC off. Disabling the TSC bit unfortunately won't work for 64bit
guests, but for probably most 32bit guests.

But for non broken guests they can only do that if the guest has the same 
visibility into the firmware state as the host. For the easy cases Linux will
check it anyways becaused on standard the TSC synchronicity check, but
there are cases where the TSCs only drift apart slowly over a longer time

[I finally fixed the clocksource watchdog now to catch this case, but
it will be only in .25]

I think it would be better to fake at least some of the usual
firmware cues for bad TSC if the host does not use it.

> 
> >>You might taskset guests into a single node on such systems, which is a
> >>good idea anyway.
> >>    
> >
> >Ah pushing the problem to the user. An easy, but typically wrong, solution.
> >  
> 
> If you have other suggestions I'll be happy to hear them.  I don't like 
> this either.

Check if host is using TSC source and if not force a clustered
APIC mode (only works for 64bit unfortunately) or fake a C3 state
in ACPI and on AMD clear the synchronous TSC bit.

-Andi

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <20080116135415.GA14664-qrUzlfsMFqo/4alezvVtWx2eb7JE58TQ@public.gmane.org>]

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                                   ` <20080116135415.GA14664-qrUzlfsMFqo/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2008-01-16 13:56                                     ` Avi Kivity
  2008-01-17 18:43                                     ` Marcelo Tosatti
  1 sibling, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2008-01-16 13:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Marcelo Tosatti, Alexander Graf, kvm-devel

Andi Kleen wrote:
> Check if host is using TSC source and if not force a clustered
> APIC mode (only works for 64bit unfortunately) or fake a C3 state
> in ACPI and on AMD clear the synchronous TSC bit.
>   

Yes, I got similar suggestions from Thomas.  But it looks like older 
guests will need a boot option.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
       [not found]                                   ` <20080116135415.GA14664-qrUzlfsMFqo/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  2008-01-16 13:56                                     ` Avi Kivity
@ 2008-01-17 18:43                                     ` Marcelo Tosatti
  2008-01-17 18:56                                       ` Andi Kleen
                                                         ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2008-01-17 18:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Marcelo Tosatti, Alexander Graf, Avi Kivity, kvm-devel

On Wed, Jan 16, 2008 at 02:54:15PM +0100, Andi Kleen wrote:
> On Wed, Jan 16, 2008 at 03:38:45PM +0200, Avi Kivity wrote:
> > Andi Kleen wrote:
> > >Linux checks a couple of things: e.g. if there are no deep C states
> > >and if there are no clustered nodes in the APIC etc.
> > >
> > >It might be reasonable to check the clock source of the kernel
> > >and if it's not TSC force one of these in the emulated firmware
> > >environment
> > >
> > >  
> > 
> > The problems are with older guests which assume the tsc is okay.  Newer 
> > guests check the tsc and conclude that it isn't usable.
> 
> If the guest would get it wrong running natively on the host
> I guess it would be reasonable to require an option that forces
> TSC off. Disabling the TSC bit unfortunately won't work for 64bit
> guests, but for probably most 32bit guests.
> 
> But for non broken guests they can only do that if the guest has the same 
> visibility into the firmware state as the host. For the easy cases Linux will
> check it anyways becaused on standard the TSC synchronicity check, but
> there are cases where the TSCs only drift apart slowly over a longer time
> 
> [I finally fixed the clocksource watchdog now to catch this case, but
> it will be only in .25]
> 
> I think it would be better to fake at least some of the usual
> firmware cues for bad TSC if the host does not use it.
> 
> > 
> > >>You might taskset guests into a single node on such systems, which is a
> > >>good idea anyway.
> > >>    
> > >
> > >Ah pushing the problem to the user. An easy, but typically wrong, solution.
> > >  
> > 
> > If you have other suggestions I'll be happy to hear them.  I don't like 
> > this either.
> 
> Check if host is using TSC source and if not force a clustered
> APIC mode (only works for 64bit unfortunately) or fake a C3 state
> in ACPI and on AMD clear the synchronous TSC bit.

Forcing clustered APIC mode works only on SMP, and there were high CPU
consumption on Windows SMP guests due to C3 state being reported (fixed
in kvm-30 something).

So perhaps:
- Faking clustered APIC on SMP 
- Faking C3 on UP

And turning of the TSC bit (for 32-bit guests).

Is the way to go? 

Avi, do you understand why C3 was causing the Windows SMP problems ?

/* Common C-state entry for C2, C3, .. */
static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
{
        if (cstate->space_id == ACPI_CSTATE_FFH) {
                /* Call into architectural FFH based C-state */
                acpi_processor_ffh_cstate_enter(cstate);
        } else {
                int unused;
                /* IO port based C-state */
                inb(cstate->address);
                /* Dummy wait op - must do something useless after P_LVL2 read
                   because chipsets cannot guarantee that STPCLK# signal
                   gets asserted in time to freeze execution properly. */
                unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
        }
}

Clearly that inb() won't actually idle under QEMU. So the question is,
if C3 stated is reported, that port read should be emulated... But how? 

http://www.mail-archive.com/kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org/msg05360.html

Zhao, Yunfeng wrote:
>> I ran the same image under qemu 0.9.0, uniprocessor, and got the same
>> results: high host cpu usage while the guest is idle.  Most likely the
>> BIOS ACPI tables are misconfigured and the Windows idle loop is
>>     
> confused.




-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-17 18:43                                     ` Marcelo Tosatti
@ 2008-01-17 18:56                                       ` Andi Kleen
  2008-01-20 14:59                                       ` Avi Kivity
  2008-05-04 15:40                                       ` Avi Kivity
  2 siblings, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2008-01-17 18:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, Alexander Graf, Andi Kleen, Avi Kivity

> And turning of the TSC bit (for 32-bit guests).

Turning off the TSC bit will break 64bit Linux (it checks
if the cpuid bits have minimum supported features) and you don't know 
in advantage if a guest is 32bit or not.

Ok there is a special BIOS call that the kernel issues to tell
the BIOS that it is 64bit -- in theory you could intercept that
one and then disable the TSC bit, but it doesn't sound very clean.

Also that call wasn't in the very first 64bit kernels.

-Andi

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-17 18:43                                     ` Marcelo Tosatti
  2008-01-17 18:56                                       ` Andi Kleen
@ 2008-01-20 14:59                                       ` Avi Kivity
  2008-05-04 15:40                                       ` Avi Kivity
  2 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2008-01-20 14:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, Alexander Graf, Andi Kleen

Marcelo Tosatti wrote:
> Avi, do you understand why C3 was causing the Windows SMP problems ?
>
>   

It may be that the latency was advertised as so low that Windows 
switched into C3 and back too often.

Also, the problem possibly wasn't SMP related but rather TPR related -- 
when the idle loop called acpi to do the mode switches, that caused many 
tpr hits.  Perhaps with a FlexPriority enabled processor it would work 
better.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-01-17 18:43                                     ` Marcelo Tosatti
  2008-01-17 18:56                                       ` Andi Kleen
  2008-01-20 14:59                                       ` Avi Kivity
@ 2008-05-04 15:40                                       ` Avi Kivity
  2008-05-06 13:55                                         ` Avi Kivity
  2 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2008-05-04 15:40 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, Alexander Graf, Andi Kleen

[Resurrecting post from the dead]


Marcelo Tosatti wrote:
> Forcing clustered APIC mode works only on SMP, and there were high CPU
> consumption on Windows SMP guests due to C3 state being reported (fixed
> in kvm-30 something).
>
> So perhaps:
> - Faking clustered APIC on SMP 
> - Faking C3 on UP
>
> And turning of the TSC bit (for 32-bit guests).
>
> Is the way to go? 
>
> Avi, do you understand why C3 was causing the Windows SMP problems ?
>
>   

It's probably inb()ing on the port in a loop.  It's not SMP causing the 
problems, but the ACPI HAL.  I'll check this.

> /* Common C-state entry for C2, C3, .. */
> static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
> {
>         if (cstate->space_id == ACPI_CSTATE_FFH) {
>                 /* Call into architectural FFH based C-state */
>                 acpi_processor_ffh_cstate_enter(cstate);
>         } else {
>                 int unused;
>                 /* IO port based C-state */
>                 inb(cstate->address);
>                 /* Dummy wait op - must do something useless after P_LVL2 read
>                    because chipsets cannot guarantee that STPCLK# signal
>                    gets asserted in time to freeze execution properly. */
>                 unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
>         }
> }
>
> Clearly that inb() won't actually idle under QEMU. So the question is,
> if C3 stated is reported, that port read should be emulated... But how? 
>   

We can add now use the KVM_SET_MPSTATE ioctl to halt the vcpu if we see 
the port read.  Since not all hosts support setting mpstate, the bios 
should only report C3 if the host supports it.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] fix VMX TSC synchronicity
  2008-05-04 15:40                                       ` Avi Kivity
@ 2008-05-06 13:55                                         ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2008-05-06 13:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, Alexander Graf, Andi Kleen

Avi Kivity wrote:
> [Resurrecting post from the dead]
>
>
> Marcelo Tosatti wrote:
>> Forcing clustered APIC mode works only on SMP, and there were high CPU
>> consumption on Windows SMP guests due to C3 state being reported (fixed
>> in kvm-30 something).
>>
>> So perhaps:
>> - Faking clustered APIC on SMP - Faking C3 on UP
>>
>> And turning of the TSC bit (for 32-bit guests).
>>
>> Is the way to go?
>> Avi, do you understand why C3 was causing the Windows SMP problems ?
>>
>>   
>
> It's probably inb()ing on the port in a loop.  It's not SMP causing 
> the problems, but the ACPI HAL.  I'll check this.
>

Yes, it's reading 0xb010 and 0xb014, which ought to place the cpu in 
sleep mode, but don't.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2008-05-06 13:55 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-12  3:11 [RFC] fix VMX TSC synchronicity Will Trives
2008-01-12 12:28 ` Marcelo Tosatti
2008-01-12 13:48   ` Will Trives
2008-01-12 20:51     ` Avi Kivity
     [not found]       ` <4789284E.7000907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-19 18:58         ` [PATCH] kvm: qemu: remove SIGUSR2 from io_sigset Will Trives
2008-01-20  9:34           ` Avi Kivity
     [not found]             ` <4793158A.2080509-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-20 10:45               ` Avi Kivity
  -- strict thread matches above, loose matches on Subject: below --
2008-01-11 20:49 [RFC] fix VMX TSC synchronicity Marcelo Tosatti
2008-01-13 12:19 ` Avi Kivity
     [not found]   ` <478A01D1.7000402-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-14 16:06     ` Marcelo Tosatti
2008-01-14 20:46       ` Marcelo Tosatti
2008-01-15 14:59         ` Avi Kivity
2008-01-15 14:33       ` Avi Kivity
     [not found]         ` <478CC448.1030901-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-15 14:50           ` Alexander Graf
     [not found]             ` <478CC819.3040106-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>
2008-01-15 15:09               ` Avi Kivity
     [not found]                 ` <478CCCA9.2080300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-15 16:42                   ` Alexander Graf
     [not found]                     ` <478CE277.9010109-r27SGEef+tmzQB+pC5nmwQ@public.gmane.org>
2008-01-15 17:46                       ` Avi Kivity
     [not found]                         ` <478CF186.5030304-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-15 18:29                           ` Amit Shah
2008-01-16  5:51                   ` Andi Kleen
     [not found]                     ` <p73ir1ul3ls.fsf-KvMlXPVkKihbpigZmTR7Iw@public.gmane.org>
2008-01-16  8:46                       ` Avi Kivity
     [not found]                         ` <478DC453.1000404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-16 13:34                           ` Andi Kleen
     [not found]                             ` <478E08E5.2030507@qumranet.com>
     [not found]                               ` <478E08E5.2030507-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-16 13:54                                 ` Andi Kleen
     [not found]                                   ` <20080116135415.GA14664-qrUzlfsMFqo/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2008-01-16 13:56                                     ` Avi Kivity
2008-01-17 18:43                                     ` Marcelo Tosatti
2008-01-17 18:56                                       ` Andi Kleen
2008-01-20 14:59                                       ` Avi Kivity
2008-05-04 15:40                                       ` Avi Kivity
2008-05-06 13:55                                         ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox