From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zhai, Edwin"
Subject: Re: [PATCH] [RESEND] KVM:VMX: Add support for Pause-Loop Exiting
Date: Fri, 09 Oct 2009 18:03:20 +0800
Message-ID: <4ACF0A68.6040906@intel.com>
References: <4ABA2AD7.6080008@intel.com> <4ABA2C22.7020000@redhat.com>
 <4ABC18C6.9070202@intel.com> <4ABF2221.4000505@redhat.com>
 <4AC082F9.1060502@intel.com> <4AC20CDB.2070203@redhat.com>
 <4AC2ADFF.300@intel.com> <20090930162249.GA7440@amt.cnet>
 <20091002182840.GA6533@amt.cnet>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------070409080603000002050601"
Cc: Mark Langsdorf , "kvm@vger.kernel.org" , "Zhai, Edwin"
To: Marcelo Tosatti , Avi Kivity
Return-path: 
Received: from mga11.intel.com ([192.55.52.93]:58975 "EHLO mga11.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1754153AbZJIKEd (ORCPT ); Fri, 9 Oct 2009 06:04:33 -0400
In-Reply-To: <20091002182840.GA6533@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: 

This is a multi-part message in MIME format.
--------------070409080603000002050601
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Tosatti,
See attached patch.

Avi,
Could you please check it in if there are no other comments.

Thanks,

Marcelo Tosatti wrote:
> On Wed, Sep 30, 2009 at 01:22:49PM -0300, Marcelo Tosatti wrote:
>
>> On Wed, Sep 30, 2009 at 09:01:51AM +0800, Zhai, Edwin wrote:
>>
>>> Avi,
>>> I modified it according to your comments. The only thing I want to
>>> keep is the module params ple_gap/window. Although they are not
>>> per-guest, they can be used to find the right value, and to disable
>>> PLE for debug purposes.
>>>
>>> Thanks,
>>>
>>>
>>> Avi Kivity wrote:
>>>
>>>> On 09/28/2009 11:33 AM, Zhai, Edwin wrote:
>>>>
>>>>> Avi Kivity wrote:
>>>>>
>>>>>> +#define KVM_VMX_DEFAULT_PLE_GAP    41
>>>>>> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
>>>>>> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>>>>>> +module_param(ple_gap, int, S_IRUGO);
>>>>>> +
>>>>>> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>>>>>> +module_param(ple_window, int, S_IRUGO);
>>>>>>
>>>>>> Shouldn't be __read_mostly since they're read very rarely
>>>>>> (__read_mostly should be for variables that are very often read,
>>>>>> and rarely written).
>>>>>>
>>>>> In general, they are read-only, except that an experienced user may
>>>>> try different parameters for perf tuning.
>>>>>
>>>> __read_mostly doesn't just mean it's read mostly. It also means it's
>>>> read often. Otherwise it's just wasting space in hot cachelines.
>>>>
>>>>>> I'm not even sure they should be parameters.
>>>>>>
>>>>> For different spinlocks in different OSes, and for different
>>>>> workloads, we need different parameters for tuning. It's similar to
>>>>> enable_ept.
>>>>>
>>>> No, global parameters don't work for tuning workloads and guests since
>>>> they cannot be modified on a per-guest basis. enable_ept is only
>>>> useful for debugging and testing.
>>>>
>>>>>>> +	set_current_state(TASK_INTERRUPTIBLE);
>>>>>>> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>>>>>>> +
>>>>>>
>>>>>> Please add a tracepoint for this (since it can cause significant
>>>>>> change in behaviour),
>>>>>>
>>>>> Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE
>>>>> vmexit from other vmexits.
>>>>>
>>>> Right. I thought of the software spinlock detector, but that's another
>>>> problem.
>>>>
>>>> I think you can drop the sleep_time parameter; it can be part of the
>>>> function. Also, kvm_vcpu_sleep() is confusing, since we also sleep on
>>>> halt.
>>>> Please call it kvm_vcpu_on_spin() or something (since that's what the
>>>> guest is doing).
>>>>
>> kvm_vcpu_on_spin() should add the vcpu to vcpu->wq (so a new pending
>> interrupt wakes it up immediately).
>
> Updated version (also please send it separately from the vmx.c patch):
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 894a56e..43125dc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -231,6 +231,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
>  void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
>  
>  void kvm_vcpu_block(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
>  void kvm_resched(struct kvm_vcpu *vcpu);
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d0dd39..e788d70 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1479,6 +1479,21 @@ void kvm_resched(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_resched);
>  
> +void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
> +{
> +	ktime_t expires;
> +	DEFINE_WAIT(wait);
> +
> +	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> +
> +	/* Sleep for 100 us, and hope lock-holder got scheduled */
> +	expires = ktime_add_ns(ktime_get(), 100000UL);
> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
> +
> +	finish_wait(&vcpu->wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> +
>  static int kvm_vcpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vma->vm_file->private_data;

-- 
best rgds,
edwin

--------------070409080603000002050601
Content-Type: text/plain;
 name="kvm_ple_hrtimer_1.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="kvm_ple_hrtimer_1.patch"

KVM:VMX: Add support for Pause-Loop Exiting

New NHM processors will support Pause-Loop Exiting by adding 2
VM-execution control fields:
PLE_Gap    - upper
bound on the amount of time between two successive
             executions of PAUSE in a loop.
PLE_Window - upper bound on the amount of time a guest is allowed to
             execute in a PAUSE loop.

If the time between this execution of PAUSE and the previous one exceeds
PLE_Gap, the processor considers this PAUSE to belong to a new loop.
Otherwise, the processor determines the total execution time of this loop
(since the first PAUSE in this loop), and triggers a VM exit if the total
time exceeds PLE_Window.
* Refer to SDM volume 3b, sections 21.6.13 & 22.1.3.

Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one
VP is sched-out after holding a spinlock, and other VPs waiting for the
same lock are sched-in, wasting CPU time.

Our tests indicate that most spinlocks are held for less than 2^12 cycles.
Performance tests show that with 2X LP over-commitment we can get +2% perf
improvement for kernel build (even more perf gain with more LPs).

Signed-off-by: Zhai Edwin

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 272514c..2b49454 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -56,6 +56,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING           0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST       0x00000080
+#define SECONDARY_EXEC_PAUSE_LOOP_EXITING       0x00000400
 
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
@@ -144,6 +145,8 @@ enum vmcs_field {
 	VM_ENTRY_INSTRUCTION_LEN        = 0x0000401a,
 	TPR_THRESHOLD                   = 0x0000401c,
 	SECONDARY_VM_EXEC_CONTROL       = 0x0000401e,
+	PLE_GAP                         = 0x00004020,
+	PLE_WINDOW                      = 0x00004022,
 	VM_INSTRUCTION_ERROR            = 0x00004400,
 	VM_EXIT_REASON                  = 0x00004402,
 	VM_EXIT_INTR_INFO               = 0x00004404,
@@ -248,6 +251,7 @@ enum vmcs_field {
 #define EXIT_REASON_MSR_READ            31
 #define EXIT_REASON_MSR_WRITE           32
 #define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_PAUSE_INSTRUCTION   40
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 70020e5..93274c6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,6 +61,25 @@ module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
+/*
+ * These 2 parameters are used to config the controls for Pause-Loop Exiting:
+ * ple_gap:    upper bound on the amount of time between two successive
+ *             executions of PAUSE in a loop. Also indicates whether PLE is
+ *             enabled. Tests indicate that this time is usually smaller
+ *             than 41 cycles.
+ * ple_window: upper bound on the amount of time a guest is allowed to
+ *             execute in a PAUSE loop. Tests indicate that most spinlocks
+ *             are held for less than 2^12 cycles.
+ * Time is measured based on a counter that runs at the same rate as the TSC;
+ * refer to SDM volume 3b, sections 21.6.13 & 22.1.3.
+ */
+#define KVM_VMX_DEFAULT_PLE_GAP    41
+#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
+module_param(ple_gap, int, S_IRUGO);
+
+static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
+module_param(ple_window, int, S_IRUGO);
+
 struct vmcs {
 	u32 revision_id;
 	u32 abort;
@@ -319,6 +338,12 @@ static inline int cpu_has_vmx_unrestricted_guest(void)
 		SECONDARY_EXEC_UNRESTRICTED_GUEST;
 }
 
+static inline int cpu_has_vmx_ple(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+}
+
 static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
 {
 	return flexpriority_enabled &&
@@ -1240,7 +1265,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_WBINVD_EXITING |
 			SECONDARY_EXEC_ENABLE_VPID |
 			SECONDARY_EXEC_ENABLE_EPT |
-			SECONDARY_EXEC_UNRESTRICTED_GUEST;
+			SECONDARY_EXEC_UNRESTRICTED_GUEST |
+			SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -1386,6 +1412,9 @@ static __init int hardware_setup(void)
 	if (enable_ept && !cpu_has_vmx_ept_2m_page())
 		kvm_disable_largepages();
 
+	if (!cpu_has_vmx_ple())
+		ple_gap = 0;
+
 	return alloc_kvm_area();
 }
 
@@ -2298,9 +2327,16 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 			exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
 		if (!enable_unrestricted_guest)
 			exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+		if (!ple_gap)
+			exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
+	if (ple_gap) {
+		vmcs_write32(PLE_GAP, ple_gap);
+		vmcs_write32(PLE_WINDOW, ple_window);
+	}
+
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, !!bypass_guest_pf);
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -3348,6 +3384,19 @@ out:
 }
 
 /*
+ * Indicate a busy-waiting vcpu in spinlock. We do not enable unconditional
+ * PAUSE exiting, so we only get here on CPUs with Pause-Loop Exiting.
+ */
+static int handle_pause(struct kvm_vcpu *vcpu,
+			struct kvm_run *kvm_run)
+{
+	skip_emulated_instruction(vcpu);
+	kvm_vcpu_on_spin(vcpu);
+
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume. Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3383,6 +3432,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_MCE_DURING_VMENTRY] = handle_machine_check,
 	[EXIT_REASON_EPT_VIOLATION]      = handle_ept_violation,
 	[EXIT_REASON_EPT_MISCONFIG]      = handle_ept_misconfig,
+	[EXIT_REASON_PAUSE_INSTRUCTION]  = handle_pause,
 };
 
 static const int kvm_vmx_max_exit_handlers =

--------------070409080603000002050601
Content-Type: text/plain;
 name="kvm_ple_hrtimer_2.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="kvm_ple_hrtimer_2.patch"

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b985a29..bd5a616 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -286,6 +286,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
+void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
 void kvm_resched(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c0a929f..c4289c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1108,6 +1108,21 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
+void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+{
+	ktime_t expires;
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+
+	/* Sleep for 100 us, and hope lock-holder got scheduled */
+	expires = ktime_add_ns(ktime_get(), 100000UL);
+	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+	finish_wait(&vcpu->wq, &wait);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
+
 static int kvm_vcpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vma->vm_file->private_data;

--------------070409080603000002050601--