From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zhai, Edwin"
Subject: Re: [PATCH] [RESEND] KVM:VMX: Add support for Pause-Loop Exiting
Date: Fri, 09 Oct 2009 18:03:20 +0800
Message-ID: <4ACF0A68.6040906@intel.com>
References: <4ABA2AD7.6080008@intel.com> <4ABA2C22.7020000@redhat.com>
 <4ABC18C6.9070202@intel.com> <4ABF2221.4000505@redhat.com>
 <4AC082F9.1060502@intel.com> <4AC20CDB.2070203@redhat.com>
 <4AC2ADFF.300@intel.com> <20090930162249.GA7440@amt.cnet>
 <20091002182840.GA6533@amt.cnet>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------070409080603000002050601"
Cc: Mark Langsdorf , "kvm@vger.kernel.org" , "Zhai, Edwin"
To: Marcelo Tosatti , Avi Kivity
Return-path: 
Received: from mga11.intel.com ([192.55.52.93]:58975 "EHLO mga11.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1754153AbZJIKEd (ORCPT ); Fri, 9 Oct 2009 06:04:33 -0400
In-Reply-To: <20091002182840.GA6533@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: 

This is a multi-part message in MIME format.
--------------070409080603000002050601
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Tosatti,
See attached patch.

Avi,
Could you please check it in if there are no other comments.

Thanks,

Marcelo Tosatti wrote:
> On Wed, Sep 30, 2009 at 01:22:49PM -0300, Marcelo Tosatti wrote:
>
>> On Wed, Sep 30, 2009 at 09:01:51AM +0800, Zhai, Edwin wrote:
>>
>>> Avi,
>>> I modified it according to your comments. The only thing I want to
>>> keep is the module params ple_gap/window. Although they are not
>>> per-guest, they can be used to find the right value, and to disable
>>> PLE for debug purposes.
>>>
>>> Thanks,
>>>
>>>
>>> Avi Kivity wrote:
>>>
>>>> On 09/28/2009 11:33 AM, Zhai, Edwin wrote:
>>>>
>>>>> Avi Kivity wrote:
>>>>>
>>>>>> +#define KVM_VMX_DEFAULT_PLE_GAP    41
>>>>>> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
>>>>>> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>>>>>> +module_param(ple_gap, int, S_IRUGO);
>>>>>> +
>>>>>> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>>>>>> +module_param(ple_window, int, S_IRUGO);
>>>>>>
>>>>>> Shouldn't be __read_mostly since they're read very rarely
>>>>>> (__read_mostly should be for variables that are very often read,
>>>>>> and rarely written).
>>>>>>
>>>>> In general, they are read-only, except that an experienced user may
>>>>> try different parameters for perf tuning.
>>>>>
>>>> __read_mostly doesn't just mean it's read mostly. It also means it's
>>>> read often. Otherwise it's just wasting space in hot cachelines.
>>>>
>>>>>> I'm not even sure they should be parameters.
>>>>>>
>>>>> For different spinlocks in different OSes, and for different
>>>>> workloads, we need different parameters for tuning. It's similar to
>>>>> enable_ept.
>>>>>
>>>> No, global parameters don't work for tuning workloads and guests since
>>>> they cannot be modified on a per-guest basis. enable_ept is only
>>>> useful for debugging and testing.
>>>>
>>>>>>> +	set_current_state(TASK_INTERRUPTIBLE);
>>>>>>> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>>>>>>> +
>>>>>>
>>>>>> Please add a tracepoint for this (since it can cause significant
>>>>>> change in behaviour),
>>>>>>
>>>>> Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE
>>>>> vmexit from other vmexits.
>>>>>
>>>> Right. I thought of the software spinlock detector, but that's another
>>>> problem.
>>>>
>>>> I think you can drop the sleep_time parameter; it can be part of the
>>>> function. Also, kvm_vcpu_sleep() is confusing, since we also sleep on
>>>> halt.
>>>> Please call it kvm_vcpu_on_spin() or something (since that's what the
>>>> guest is doing).
>>>>
>> kvm_vcpu_on_spin() should add the vcpu to vcpu->wq (so a new pending
>> interrupt wakes it up immediately).
>
> Updated version (also please send it separately from the vmx.c patch):
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 894a56e..43125dc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -231,6 +231,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
>  void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
>  
>  void kvm_vcpu_block(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
>  void kvm_resched(struct kvm_vcpu *vcpu);
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d0dd39..e788d70 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1479,6 +1479,21 @@ void kvm_resched(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_resched);
>  
> +void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
> +{
> +	ktime_t expires;
> +	DEFINE_WAIT(wait);
> +
> +	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> +
> +	/* Sleep for 100 us, and hope lock-holder got scheduled */
> +	expires = ktime_add_ns(ktime_get(), 100000UL);
> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
> +
> +	finish_wait(&vcpu->wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> +
>  static int kvm_vcpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vma->vm_file->private_data;

-- 
best rgds,
edwin

--------------070409080603000002050601
Content-Type: text/plain;
 name="kvm_ple_hrtimer_1.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="kvm_ple_hrtimer_1.patch"

KVM:VMX: Add support for Pause-Loop Exiting

New NHM processors will support Pause-Loop Exiting by adding 2
VM-execution control fields:
PLE_Gap    - upper
bound on the amount of time between two successive
             executions of PAUSE in a loop.
PLE_Window - upper bound on the amount of time a guest is allowed to
             execute in a PAUSE loop.

If the time between this execution of PAUSE and the previous one exceeds
PLE_Gap, the processor considers this PAUSE to belong to a new loop.
Otherwise, the processor determines the total execution time of this loop
(since the first PAUSE in this loop), and triggers a VM exit if the total
time exceeds PLE_Window.
* Refer to SDM volume 3b, sections 21.6.13 & 22.1.3.

Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one
VP is sched-out after holding a spinlock, and other VPs waiting for the
same lock are sched-in, wasting CPU time.

Our tests indicate that most spinlocks are held for less than 2^12 cycles.
Performance tests show that with 2X LP over-commitment we can get +2% perf
improvement for kernel build (even more perf gain with more LPs).

Signed-off-by: Zhai Edwin

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 272514c..2b49454 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -56,6 +56,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING           0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST       0x00000080
+#define SECONDARY_EXEC_PAUSE_LOOP_EXITING       0x00000400
 
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
@@ -144,6 +145,8 @@ enum vmcs_field {
 	VM_ENTRY_INSTRUCTION_LEN        = 0x0000401a,
 	TPR_THRESHOLD                   = 0x0000401c,
 	SECONDARY_VM_EXEC_CONTROL       = 0x0000401e,
+	PLE_GAP                         = 0x00004020,
+	PLE_WINDOW                      = 0x00004022,
 	VM_INSTRUCTION_ERROR            = 0x00004400,
 	VM_EXIT_REASON                  = 0x00004402,
 	VM_EXIT_INTR_INFO               = 0x00004404,
@@ -248,6 +251,7 @@ enum vmcs_field {
 #define EXIT_REASON_MSR_READ            31
 #define EXIT_REASON_MSR_WRITE           32
 #define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_PAUSE_INSTRUCTION   40
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 70020e5..93274c6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,6 +61,25 @@ module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
+/*
+ * These 2 parameters are used to config the controls for Pause-Loop Exiting:
+ * ple_gap:    upper bound on the amount of time between two successive
+ *             executions of PAUSE in a loop. Also indicates whether PLE is
+ *             enabled. Tests indicate that this time is usually smaller
+ *             than 41 cycles.
+ * ple_window: upper bound on the amount of time a guest is allowed to
+ *             execute in a PAUSE loop. Tests indicate that most spinlocks
+ *             are held for less than 2^12 cycles.
+ * Time is measured based on a counter that runs at the same rate as the TSC;
+ * refer to SDM volume 3b, sections 21.6.13 & 22.1.3.
+ */
+#define KVM_VMX_DEFAULT_PLE_GAP    41
+#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
+module_param(ple_gap, int, S_IRUGO);
+
+static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
+module_param(ple_window, int, S_IRUGO);
+
 struct vmcs {
 	u32 revision_id;
 	u32 abort;
@@ -319,6 +338,12 @@ static inline int cpu_has_vmx_unrestricted_guest(void)
 		SECONDARY_EXEC_UNRESTRICTED_GUEST;
 }
 
+static inline int cpu_has_vmx_ple(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+}
+
 static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
 {
 	return flexpriority_enabled &&
@@ -1240,7 +1265,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_WBINVD_EXITING |
 			SECONDARY_EXEC_ENABLE_VPID |
 			SECONDARY_EXEC_ENABLE_EPT |
-			SECONDARY_EXEC_UNRESTRICTED_GUEST;
+			SECONDARY_EXEC_UNRESTRICTED_GUEST |
+			SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -1386,6 +1412,9 @@ static __init int hardware_setup(void)
 	if (enable_ept && !cpu_has_vmx_ept_2m_page())
 		kvm_disable_largepages();
 
+	if (!cpu_has_vmx_ple())
+		ple_gap = 0;
+
 	return alloc_kvm_area();
 }
 
@@ -2298,9 +2327,16 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 			exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
 		if (!enable_unrestricted_guest)
 			exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+		if (!ple_gap)
+			exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
+	if (ple_gap) {
+		vmcs_write32(PLE_GAP, ple_gap);
+		vmcs_write32(PLE_WINDOW, ple_window);
+	}
+
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, !!bypass_guest_pf);
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -3348,6 +3384,19 @@ out:
 }
 
 /*
+ * Indicate a busy-waiting vcpu in spinlock. We do not enable unconditional
+ * PAUSE exiting, so we only get here on CPUs with Pause-Loop Exiting.
+ */
+static int handle_pause(struct kvm_vcpu *vcpu,
+			struct kvm_run *kvm_run)
+{
+	skip_emulated_instruction(vcpu);
+	kvm_vcpu_on_spin(vcpu);
+
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume. Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3383,6 +3432,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_MCE_DURING_VMENTRY] = handle_machine_check,
 	[EXIT_REASON_EPT_VIOLATION]      = handle_ept_violation,
 	[EXIT_REASON_EPT_MISCONFIG]      = handle_ept_misconfig,
+	[EXIT_REASON_PAUSE_INSTRUCTION]  = handle_pause,
 };
 
 static const int kvm_vmx_max_exit_handlers =

--------------070409080603000002050601
Content-Type: text/plain;
 name="kvm_ple_hrtimer_2.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="kvm_ple_hrtimer_2.patch"

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b985a29..bd5a616 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -286,6 +286,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
+void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
 void kvm_resched(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c0a929f..c4289c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1108,6 +1108,21 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
+void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+{
+	ktime_t expires;
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+
+	/* Sleep for 100 us, and hope lock-holder got scheduled */
+	expires = ktime_add_ns(ktime_get(), 100000UL);
+	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+	finish_wait(&vcpu->wq, &wait);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
+
 static int kvm_vcpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vma->vm_file->private_data;

--------------070409080603000002050601--