public inbox for kvm@vger.kernel.org
* [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
@ 2016-06-04  0:42 Yunhong Jiang
  2016-06-04  0:42 ` [RFC PATCH V3 1/5] Rename the vmx_pre/post_block to pi_pre/post_block Yunhong Jiang
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Yunhong Jiang @ 2016-06-04  0:42 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

The VMX preemption timer is a VMX feature: it counts down, from the value
loaded at VM entry, while in VMX non-root operation. When the timer
reaches zero, it stops counting and a VM exit occurs.

This patchset utilizes the VMX preemption timer for TSC deadline timer
virtualization. The VMX preemption timer is armed before VM entry if the
TSC deadline timer is enabled, and a VM exit happens when the virtual TSC
deadline timer expires.

When the vCPU thread is blocked because of the HLT instruction, the TSC
deadline timer virtualization is switched to the current solution, i.e. a
host hrtimer. It is switched back to the VMX preemption timer when the vCPU
thread is unblocked.

This solution replaces the OS's complex hrtimer machinery, and the host
timer interrupt handling cost, with a single preemption_timer VM exit. It
fits well for NFV usage scenarios where the vCPU is bound to a pCPU and the
pCPU is isolated, or similar setups.

It adds a little latency to each VM entry, because the preemption timer has
to be set up every time.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>

Performance Evaluation:
Host:
[nfv@otcnfv02 ~]$ cat /proc/cpuinfo
....
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Guest:
Two vCPUs, each pinned to an isolated pCPU, idle=poll on the guest kernel.
When the vCPUs are not pinned, the benefit is smaller than in the pinned case.

Test tools:
cyclictest [1] running for 10 minutes with a 1ms interval, i.e. 600000 loops
in total.

1. enable_hv_timer=Y.

# Histogram
......
000003 000000
000004 002174
000005 042961
000006 479383
000007 071123
000008 003720
000009 000467
000010 000078
000011 000011
000012 000009
......
# Min Latencies: 00004
# Avg Latencies: 00007

2. enable_hv_timer=N.

# Histogram
......
000003 000000
000004 000000
000005 000042
000006 000772
000007 008262
000008 200759
000009 381126
000010 008056
000011 000234
000012 000367
......
# Min Latencies: 00005
# Avg Latencies: 00010

Changes since v2 [3]:
* Switch on the HLT instruction instead of sched_out/sched_in.
* The VMX preemption timer is broken on some CPUs; added a check for them.
* Reduce the overhead on the VM-entry code path: the host deadline TSC is
  calculated in advance and the VMCS exec_control is set earlier.
* Add TSC scaling support. This code path is untested because we are still
  looking for a platform with the TSC scaling capability.
* Check whether the host TSC delta, after the multiplication, exceeds 32
  bits, the width of the VMCS field.

Changes since v1 [2]:
* Remove vmx_sched_out and the corresponding kvm_x86_ops changes.
* Remove the two expired-timer checks on each VM entry.
* Rename hwemul_timer to hv_timer.
* Clear the vmx_x86_ops members if the preemption timer is not usable.
* Cache cpu_preemption_timer_multi.
* Keep the tracepoint with the function patch.
* Other minor changes based on Paolo's review.

[1] https://rt.wiki.kernel.org/index.php/Cyclictest
[2] http://www.spinics.net/lists/kvm/msg132895.html
[3] http://www.spinics.net/lists/kvm/msg133185.html

Yunhong Jiang (5):
  Rename the vmx_pre/post_block to pi_pre/post_block
  Add function for left shift and 64 bit division
  Utilize the vmx preemption timer
  Separate the start_sw_tscdeadline
  Utilize the vmx preemption timer for tsc deadline timer

 arch/x86/include/asm/div64.h    |  18 +++++
 arch/x86/include/asm/kvm_host.h |   8 ++
 arch/x86/kvm/lapic.c            | 121 ++++++++++++++++++++++------
 arch/x86/kvm/lapic.h            |   5 ++
 arch/x86/kvm/trace.h            |  16 ++++
 arch/x86/kvm/vmx.c              | 171 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  17 +++-
 7 files changed, 329 insertions(+), 27 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH V3 1/5] Rename the vmx_pre/post_block to pi_pre/post_block
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
@ 2016-06-04  0:42 ` Yunhong Jiang
  2016-06-04  0:42 ` [RFC PATCH V3 2/5] Add function for left shift and 64 bit division Yunhong Jiang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Yunhong Jiang @ 2016-06-04  0:42 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

Prepare for adding the HV timer switch to vmx_pre/post_block. The current
functions handle only posted interrupts, so rename them accordingly.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/kvm/vmx.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fb93010beaa4..51b08cd43bb7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10706,7 +10706,7 @@ static void vmx_enable_log_dirty_pt_masked(struct kvm *kvm,
  *   this case, return 1, otherwise, return 0.
  *
  */
-static int vmx_pre_block(struct kvm_vcpu *vcpu)
+static int pi_pre_block(struct kvm_vcpu *vcpu)
 {
 	unsigned long flags;
 	unsigned int dest;
@@ -10772,7 +10772,15 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static void vmx_post_block(struct kvm_vcpu *vcpu)
+static int vmx_pre_block(struct kvm_vcpu *vcpu)
+{
+	if (pi_pre_block(vcpu))
+		return 1;
+
+	return 0;
+}
+
+static void pi_post_block(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 	struct pi_desc old, new;
@@ -10813,6 +10821,11 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void vmx_post_block(struct kvm_vcpu *vcpu)
+{
+	pi_post_block(vcpu);
+}
+
 /*
  * vmx_update_pi_irte - set IRTE for Posted-Interrupts
  *
-- 
1.8.3.1



* [RFC PATCH V3 2/5] Add function for left shift and 64 bit division
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
  2016-06-04  0:42 ` [RFC PATCH V3 1/5] Rename the vmx_pre/post_block to pi_pre/post_block Yunhong Jiang
@ 2016-06-04  0:42 ` Yunhong Jiang
  2016-06-06 12:37   ` Paolo Bonzini
  2016-06-04  0:42 ` [RFC PATCH V3 3/5] Utilize the vmx preemption timer Yunhong Jiang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Yunhong Jiang @ 2016-06-04  0:42 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@intel.com>

Sometimes we need to convert a guest TSC value to a host TSC value, which is:
    host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
            << kvm_tsc_scaling_ratio_frac_bits)
            / vcpu->arch.tsc_scaling_ratio;
where guest_tsc and host_tsc are both 64 bit.

A helper function is provided for this conversion. Only x86_64 is
supported for now; a generic solution can be provided in the future if
needed.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/include/asm/div64.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index ced283ac79df..6937d6d4c81a 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -60,6 +60,24 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
 #define div_u64_rem	div_u64_rem
 
 #else
+#include <linux/types.h>
+/* (a << shift) / divisor, return 1 if overflow otherwise 0 */
+static inline int u64_shl_div_u64(u64 a, unsigned int shift,
+		u64 divisor, u64 *result)
+{
+	u64 low = a << shift, high = a >> (64 - shift);
+
+	/* To avoid the overflow on divq */
+	if (high >= divisor)
+		return 1;
+
+	/* low holds the quotient, high holds the remainder (discarded) */
+	asm("divq %2\n\t" : "=a" (low), "=d" (high) :
+		"rm" (divisor), "0" (low), "1" (high));
+	*result = low;
+
+	return 0;
+}
 # include <asm-generic/div64.h>
 #endif /* CONFIG_X86_32 */
 
-- 
1.8.3.1



* [RFC PATCH V3 3/5] Utilize the vmx preemption timer
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
  2016-06-04  0:42 ` [RFC PATCH V3 1/5] Rename the vmx_pre/post_block to pi_pre/post_block Yunhong Jiang
  2016-06-04  0:42 ` [RFC PATCH V3 2/5] Add function for left shift and 64 bit division Yunhong Jiang
@ 2016-06-04  0:42 ` Yunhong Jiang
  2016-06-06 12:37   ` Paolo Bonzini
  2016-06-04  0:42 ` [RFC PATCH V3 4/5] Separate the start_sw_tscdeadline Yunhong Jiang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Yunhong Jiang @ 2016-06-04  0:42 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

Add the basic VMX preemption timer functionality: check whether the
feature is supported, check whether it is broken on this CPU, and set
up/clear the VMX preemption timer.

Also add a module parameter that controls whether the VMX preemption
timer should be used.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |   8 +++
 arch/x86/kvm/vmx.c              | 120 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0fbe7e70dc1..2116b7b0b6c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -655,6 +655,11 @@ struct kvm_vcpu_arch {
 
 	int pending_ioapic_eoi;
 	int pending_external_vector;
+
+	/* the host TSC value when hv_deadline_tsc was set */
+	u64 hv_orig_tsc;
+	/* apic deadline value in host tsc */
+	u64 hv_deadline_tsc;
 };
 
 struct kvm_lpage_info {
@@ -1005,6 +1010,9 @@ struct kvm_x86_ops {
 	int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
 			      uint32_t guest_irq, bool set);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
+
+	int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc);
+	void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 51b08cd43bb7..3e407c6be171 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -110,6 +110,10 @@ module_param_named(pml, enable_pml, bool, S_IRUGO);
 
 #define KVM_VMX_TSC_MULTIPLIER_MAX     0xffffffffffffffffULL
 
+static int cpu_preemption_timer_multi;
+static bool __read_mostly enable_hv_timer;
+module_param_named(enable_hv_timer, enable_hv_timer, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
 #define KVM_VM_CR0_ALWAYS_ON						\
@@ -1056,6 +1060,61 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void)
 		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 }
 
+/*
+ * Comment's format: document - errata name - stepping - processor name.
+ * Refer from
+ * https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
+ */
+static u32 vmx_preemption_cpu_tfms[] = {
+/* 323344.pdf - BA86   - D0 - Xeon 7500 Series */
+0x000206E6,
+/* 323056.pdf - AAX65  - C2 - Xeon L3406 */
+/* 322814.pdf - AAT59  - C2 - i7-600, i5-500, i5-400 and i3-300 Mobile */
+/* 322911.pdf - AAU65  - C2 - i5-600, i3-500 Desktop and Pentium G6950 */
+0x00020652,
+/* 322911.pdf - AAU65  - K0 - i5-600, i3-500 Desktop and Pentium G6950 */
+0x00020655,
+/* 322373.pdf - AAO95  - B1 - Xeon 3400 Series */
+/* 322166.pdf - AAN92  - B1 - i7-800 and i5-700 Desktop */
+/*
+ * 320767.pdf - AAP86  - B1 -
+ * i7-900 Mobile Extreme, i7-800 and i7-700 Mobile
+ */
+0x000106E5,
+/* 321333.pdf - AAM126 - C0 - Xeon 3500 */
+0x000106A0,
+/* 321333.pdf - AAM126 - C1 - Xeon 3500 */
+0x000106A1,
+/* 320836.pdf - AAJ124 - C0 - i7-900 Desktop Extreme and i7-900 Desktop */
+0x000106A4,
+ /* 321333.pdf - AAM126 - D0 - Xeon 3500 */
+ /* 321324.pdf - AAK139 - D0 - Xeon 5500 */
+ /* 320836.pdf - AAJ124 - D0 - i7-900 Extreme and i7-900 Desktop */
+0x000106A5,
+};
+
+static inline bool cpu_has_broken_vmx_preemption_timer(void)
+{
+	u32 eax = cpuid_eax(0x00000001), i;
+
+	/* Clear the reserved bits */
+	eax &= ~(0x3U << 14 | 0xfU << 28);
+	for (i = 0; i < sizeof(vmx_preemption_cpu_tfms)/sizeof(u32); i++)
+		if (eax == vmx_preemption_cpu_tfms[i])
+			return true;
+
+	return false;
+}
+
+static inline bool cpu_has_vmx_preemption_timer(void)
+{
+	if (cpu_has_broken_vmx_preemption_timer())
+		return false;
+
+	return vmcs_config.pin_based_exec_ctrl &
+		PIN_BASED_VMX_PREEMPTION_TIMER;
+}
+
 static inline bool cpu_has_vmx_posted_intr(void)
 {
 	return IS_ENABLED(CONFIG_X86_LOCAL_APIC) &&
@@ -3308,7 +3367,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		return -EIO;
 
 	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
-	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
+	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR |
+		 PIN_BASED_VMX_PREEMPTION_TIMER;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
 				&_pin_based_exec_control) < 0)
 		return -EIO;
@@ -4781,6 +4841,8 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
 
 	if (!kvm_vcpu_apicv_active(&vmx->vcpu))
 		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
+	/* The preemption timer is enabled dynamically in vmx_set_hv_timer */
+	pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
 	return pin_based_exec_ctrl;
 }
 
@@ -6389,6 +6451,24 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
 	}
 
+	/*
+	 * We support only x86_64 platform now because guest_tsc and host_tsc
+	 * conversion is only done there yet.
+	 */
+#ifdef CONFIG_X86_64
+	if (cpu_has_vmx_preemption_timer() && enable_hv_timer) {
+		u64 vmx_msr;
+
+		rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
+		cpu_preemption_timer_multi =
+			 vmx_msr & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+	} else
+#endif
+	{
+		kvm_x86_ops->set_hv_timer = NULL;
+		kvm_x86_ops->cancel_hv_timer = NULL;
+	}
+
 	kvm_set_posted_intr_wakeup_handler(wakeup_handler);
 
 	return alloc_kvm_area();
@@ -10662,6 +10742,41 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	return X86EMUL_CONTINUE;
 }
 
+static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
+{
+	u64 tscl = rdtsc(), delta_tsc;
+
+	delta_tsc = guest_deadline_tsc - kvm_read_l1_tsc(vcpu, tscl);
+
+	/* Convert to host delta tsc if tsc scaling is enabled */
+	if (vcpu->arch.tsc_scaling_ratio &&
+			u64_shl_div_u64(delta_tsc,
+				kvm_tsc_scaling_ratio_frac_bits,
+				vcpu->arch.tsc_scaling_ratio,
+				&delta_tsc))
+		return -1;
+	/*
+	 * If the delta TSC can't fit in 32 bits after the shift by the
+	 * preemption timer rate, we can't use the preemption timer.
+	 * It might fit by the time VM entry actually happens, but checking
+	 * on every VM entry is costly, so fail earlier.
+	 */
+	if (delta_tsc >> (cpu_preemption_timer_multi + 32))
+		return -1;
+
+	vcpu->arch.hv_orig_tsc = tscl;
+	vcpu->arch.hv_deadline_tsc = tscl + delta_tsc;
+	vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
+			PIN_BASED_VMX_PREEMPTION_TIMER);
+	return 0;
+}
+
+static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+	vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+			PIN_BASED_VMX_PREEMPTION_TIMER);
+}
+
 static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	if (ple_gap)
@@ -11038,6 +11153,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.pmu_ops = &intel_pmu_ops,
 
 	.update_pi_irte = vmx_update_pi_irte,
+
+	.set_hv_timer = vmx_set_hv_timer,
+	.cancel_hv_timer = vmx_cancel_hv_timer,
 };
 
 static int __init vmx_init(void)
-- 
1.8.3.1



* [RFC PATCH V3 4/5] Separate the start_sw_tscdeadline
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
                   ` (2 preceding siblings ...)
  2016-06-04  0:42 ` [RFC PATCH V3 3/5] Utilize the vmx preemption timer Yunhong Jiang
@ 2016-06-04  0:42 ` Yunhong Jiang
  2016-06-04  0:42 ` [RFC PATCH V3 5/5] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Yunhong Jiang @ 2016-06-04  0:42 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

The function that starts the software TSC deadline timer will also be used
when switching from the hv_timer back to the software timer, so split it
out into a separate function. No logic changes.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/kvm/lapic.c | 57 ++++++++++++++++++++++++++++------------------------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bbb5b283ff63..f1cf8a5ede11 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1313,6 +1313,36 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		__delay(tsc_deadline - guest_tsc);
 }
 
+static void start_sw_tscdeadline(struct kvm_lapic *apic)
+{
+	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
+	u64 ns = 0;
+	ktime_t expire;
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	unsigned long flags;
+	ktime_t now;
+
+	if (unlikely(!tscdeadline || !this_tsc_khz))
+		return;
+
+	local_irq_save(flags);
+
+	now = apic->lapic_timer.timer.base->get_time();
+	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	if (likely(tscdeadline > guest_tsc)) {
+		ns = (tscdeadline - guest_tsc) * 1000000ULL;
+		do_div(ns, this_tsc_khz);
+		expire = ktime_add_ns(now, ns);
+		expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
+		hrtimer_start(&apic->lapic_timer.timer,
+				expire, HRTIMER_MODE_ABS_PINNED);
+	} else
+		apic_timer_expired(apic);
+
+	local_irq_restore(flags);
+}
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
 	ktime_t now;
@@ -1359,32 +1389,7 @@ static void start_apic_timer(struct kvm_lapic *apic)
 			   ktime_to_ns(ktime_add_ns(now,
 					apic->lapic_timer.period)));
 	} else if (apic_lvtt_tscdeadline(apic)) {
-		/* lapic timer in tsc deadline mode */
-		u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
-		u64 ns = 0;
-		ktime_t expire;
-		struct kvm_vcpu *vcpu = apic->vcpu;
-		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
-		unsigned long flags;
-
-		if (unlikely(!tscdeadline || !this_tsc_khz))
-			return;
-
-		local_irq_save(flags);
-
-		now = apic->lapic_timer.timer.base->get_time();
-		guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
-		if (likely(tscdeadline > guest_tsc)) {
-			ns = (tscdeadline - guest_tsc) * 1000000ULL;
-			do_div(ns, this_tsc_khz);
-			expire = ktime_add_ns(now, ns);
-			expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
-			hrtimer_start(&apic->lapic_timer.timer,
-				      expire, HRTIMER_MODE_ABS_PINNED);
-		} else
-			apic_timer_expired(apic);
-
-		local_irq_restore(flags);
+		start_sw_tscdeadline(apic);
 	}
 }
 
-- 
1.8.3.1



* [RFC PATCH V3 5/5] Utilize the vmx preemption timer for tsc deadline timer
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
                   ` (3 preceding siblings ...)
  2016-06-04  0:42 ` [RFC PATCH V3 4/5] Separate the start_sw_tscdeadline Yunhong Jiang
@ 2016-06-04  0:42 ` Yunhong Jiang
  2016-06-06 12:36   ` Paolo Bonzini
  2016-06-06 12:45 ` [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Paolo Bonzini
  2016-06-08  4:18 ` Wanpeng Li
  6 siblings, 1 reply; 15+ messages in thread
From: Yunhong Jiang @ 2016-06-04  0:42 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

Utilize the VMX preemption timer for TSC deadline timer
virtualization. The VMX preemption timer is armed while the vCPU is
running, and a VM exit happens when the virtual TSC deadline timer
expires.

When the vCPU thread is blocked because of HLT, the TSC deadline timer
virtualization is switched to the current solution, i.e. a host
hrtimer. It is switched back to the VMX preemption timer when the vCPU
thread is unblocked.

This solution avoids the OS's complex hrtimer machinery, and the host
timer interrupt handling cost, in favor of a single preemption_timer VM
exit. It fits well for NFV usage scenarios where the vCPU is bound to a
pCPU and the pCPU is isolated, or similar setups.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/kvm/lapic.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/lapic.h |  5 ++++
 arch/x86/kvm/trace.h | 16 ++++++++++++
 arch/x86/kvm/vmx.c   | 34 +++++++++++++++++++++++++
 arch/x86/kvm/x86.c   | 17 ++++++++++++-
 5 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f1cf8a5ede11..aedbf60846c4 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1343,6 +1343,67 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
 	local_irq_restore(flags);
 }
 
+bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.apic->lapic_timer.hv_timer_in_use;
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_hv_timer_in_use);
+
+void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	WARN_ON(!apic->lapic_timer.hv_timer_in_use);
+	WARN_ON(swait_active(&vcpu->wq));
+	kvm_x86_ops->cancel_hv_timer(vcpu);
+	apic->lapic_timer.hv_timer_in_use = 0;
+	apic_timer_expired(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
+
+void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	WARN_ON(apic->lapic_timer.hv_timer_in_use);
+
+	if (apic_lvtt_tscdeadline(apic) &&
+	    !atomic_read(&apic->lapic_timer.pending)) {
+		u64 tscdeadline = apic->lapic_timer.tscdeadline;
+
+		if (!kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
+			apic->lapic_timer.hv_timer_in_use = true;
+			hrtimer_cancel(&apic->lapic_timer.timer);
+		}
+
+		/* In case the sw timer fired in the small window above */
+		if (atomic_read(&apic->lapic_timer.pending) &&
+				apic->lapic_timer.hv_timer_in_use)
+			kvm_x86_ops->cancel_hv_timer(apic->vcpu);
+		trace_kvm_hv_timer_state(vcpu->vcpu_id,
+				apic->lapic_timer.hv_timer_in_use);
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer);
+
+void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	/* Possibly the TSC deadline timer is not enabled yet */
+	if (!apic->lapic_timer.hv_timer_in_use)
+		return;
+
+	kvm_x86_ops->cancel_hv_timer(vcpu);
+	apic->lapic_timer.hv_timer_in_use = false;
+
+	if (atomic_read(&apic->lapic_timer.pending))
+		return;
+
+	start_sw_tscdeadline(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_sw_timer);
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
 	ktime_t now;
@@ -1389,7 +1450,16 @@ static void start_apic_timer(struct kvm_lapic *apic)
 			   ktime_to_ns(ktime_add_ns(now,
 					apic->lapic_timer.period)));
 	} else if (apic_lvtt_tscdeadline(apic)) {
-		start_sw_tscdeadline(apic);
+		/* lapic timer in tsc deadline mode */
+		u64 tscdeadline = apic->lapic_timer.tscdeadline;
+
+		if (kvm_x86_ops->set_hv_timer &&
+		    !kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
+			apic->lapic_timer.hv_timer_in_use = true;
+			trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
+					apic->lapic_timer.hv_timer_in_use);
+		} else
+			start_sw_tscdeadline(apic);
 	}
 }
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 891c6da7d4aa..336ba51bb16e 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -20,6 +20,7 @@ struct kvm_timer {
 	u64 tscdeadline;
 	u64 expired_tscdeadline;
 	atomic_t pending;			/* accumulated triggered timers */
+	bool hv_timer_in_use;
 };
 
 struct kvm_lapic {
@@ -212,4 +213,8 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
 			struct kvm_vcpu **dest_vcpu);
 int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
 			const unsigned long *bitmap, u32 bitmap_size);
+void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
+bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 8de925031b5c..58bc0d68e933 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -6,6 +6,7 @@
 #include <asm/svm.h>
 #include <asm/clocksource.h>
 #include <asm/pvclock-abi.h>
+#include <lapic.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -1348,6 +1349,21 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
 		  __entry->vec)
 );
 
+TRACE_EVENT(kvm_hv_timer_state,
+		TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_in_use),
+		TP_ARGS(vcpu_id, hv_timer_in_use),
+		TP_STRUCT__entry(
+			__field(unsigned int, vcpu_id)
+			__field(unsigned int, hv_timer_in_use)
+			),
+		TP_fast_assign(
+			__entry->vcpu_id = vcpu_id;
+			__entry->hv_timer_in_use = hv_timer_in_use;
+			),
+		TP_printk("vcpu_id %x hv_timer %x\n",
+			__entry->vcpu_id,
+			__entry->hv_timer_in_use)
+);
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3e407c6be171..9948797b65e5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7644,6 +7644,11 @@ static int handle_pcommit(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_preemption_timer(struct kvm_vcpu *vcpu)
+{
+	kvm_lapic_expired_hv_timer(vcpu);
+	return 1;
+}
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7695,6 +7700,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
 	[EXIT_REASON_PML_FULL]		      = handle_pml_full,
 	[EXIT_REASON_PCOMMIT]                 = handle_pcommit,
+	[EXIT_REASON_PREEMPTION_TIMER]	      = handle_preemption_timer,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -8703,6 +8709,26 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 					msrs[i].host);
 }
 
+void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 tscl;
+	u32 delta_tsc;
+
+	if (!apic->lapic_timer.hv_timer_in_use)
+		return;
+
+	tscl = rdtsc();
+	if (vcpu->arch.hv_deadline_tsc > tscl)
+		/* sure to be 32 bit only because checked on set_hv_timer */
+		delta_tsc = (u32)((vcpu->arch.hv_deadline_tsc - tscl) >>
+			cpu_preemption_timer_multi);
+	else
+		delta_tsc = 0;
+
+	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
+}
+
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -8752,6 +8778,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	atomic_switch_perf_msrs(vmx);
 	debugctlmsr = get_debugctlmsr();
 
+	vmx_arm_hv_timer(vcpu);
+
 	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
@@ -10892,6 +10920,9 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
 	if (pi_pre_block(vcpu))
 		return 1;
 
+	if (kvm_lapic_hv_timer_in_use(vcpu))
+		kvm_lapic_switch_to_sw_timer(vcpu);
+
 	return 0;
 }
 
@@ -10938,6 +10969,9 @@ static void pi_post_block(struct kvm_vcpu *vcpu)
 
 static void vmx_post_block(struct kvm_vcpu *vcpu)
 {
+	if (kvm_x86_ops->set_hv_timer)
+		kvm_lapic_switch_to_hv_timer(vcpu);
+
 	pi_post_block(vcpu);
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 902d9da12392..a75d1437426c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2735,10 +2735,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}
 
 	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
+		u64 tscl = rdtsc();
 		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
-				rdtsc() - vcpu->arch.last_host_tsc;
+				tscl - vcpu->arch.last_host_tsc;
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
+
+		/*
+		 * If the TSC went backwards, we need to update hv_deadline_tsc;
+		 * otherwise ((hv_deadline_tsc - tsc) >>
+		 * cpu_preemption_timer_multi) may exceed 32 bits in vcpu_run().
+		 * This may make deadline_tsc less accurate, but that is
+		 * inevitable anyway.
+		 */
+		if (tscl < vcpu->arch.hv_orig_tsc) {
+			vcpu->arch.hv_deadline_tsc -=
+				vcpu->arch.hv_orig_tsc - tscl;
+			vcpu->arch.hv_orig_tsc = tscl;
+		}
+
 		if (check_tsc_unstable()) {
 			u64 offset = kvm_compute_tsc_offset(vcpu,
 						vcpu->arch.last_guest_tsc);
-- 
1.8.3.1



* Re: [RFC PATCH V3 5/5] Utilize the vmx preemption timer for tsc deadline timer
  2016-06-04  0:42 ` [RFC PATCH V3 5/5] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
@ 2016-06-06 12:36   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2016-06-06 12:36 UTC (permalink / raw)
  To: Yunhong Jiang, kvm; +Cc: mtosatti, rkrcmar, kernellwp



On 04/06/2016 02:42, Yunhong Jiang wrote:
> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	WARN_ON(!apic->lapic_timer.hv_timer_in_use);
> +	WARN_ON(swait_active(&vcpu->wq));
> +	kvm_x86_ops->cancel_hv_timer(vcpu);
> +	apic->lapic_timer.hv_timer_in_use = 0;

Please use "false" here instead of 0.

> +	apic_timer_expired(apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
> +

> 
> +		if (!kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
> +			apic->lapic_timer.hv_timer_in_use = true;
> +			hrtimer_cancel(&apic->lapic_timer.timer);
> +		}
> +
> +		/* In case the sw timer fired in the small window above */
> +		if (atomic_read(&apic->lapic_timer.pending) &&
> +				apic->lapic_timer.hv_timer_in_use)
> +			kvm_x86_ops->cancel_hv_timer(apic->vcpu);

Put this "if" after hrtimer_cancel, and you can avoid the "&&
apic->lapic_timer.hv_timer_in_use".  Also, I think it's clearer to add
an extra

	apic->lapic_timer.hv_timer_in_use = false;

here, since you have it in kvm_lapic_expired_hv_timer.

> +void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 tscl;
> +	u32 delta_tsc;
> +
> +	if (!apic->lapic_timer.hv_timer_in_use)

Can you change this to something like vcpu->arch.hv_deadline_tsc != -1
(by adjusting set_hv_timer and cancel_hv_timer)?

> +		return;
> +
> +	tscl = rdtsc();
> +	if (vcpu->arch.hv_deadline_tsc > tscl)
> +		/* sure to be 32 bit only because checked on set_hv_timer */

	/* vmx_set_hv_timer checks that the delta fits in 32 bits */

> +		delta_tsc = (u32)((vcpu->arch.hv_deadline_tsc - tscl) >>
> +			cpu_preemption_timer_multi);
> +	else
> +		delta_tsc = 0;
> +
> +	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
> +}
> +

...

> +		u64 tscl = rdtsc();
>  		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> -				rdtsc() - vcpu->arch.last_host_tsc;
> +				tscl - vcpu->arch.last_host_tsc;
>  		if (tsc_delta < 0)
>  			mark_tsc_unstable("KVM discovered backwards TSC");
> +
> +		/*
> +		 * If the TSC went backwards, we need to update hv_deadline_tsc;
> +		 * otherwise ((hv_deadline_tsc - tsc) >>
> +		 * cpu_preemption_timer_multi) may exceed 32 bits in vcpu_run().
> +		 * This may make deadline_tsc less accurate, but that is
> +		 * inevitable anyway.
> +		 */
> +		if (tscl < vcpu->arch.hv_orig_tsc) {
> +			vcpu->arch.hv_deadline_tsc -=
> +				vcpu->arch.hv_orig_tsc - tscl;
> +			vcpu->arch.hv_orig_tsc = tscl;
> +		}

Can you instead call vmx_set_hv_timer again (and if it returns an error,
call kvm_lapic_switch_to_sw_timer)?  Perhaps it should even be done
unconditionally on vmx_vcpu_load.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 2/5] Add function for left shift and 64 bit division
  2016-06-04  0:42 ` [RFC PATCH V3 2/5] Add function for left shift and 64 bit division Yunhong Jiang
@ 2016-06-06 12:37   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2016-06-06 12:37 UTC (permalink / raw)
  To: Yunhong Jiang, kvm; +Cc: mtosatti, rkrcmar, kernellwp



On 04/06/2016 02:42, Yunhong Jiang wrote:
> From: Yunhong Jiang <yunhong.jiang@intel.com>
> 
> Sometimes we need to convert from guest tsc to host tsc, which is:
>     host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
>             << kvm_tsc_scaling_ratio_frac_bits)
>             / vcpu->arch.tsc_scaling_ratio;
> where guest_tsc and host_tsc are both 64 bit.
> 
> A helper function is provided to achieve this conversion. Only supported
> on x86_64 platform now. A generic solution can be provided in future if
> needed.
> 
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
>  arch/x86/include/asm/div64.h | 18 ++++++++++++++++++

Please put this in vmx.c instead.  You can merge it with patch 3, even.

Thanks,

Paolo

>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index ced283ac79df..6937d6d4c81a 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -60,6 +60,24 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
>  #define div_u64_rem	div_u64_rem
>  
>  #else
> +#include <linux/types.h>
> +/* (a << shift) / divisor, return 1 if overflow otherwise 0 */
> +static inline int u64_shl_div_u64(u64 a, unsigned int shift,
> +		u64 divisor, u64 *result)
> +{
> +	u64 low = a << shift, high = a >> (64 - shift);
> +
> +	/* To avoid the overflow on divq */
> +	if (high > divisor)
> +		return 1;
> +
> +	/* Low holds the result, high holds the remainder, which is discarded */
> +	asm("divq %2\n\t" : "=a" (low), "=d" (high) :
> +		"rm" (divisor), "0" (low), "1" (high));
> +	*result = low;
> +
> +	return 0;
> +}
>  # include <asm-generic/div64.h>
>  #endif /* CONFIG_X86_32 */
>  
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 3/5] Utilize the vmx preemption timer
  2016-06-04  0:42 ` [RFC PATCH V3 3/5] Utilize the vmx preemption timer Yunhong Jiang
@ 2016-06-06 12:37   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2016-06-06 12:37 UTC (permalink / raw)
  To: Yunhong Jiang, kvm; +Cc: mtosatti, rkrcmar, kernellwp



On 04/06/2016 02:42, Yunhong Jiang wrote:
> +static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
> +{
> +	u64 tscl = rdtsc(), delta_tsc;
> +
> +	delta_tsc = guest_deadline_tsc - kvm_read_l1_tsc(vcpu, tscl);
> +
> +	/* Convert to host delta tsc if tsc scaling is enabled */
> +	if (vcpu->arch.tsc_scaling_ratio &&
> +			u64_shl_div_u64(delta_tsc,
> +				kvm_tsc_scaling_ratio_frac_bits,
> +				vcpu->arch.tsc_scaling_ratio,
> +				&delta_tsc))
> +		return -1;

Please return -EOVERFLOW or -ERANGE.  It's just aesthetics, but in
Linux, functions usually do not return -1.

Thanks,

Paolo

> +	/*
> +	 * If the delta tsc can't be fit in the 32 bit after the multi shift,
> +	 * we can't use the preemption timer.
> +	 * It's possible that it can be fit when vmentry happens late, but
> +	 * checking on every vmentry is costly, so fail earlier.
> +	 */
> +	if (delta_tsc >> (cpu_preemption_timer_multi + 32))
> +		return -1;
> +
> +	vcpu->arch.hv_orig_tsc = tscl;
> +	vcpu->arch.hv_deadline_tsc = tscl + delta_tsc;
> +	vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
> +			PIN_BASED_VMX_PREEMPTION_TIMER);
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
                   ` (4 preceding siblings ...)
  2016-06-04  0:42 ` [RFC PATCH V3 5/5] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
@ 2016-06-06 12:45 ` Paolo Bonzini
  2016-06-06 18:26   ` yunhong jiang
                     ` (2 more replies)
  2016-06-08  4:18 ` Wanpeng Li
  6 siblings, 3 replies; 15+ messages in thread
From: Paolo Bonzini @ 2016-06-06 12:45 UTC (permalink / raw)
  To: Yunhong Jiang, kvm; +Cc: mtosatti, rkrcmar, kernellwp, David Matlack



On 04/06/2016 02:42, Yunhong Jiang wrote:
> It adds a little bit latency for each VM-entry because we need setup the
> preemption timer each time.

Really it doesn't according to your tests:

> 1. enable_hv_timer=Y.
> 
> 000004 002174
> 000005 042961
> 000006 479383
> 000007 071123
> 000008 003720
> 
> 2. enable_hv_timer=N.
> 
> # Histogram
> ......
> 000005 000042
> 000006 000772
> 000007 008262
> 000008 200759
> 000009 381126
> 000010 008056

So perhaps you can replace that paragraph with "The benefits offset the
small extra work to do on each VM-entry to setup the preemption timer".

I'll play with this patch and kvm-unit-tests in the next few days.

David, it would be great if you could also try this on your
message-passing benchmarks (e.g. TCP_RR).  On one hand they are heavy on
vmexits, on the other hand they also have many expensive TSC deadline
WRMSRs.  I have requested a few small changes, but I am very happy with
the logic and the vmentry cost.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
  2016-06-06 12:45 ` [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Paolo Bonzini
@ 2016-06-06 18:26   ` yunhong jiang
  2016-06-07 16:31   ` David Matlack
  2016-06-07 19:30   ` David Matlack
  2 siblings, 0 replies; 15+ messages in thread
From: yunhong jiang @ 2016-06-06 18:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp, David Matlack

On Mon, 6 Jun 2016 14:45:37 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 04/06/2016 02:42, Yunhong Jiang wrote:
> > It adds a little bit latency for each VM-entry because we need
> > setup the preemption timer each time.
> 
> Really it doesn't according to your tests:
> 
> > 1. enable_hv_timer=Y.
> > 
> > 000004 002174
> > 000005 042961
> > 000006 479383
> > 000007 071123
> > 000008 003720
> > 
> > 2. enable_hv_timer=N.
> > 
> > # Histogram
> > ......
> > 000005 000042
> > 000006 000772
> > 000007 008262
> > 000008 200759
> > 000009 381126
> > 000010 008056
> 
> So perhaps you can replace that paragraph with "The benefits offset
> the small extra work to do on each VM-entry to setup the preemption
> timer".
> 
> I'll play with this patch and kvm-unit-tests in the next few days.
> 
> David, it would be great if you could also try this on your
> message-passing benchmarks (e.g. TCP_RR).  On one hand they are heavy
> on vmexits, on the other hand they also have many expensive TSC
> deadline WRMSRs.  I have requested a few small changes, but I am very
> happy with the logic and the vmentry cost.
> 
> Thanks,

Paolo, thanks a lot for the feedback. I will get a system with TSC scaling,
try it there, and then update the patch accordingly.

Thanks
--jyh

> 
> Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
  2016-06-06 12:45 ` [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Paolo Bonzini
  2016-06-06 18:26   ` yunhong jiang
@ 2016-06-07 16:31   ` David Matlack
  2016-06-07 19:30   ` David Matlack
  2 siblings, 0 replies; 15+ messages in thread
From: David Matlack @ 2016-06-07 16:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Yunhong Jiang, kvm list, Marcelo Tosatti,
	Radim Krčmář, Wanpeng Li

On Mon, Jun 6, 2016 at 5:45 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> David, it would be great if you could also try this on your
> message-passing benchmarks (e.g. TCP_RR).  On one hand they are heavy on
> vmexits, on the other hand they also have many expensive TSC deadline
> WRMSRs.

Sure, will do.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
  2016-06-06 12:45 ` [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Paolo Bonzini
  2016-06-06 18:26   ` yunhong jiang
  2016-06-07 16:31   ` David Matlack
@ 2016-06-07 19:30   ` David Matlack
  2 siblings, 0 replies; 15+ messages in thread
From: David Matlack @ 2016-06-07 19:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Yunhong Jiang, kvm list, Marcelo Tosatti,
	Radim Krčmář, Wanpeng Li

On Mon, Jun 6, 2016 at 5:45 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 04/06/2016 02:42, Yunhong Jiang wrote:
>> It adds a little bit latency for each VM-entry because we need setup the
>> preemption timer each time.
>
> Really it doesn't according to your tests:
>
>> 1. enable_hv_timer=Y.
>>
>> 000004 002174
>> 000005 042961
>> 000006 479383
>> 000007 071123
>> 000008 003720
>>
>> 2. enable_hv_timer=N.
>>
>> # Histogram
>> ......
>> 000005 000042
>> 000006 000772
>> 000007 008262
>> 000008 200759
>> 000009 381126
>> 000010 008056
>
> So perhaps you can replace that paragraph with "The benefits offset the
> small extra work to do on each VM-entry to setup the preemption timer".
>
> I'll play with this patch and kvm-unit-tests in the next few days.

Let me know how this goes, especially vmexit.c with enable_hv_timer=Y. It's
turning out to be non-trivial to get this patchset into a kernel that works
with my test setup. But if you find any regressions I can spend some
more time getting it working.

>
> David, it would be great if you could also try this on your
> message-passing benchmarks (e.g. TCP_RR).  On one hand they are heavy on
> vmexits, on the other hand they also have many expensive TSC deadline
> WRMSRs.  I have requested a few small changes, but I am very happy with
> the logic and the vmentry cost.
>
> Thanks,
>
> Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
  2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
                   ` (5 preceding siblings ...)
  2016-06-06 12:45 ` [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Paolo Bonzini
@ 2016-06-08  4:18 ` Wanpeng Li
  2016-06-13 20:56   ` yunhong jiang
  6 siblings, 1 reply; 15+ messages in thread
From: Wanpeng Li @ 2016-06-08  4:18 UTC (permalink / raw)
  To: Yunhong Jiang; +Cc: kvm, Marcelo Tosatti, Radim Krcmar, Paolo Bonzini

2016-06-04 8:42 GMT+08:00 Yunhong Jiang <yunhong.jiang@linux.intel.com>:
> The VMX-preemption timer is a feature on VMX, it counts down, from the
> value loaded by VM entry, in VMX nonroot operation. When the timer
> counts down to zero, it stops counting down and a VM exit occurs.
>
> This patchset utilize VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed before the vm-entry if the
> tsc deadline timer is enabled. A VMExit will happen if the virtual TSC
> deadline timer expires.
>
> When the vCPU thread is blocked because of HLT instruction, the tsc deadline
> timer virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the vCPU
> thread is unblocked.
>
> This solution replace the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits well
> for some NFV usage scenario, when the vCPU is bound to a pCPU and the pCPU
> is isolated, or some similar scenarioes.
>
> It adds a little bit latency for each VM-entry because we need setup the
> preemption timer each time.
>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
>
> Performance Evalaution:
> Host:
> [nfv@otcnfv02 ~]$ cat /proc/cpuinfo
> ....
> cpu family      : 6
> model           : 63
> model name      : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>
> Guest:
> Two vCPU with vCPU pinned to isolated pCPUs, idle=poll on guest kernel.
> When the vCPU is not pinned, the benefit is smaller than pinned situation.
>
> Test tools:
> cyclictest [1] running 10 minutes with 1ms interval, i.e. 600000 loop in
> total.
>
> 1. enable_hv_timer=Y.
>
> # Histogram
> ......
> 000003 000000
> 000004 002174
> 000005 042961
> 000006 479383
> 000007 071123
> 000008 003720
> 000009 000467
> 000010 000078
> 000011 000011
> 000012 000009
> ......
> # Min Latencies: 00004
> # Avg Latencies: 00007
>
> 2. enable_hv_timer=N.
>
> # Histogram
> ......
> 000003 000000
> 000004 000000
> 000005 000042
> 000006 000772
> 000007 008262
> 000008 200759
> 000009 381126
> 000010 008056
> 000011 000234
> 000012 000367
> ......
> # Min Latencies: 00005
> # Avg Latencies: 00010
>

I sometimes observe the cyclictest average overflowing in the guest.

policy: other/other: loadavg: 0.79 0.19 0.06 2/355 1872          999847623940096
policy: other/other: loadavg: 0.79 0.19 0.06 1/349 1883          629164130618368
T: 0 ( 1838) P: 0 I:1000 C:   5092 Min:      8 Act: -750 Avg:8495211086576766976
T: 0 ( 1838) P: 0 I:1000 C:   6934 Min:      8 Act: -878 Avg:-9223372036854775808 Max:      -3

Host:

 grep HZ /boot/config-`uname -r`
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_MACHZ_WDT=m

Guest(3.5, other kernel versions also can reproduce it):

grep HZ /boot/config-`uname -r`
CONFIG_NO_HZ=y
CONFIG_RCU_FAST_NO_HZ=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_MACHZ_WDT=m

Has anyone else seen this? Any tips on how to solve it?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization
  2016-06-08  4:18 ` Wanpeng Li
@ 2016-06-13 20:56   ` yunhong jiang
  0 siblings, 0 replies; 15+ messages in thread
From: yunhong jiang @ 2016-06-13 20:56 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm, Marcelo Tosatti, Radim Krcmar, Paolo Bonzini

On Wed, 8 Jun 2016 12:18:57 +0800
Wanpeng Li <kernellwp@gmail.com> wrote:

> 2016-06-04 8:42 GMT+08:00 Yunhong Jiang
> <yunhong.jiang@linux.intel.com>:
> > The VMX-preemption timer is a feature on VMX, it counts down, from
> > the value loaded by VM entry, in VMX nonroot operation. When the
> > timer counts down to zero, it stops counting down and a VM exit
> > occurs.
> >
> > This patchset utilize VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed before the
> > vm-entry if the tsc deadline timer is enabled. A VMExit will happen
> > if the virtual TSC deadline timer expires.
> >
> > When the vCPU thread is blocked because of HLT instruction, the tsc
> > deadline timer virtualization will be switched to use the current
> > solution, i.e. use the timer for it. It's switched back to VMX
> > preemption timer when the vCPU thread is unblocked.
> >
> > This solution replace the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenarioes.
> >
> > It adds a little bit latency for each VM-entry because we need
> > setup the preemption timer each time.
> >
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> >
> > Performance Evalaution:
> > Host:
> > [nfv@otcnfv02 ~]$ cat /proc/cpuinfo
> > ....
> > cpu family      : 6
> > model           : 63
> > model name      : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> >
> > Guest:
> > Two vCPU with vCPU pinned to isolated pCPUs, idle=poll on guest
> > kernel. When the vCPU is not pinned, the benefit is smaller than
> > pinned situation.
> >
> > Test tools:
> > cyclictest [1] running 10 minutes with 1ms interval, i.e. 600000
> > loop in total.
> >
> > 1. enable_hv_timer=Y.
> >
> > # Histogram
> > ......
> > 000003 000000
> > 000004 002174
> > 000005 042961
> > 000006 479383
> > 000007 071123
> > 000008 003720
> > 000009 000467
> > 000010 000078
> > 000011 000011
> > 000012 000009
> > ......
> > # Min Latencies: 00004
> > # Avg Latencies: 00007
> >
> > 2. enable_hv_timer=N.
> >
> > # Histogram
> > ......
> > 000003 000000
> > 000004 000000
> > 000005 000042
> > 000006 000772
> > 000007 008262
> > 000008 200759
> > 000009 381126
> > 000010 008056
> > 000011 000234
> > 000012 000367
> > ......
> > # Min Latencies: 00005
> > # Avg Latencies: 00010
> >
> 
> I sometimes observed that cyclictest avg overflow in guest.
> 
> policy: other/other: loadavg: 0.79 0.19 0.06 2/355 1872
> 999847623940096 policy: other/other: loadavg: 0.79 0.19 0.06 1/349
> 1883          629164130618368 T: 0 ( 1838) P: 0 I:1000 C:   5092
> Min:      8 Act: -750 Avg:8495211086576766976 T: 0 ( 1838) P: 0
> I:1000 C:   6934 Min:      8 Act: -878 Avg:-9223372036854775808
> Max:      -3
> 
> Host:
> 
>  grep HZ /boot/config-`uname -r`
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> # CONFIG_NO_HZ is not set
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> CONFIG_MACHZ_WDT=m
> 
> Guest(3.5, other kernel versions also can reproduce it):
> 
> grep HZ /boot/config-`uname -r`
> CONFIG_NO_HZ=y
> CONFIG_RCU_FAST_NO_HZ=y
> # CONFIG_HZ_100 is not set
> CONFIG_HZ_250=y
> # CONFIG_HZ_300 is not set
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=250
> CONFIG_MACHZ_WDT=m
> 
> Anyone meet such things? Any tips to solve it?

Sorry for the slow response. Did you try this with the patchset applied? Does it also happen without this patch?

If there is such an overflow, it usually means the guest timer is not accurate. I can try to find some hints if you provide your host information and your QEMU parameters.

Thanks
--jyh
  
> 
> Regards,
> Wanpeng Li


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-06-13 21:01 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-06-04  0:42 [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Yunhong Jiang
2016-06-04  0:42 ` [RFC PATCH V3 1/5] Rename the vmx_pre/post_block to pi_pre/post_block Yunhong Jiang
2016-06-04  0:42 ` [RFC PATCH V3 2/5] Add function for left shift and 64 bit division Yunhong Jiang
2016-06-06 12:37   ` Paolo Bonzini
2016-06-04  0:42 ` [RFC PATCH V3 3/5] Utilize the vmx preemption timer Yunhong Jiang
2016-06-06 12:37   ` Paolo Bonzini
2016-06-04  0:42 ` [RFC PATCH V3 4/5] Separate the start_sw_tscdeadline Yunhong Jiang
2016-06-04  0:42 ` [RFC PATCH V3 5/5] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
2016-06-06 12:36   ` Paolo Bonzini
2016-06-06 12:45 ` [RFC PATCH V3 0/5] Utilizing VMX preemption for timer virtualization Paolo Bonzini
2016-06-06 18:26   ` yunhong jiang
2016-06-07 16:31   ` David Matlack
2016-06-07 19:30   ` David Matlack
2016-06-08  4:18 ` Wanpeng Li
2016-06-13 20:56   ` yunhong jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox