[PATCH v3 0/7] Dynamic Pause Loop Exiting window.

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
@ 2014-08-21 16:08 Radim Krčmář
  2014-08-21 16:08 ` [PATCH v3 1/7] KVM: add kvm_arch_sched_in Radim Krčmář
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

v2 -> v3:
 * copy&paste frenzy [v3 4/7] (split modify_ple_window)
 * commented update_ple_window_actual_max [v3 4/7]
 * renamed shrinker to modifier [v3 4/7]
 * removed an extraneous max(new, ple_window) [v3 4/7] (should have been in v2)
 * changed tracepoint argument type, printing and macro abstractions [v3 5/7]
 * renamed ple_t to ple_int [v3 6/7] (visible in modinfo)
 * intelligent updates of ple_window [v3 7/7]

---
v1 -> v2:
 * squashed [v1 4/9] and [v1 5/9] (clamping)
 * dropped [v1 7/9] (CPP abstractions)
 * merged core of [v1 9/9] into [v1 4/9] (automatic maximum)
 * reworked kernel_param_ops: closer to pure int [v2 6/6]
 * introduced ple_window_actual_max & reworked clamping [v2 4/6]
 * added seqlock for parameter modifications [v2 6/6]

---
PLE does not scale in its current form.  When increasing VCPU count
above 150, one can hit soft lockups because of runqueue lock contention.
(Which says a lot about performance.)

The main reason is that kvm_ple_loop cycles through all VCPUs.
Replacing it with a scalable solution would be ideal, but it has already
been well optimized for various workloads, so this series tries to
alleviate one different major problem while minimizing a chance of
regressions: we have too many useless PLE exits.

Just increasing PLE window would help some cases, but it still spirals
out of control.  By increasing the window after every PLE exit, we can
limit the amount of useless ones, so we don't reach the state where CPUs
spend 99% of the time waiting for a lock.

HP confirmed that this series prevents soft lockups and TSC sync errors
on large guests.


Radim Krčmář (7):
  KVM: add kvm_arch_sched_in
  KVM: x86: introduce sched_in to kvm_x86_ops
  KVM: VMX: make PLE window per-VCPU
  KVM: VMX: dynamise PLE window
  KVM: trace kvm_ple_window grow/shrink
  KVM: VMX: runtime knobs for dynamic PLE window
  KVM: VMX: optimize ple_window updates to VMCS

 arch/arm/kvm/arm.c              |   4 ++
 arch/mips/kvm/mips.c            |   4 ++
 arch/powerpc/kvm/powerpc.c      |   4 ++
 arch/s390/kvm/kvm-s390.c        |   4 ++
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/svm.c              |   6 ++
 arch/x86/kvm/trace.h            |  30 ++++++++
 arch/x86/kvm/vmx.c              | 147 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   6 ++
 include/linux/kvm_host.h        |   2 +
 virt/kvm/kvm_main.c             |   2 +
 11 files changed, 207 insertions(+), 4 deletions(-)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v3 1/7] KVM: add kvm_arch_sched_in
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-21 18:49   ` Raghavendra K T
  2014-08-21 16:08 ` [PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops Radim Krčmář
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly simpler
code and guaranteed call order with respect to kvm_sched_in.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/arm/kvm/arm.c         | 4 ++++
 arch/mips/kvm/mips.c       | 4 ++++
 arch/powerpc/kvm/powerpc.c | 4 ++++
 arch/s390/kvm/kvm-s390.c   | 4 ++++
 arch/x86/kvm/x86.c         | 4 ++++
 include/linux/kvm_host.h   | 2 ++
 virt/kvm/kvm_main.c        | 2 ++
 7 files changed, 24 insertions(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a99e0cd..9f788eb 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->cpu = cpu;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index cd71141..2362df2 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 				  struct kvm_translation *tr)
 {
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4c79284..cbc432f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	kvmppc_subarch_vcpu_uninit(vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 #ifdef CONFIG_BOOKE
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ce81eb2..a3c324e 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	/* Nothing todo */
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	save_fp_ctl(&vcpu->arch.host_fpregs.fpc);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f1e22d..d7c214f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 		static_key_slow_dec(&kvm_no_apic_vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..ebd7236 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,8 @@ void kvm_arch_exit(void);
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..d3c3ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	if (vcpu->preempted)
 		vcpu->preempted = false;
 
+	kvm_arch_sched_in(vcpu, cpu);
+
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
  2014-08-21 16:08 ` [PATCH v3 1/7] KVM: add kvm_arch_sched_in Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-21 18:50   ` Raghavendra K T
  2014-08-21 16:08 ` [PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU Radim Krčmář
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

sched_in preempt notifier is available for x86, allow its use in
specific virtualization technlogies as well.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/svm.c              | 6 ++++++
 arch/x86/kvm/vmx.c              | 6 ++++++
 arch/x86/kvm/x86.c              | 1 +
 4 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..358e2f3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -772,6 +772,8 @@ struct kvm_x86_ops {
 	bool (*mpx_supported)(void);
 
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
+
+	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..4baf1bc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
 	local_irq_enable();
 }
 
+static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = {
 
 	.check_intercept = svm_check_intercept,
 	.handle_external_intr = svm_handle_external_intr,
+
+	.sched_in = svm_sched_in,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bfe11cf..2b306f9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	return X86EMUL_CONTINUE;
 }
 
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.mpx_supported = vmx_mpx_supported,
 
 	.check_nested_events = vmx_check_nested_events,
+
+	.sched_in = vmx_sched_in,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7c214f..5696ee7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
  2014-08-21 16:08 ` [PATCH v3 1/7] KVM: add kvm_arch_sched_in Radim Krčmář
  2014-08-21 16:08 ` [PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-21 18:52   ` Raghavendra K T
  2014-08-21 16:08 ` [PATCH v3 4/7] KVM: VMX: dynamise PLE window Radim Krčmář
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Change PLE window into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.

Brings in a small overhead on every vmentry.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/kvm/vmx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b306f9..18e0e52 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -484,6 +484,9 @@ struct vcpu_vmx {
 
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
+
+	/* Dynamic PLE window. */
+	int ple_window;
 };
 
 enum segment_cache_field {
@@ -4402,7 +4405,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
-		vmcs_write32(PLE_WINDOW, ple_window);
+		vmx->ple_window = ple_window;
 	}
 
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7387,6 +7390,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
+	if (ple_gap)
+		vmcs_write32(PLE_WINDOW, vmx->ple_window);
+
 	if (vmx->nested.sync_shadow_vmcs) {
 		copy_vmcs12_to_shadow(vmx);
 		vmx->nested.sync_shadow_vmcs = false;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 4/7] KVM: VMX: dynamise PLE window
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
                   ` (2 preceding siblings ...)
  2014-08-21 16:08 ` [PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-21 19:10   ` Raghavendra K T
  2014-08-21 16:08 ` [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink Radim Krčmář
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption
going on.
We do this with sched_in() because it does not hold rq lock.

There are two new kernel parameters for changing the window:
 ple_window_grow and ple_window_shrink
ple_window_grow affects the window on PLE exit and ple_window_shrink
does it on sched_in;  depending on their value, the window is modifier
like this: (ple_window is kvm_intel's global)

  ple_window_shrink/ |
  ple_window_grow    | PLE exit           | sched_in
  -------------------+--------------------+---------------------
  < 1                |  = ple_window      |  = ple_window
  < ple_window       | *= ple_window_grow | /= ple_window_shrink
  otherwise          | += ple_window_grow | -= ple_window_shrink

A third new parameter, ple_window_max, controls the maximal ple_window;
it is internally rounded down to a closest multiple of ple_window_grow.

VCPU's PLE window is never allowed below ple_window.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/kvm/vmx.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 18e0e52..a2fa6ba 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO);
  * Time is measured based on a counter that runs at the same rate as the TSC,
  * refer SDM volume 3b section 21.6.13 & 22.1.3.
  */
-#define KVM_VMX_DEFAULT_PLE_GAP    128
-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+#define KVM_VMX_DEFAULT_PLE_GAP           128
+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
+#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
+		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
+/* Default doubles per-vcpu window every exit. */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5679,12 +5697,73 @@ out:
 	return ret;
 }
 
+static int __grow_ple_window(int val)
+{
+	if (ple_window_grow < 1)
+		return ple_window;
+
+	val = min(val, ple_window_actual_max);
+
+	if (ple_window_grow < ple_window)
+		val *= ple_window_grow;
+	else
+		val += ple_window_grow;
+
+	return val;
+}
+
+static int __shrink_ple_window(int val, int modifier, int minimum)
+{
+	if (modifier < 1)
+		return ple_window;
+
+	if (modifier < ple_window)
+		val /= modifier;
+	else
+		val -= modifier;
+
+	return max(val, minimum);
+}
+
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __grow_ple_window(vmx->ple_window);
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
+	                                      ple_window_shrink, ple_window);
+}
+
+/*
+ * ple_window_actual_max is computed to be one grow_ple_window() below
+ * ple_window_max. (See __grow_ple_window for the reason.)
+ * This prevents overflows, because ple_window_max is int.
+ * ple_window_max effectively rounded down to a multiple of ple_window_grow in
+ * this process.
+ * ple_window_max is also prevented from setting vmx->ple_window < ple_window.
+ */
+static void update_ple_window_actual_max(void)
+{
+	ple_window_actual_max =
+			__shrink_ple_window(max(ple_window_max, ple_window),
+			                    ple_window_grow, INT_MIN);
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8854,6 +8933,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
@@ -9077,6 +9158,8 @@ static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
+	update_ple_window_actual_max();
+
 	return 0;
 
 out7:
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
                   ` (3 preceding siblings ...)
  2014-08-21 16:08 ` [PATCH v3 4/7] KVM: VMX: dynamise PLE window Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-25 13:53   ` Sabrina Dubroca
  2014-08-21 16:08 ` [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window Radim Krčmář
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Tracepoint for dynamic PLE window, fired on every potential change.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/kvm/trace.h | 30 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   | 10 ++++++++--
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index e850a7d..1742dfb 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -848,6 +848,36 @@ TRACE_EVENT(kvm_track_tsc,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+TRACE_EVENT(kvm_ple_window,
+	TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
+	TP_ARGS(grow, vcpu_id, new, old),
+
+	TP_STRUCT__entry(
+		__field(                bool,      grow         )
+		__field(        unsigned int,   vcpu_id         )
+		__field(                 int,       new         )
+		__field(                 int,       old         )
+	),
+
+	TP_fast_assign(
+		__entry->grow           = grow;
+		__entry->vcpu_id        = vcpu_id;
+		__entry->new            = new;
+		__entry->old            = old;
+	),
+
+	TP_printk("vcpu %u: ple_window %d (%s %d)",
+	          __entry->vcpu_id,
+	          __entry->new,
+	          __entry->grow ? "grow" : "shrink",
+	          __entry->old)
+);
+
+#define trace_kvm_ple_window_grow(vcpu_id, new, old) \
+	trace_kvm_ple_window(true, vcpu_id, new, old)
+#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
+	trace_kvm_ple_window(false, vcpu_id, new, old)
+
 #endif /* CONFIG_X86_64 */
 
 #endif /* _TRACE_KVM_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a2fa6ba..273cbd5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5728,16 +5728,22 @@ static int __shrink_ple_window(int val, int modifier, int minimum)
 static void grow_ple_window(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int old = vmx->ple_window;
 
-	vmx->ple_window = __grow_ple_window(vmx->ple_window);
+	vmx->ple_window = __grow_ple_window(old);
+
+	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int old = vmx->ple_window;
 
-	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
+	vmx->ple_window = __shrink_ple_window(old,
 	                                      ple_window_shrink, ple_window);
+
+	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5696ee7..814b20c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7648,3 +7648,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
                   ` (4 preceding siblings ...)
  2014-08-21 16:08 ` [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-21 19:17   ` Raghavendra K T
  2014-08-21 16:08 ` [PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS Radim Krčmář
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

ple_window is updated on every vmentry, so there is no reason to have it
read-only anymore.
ple_window* weren't writable to prevent runtime overflow races;
they are prevented by a seqlock.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 273cbd5..1318232 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -132,24 +132,29 @@ module_param(nested, bool, S_IRUGO);
 #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
 		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
 
+static struct kernel_param_ops param_ops_ple_int;
+#define param_check_ple_int(name, p) __param_check(name, p, int)
+
+static DEFINE_SEQLOCK(ple_window_seqlock);
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
-module_param(ple_window, int, S_IRUGO);
+module_param(ple_window, ple_int, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO);
+module_param(ple_window_grow, ple_int, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
-module_param(ple_window_shrink, int, S_IRUGO);
+module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR);
 
 /* Default is to compute the maximum so we can never overflow. */
 static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
-module_param(ple_window_max, int, S_IRUGO);
+module_param(ple_window_max, ple_int, S_IRUGO | S_IWUSR);
 
 extern const ulong vmx_return;
 
@@ -5729,8 +5734,12 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int old = vmx->ple_window;
+	unsigned seq;
 
-	vmx->ple_window = __grow_ple_window(old);
+	do {
+		seq = read_seqbegin(&ple_window_seqlock);
+		vmx->ple_window = __grow_ple_window(old);
+	} while (read_seqretry(&ple_window_seqlock, seq));
 
 	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
 }
@@ -5739,9 +5748,13 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int old = vmx->ple_window;
+	unsigned seq;
 
-	vmx->ple_window = __shrink_ple_window(old,
-	                                      ple_window_shrink, ple_window);
+	do {
+		seq = read_seqbegin(&ple_window_seqlock);
+		vmx->ple_window = __shrink_ple_window(old, ple_window_shrink,
+		                                      ple_window);
+	} while (read_seqretry(&ple_window_seqlock, seq));
 
 	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
@@ -5761,6 +5774,23 @@ static void update_ple_window_actual_max(void)
 			                    ple_window_grow, INT_MIN);
 }
 
+static int param_set_ple_int(const char *arg, const struct kernel_param *kp)
+{
+	int ret;
+
+	write_seqlock(&ple_window_seqlock);
+	ret = param_set_int(arg, kp);
+	update_ple_window_actual_max();
+	write_sequnlock(&ple_window_seqlock);
+
+	return ret;
+}
+
+static struct kernel_param_ops param_ops_ple_int = {
+	.set = param_set_ple_int,
+	.get = param_get_int,
+};
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
@@ -9164,8 +9194,6 @@ static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
-	update_ple_window_actual_max();
-
 	return 0;
 
 out7:
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
                   ` (5 preceding siblings ...)
  2014-08-21 16:08 ` [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window Radim Krčmář
@ 2014-08-21 16:08 ` Radim Krčmář
  2014-08-21 19:18   ` Raghavendra K T
  2014-08-21 16:30 ` [PATCH v3 0/7] Dynamic Pause Loop Exiting window Paolo Bonzini
  2014-08-21 18:40 ` Raghavendra K T
  8 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:08 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

ple_window is preserved in VMCS, so can write it only after a change.
Do this by keeping a dirty bit.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/kvm/vmx.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1318232..f32415b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -510,6 +510,7 @@ struct vcpu_vmx {
 
 	/* Dynamic PLE window. */
 	int ple_window;
+	bool ple_window_dirty;
 };
 
 enum segment_cache_field {
@@ -4426,6 +4427,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
 	}
 
+	vmx->ple_window_dirty = ple_gap;
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmx->ple_window = ple_window;
@@ -5741,6 +5743,9 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 		vmx->ple_window = __grow_ple_window(old);
 	} while (read_seqretry(&ple_window_seqlock, seq));
 
+	if (vmx->ple_window != old)
+		vmx->ple_window_dirty = true;
+
 	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
@@ -5756,6 +5761,9 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 		                                      ple_window);
 	} while (read_seqretry(&ple_window_seqlock, seq));
 
+	if (vmx->ple_window != old)
+		vmx->ple_window_dirty = true;
+
 	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
@@ -7505,8 +7513,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
-	if (ple_gap)
+	if (vmx->ple_window_dirty) {
+		vmx->ple_window_dirty = false;
 		vmcs_write32(PLE_WINDOW, vmx->ple_window);
+	}
 
 	if (vmx->nested.sync_shadow_vmcs) {
 		copy_vmcs12_to_shadow(vmx);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
                   ` (6 preceding siblings ...)
  2014-08-21 16:08 ` [PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS Radim Krčmář
@ 2014-08-21 16:30 ` Paolo Bonzini
  2014-08-21 16:50   ` Radim Krčmář
  2014-08-21 17:03   ` Raghavendra K T
  2014-08-21 18:40 ` Raghavendra K T
  8 siblings, 2 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-08-21 16:30 UTC (permalink / raw)
  To: Radim Krčmář, kvm
  Cc: linux-kernel, Gleb Natapov, Raghavendra KT, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Il 21/08/2014 18:08, Radim Krčmář ha scritto:
> v2 -> v3:
>  * copy&paste frenzy [v3 4/7] (split modify_ple_window)
>  * commented update_ple_window_actual_max [v3 4/7]
>  * renamed shrinker to modifier [v3 4/7]
>  * removed an extraneous max(new, ple_window) [v3 4/7] (should have been in v2)
>  * changed tracepoint argument type, printing and macro abstractions [v3 5/7]
>  * renamed ple_t to ple_int [v3 6/7] (visible in modinfo)
>  * intelligent updates of ple_window [v3 7/7]
> 
> ---
> v1 -> v2:
>  * squashed [v1 4/9] and [v1 5/9] (clamping)
>  * dropped [v1 7/9] (CPP abstractions)
>  * merged core of [v1 9/9] into [v1 4/9] (automatic maximum)
>  * reworked kernel_param_ops: closer to pure int [v2 6/6]
>  * introduced ple_window_actual_max & reworked clamping [v2 4/6]
>  * added seqlock for parameter modifications [v2 6/6]
> 
> ---
> PLE does not scale in its current form.  When increasing VCPU count
> above 150, one can hit soft lockups because of runqueue lock contention.
> (Which says a lot about performance.)
> 
> The main reason is that kvm_ple_loop cycles through all VCPUs.
> Replacing it with a scalable solution would be ideal, but it has already
> been well optimized for various workloads, so this series tries to
> alleviate one different major problem while minimizing a chance of
> regressions: we have too many useless PLE exits.
> 
> Just increasing PLE window would help some cases, but it still spirals
> out of control.  By increasing the window after every PLE exit, we can
> limit the amount of useless ones, so we don't reach the state where CPUs
> spend 99% of the time waiting for a lock.
> 
> HP confirmed that this series prevents soft lockups and TSC sync errors
> on large guests.

Hi,

I'm not sure of the usefulness of patch 6, so I'm going to drop it.
I'll keep it in my local junkyard branch in case it's going to be useful
in some scenario I didn't think of.

Patch 7 can be easily rebased, so no need to repost (and I might even
squash it into patch 3, what do you think?).

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-21 16:30 ` [PATCH v3 0/7] Dynamic Pause Loop Exiting window Paolo Bonzini
@ 2014-08-21 16:50   ` Radim Krčmář
  2014-08-21 16:53     ` Paolo Bonzini
  2014-08-22  4:45     ` Wanpeng Li
  2014-08-21 17:03   ` Raghavendra K T
  1 sibling, 2 replies; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 16:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Gleb Natapov, Raghavendra KT, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

2014-08-21 18:30+0200, Paolo Bonzini:
> Il 21/08/2014 18:08, Radim Krčmář ha scritto:
> I'm not sure of the usefulness of patch 6, so I'm going to drop it.
> I'll keep it in my local junkyard branch in case it's going to be useful
> in some scenario I didn't think of.

I've been using it to benchmark different values, because it is more
convenient than reloading the module after shutting down guests.
(And easier to sell than writing to kernel memory.)

I don't think the additional code is worth it though.

> Patch 7 can be easily rebased, so no need to repost (and I might even
> squash it into patch 3, what do you think?).

Yeah, the core is already a huge patch, so it does look weird without
squashing.  (No-one wants to rebase to that point anyway.)

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-21 16:50   ` Radim Krčmář
@ 2014-08-21 16:53     ` Paolo Bonzini
  2014-08-22  4:45     ` Wanpeng Li
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-08-21 16:53 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Gleb Natapov, Raghavendra KT, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Il 21/08/2014 18:50, Radim Krčmář ha scritto:
> 2014-08-21 18:30+0200, Paolo Bonzini:
>> Il 21/08/2014 18:08, Radim Krčmář ha scritto:
>> I'm not sure of the usefulness of patch 6, so I'm going to drop it.
>> I'll keep it in my local junkyard branch in case it's going to be useful
>> in some scenario I didn't think of.
> 
> I've been using it to benchmark different values, because it is more
> convenient than reloading the module after shutting down guests.
> (And easier to sell than writing to kernel memory.)
> 
> I don't think the additional code is worth it though.
> 
>> Patch 7 can be easily rebased, so no need to repost (and I might even
>> squash it into patch 3, what do you think?).
> 
> Yeah, the core is already a huge patch, so it does look weird without
> squashing.  (No-one wants to rebase to that point anyway.)

Ok, my queue is a bit large so I'll probably not push to git.kernel.org 
until next week but in any case this is what it will look like:

Radim Krčmář (5):
      KVM: add kvm_arch_sched_in
      KVM: x86: introduce sched_in to kvm_x86_ops
      KVM: VMX: make PLE window per-VCPU  <<< ple_window_dirty squashed here
      KVM: VMX: dynamise PLE window
      KVM: trace kvm_ple_window grow/shrink

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-21 16:30 ` [PATCH v3 0/7] Dynamic Pause Loop Exiting window Paolo Bonzini
  2014-08-21 16:50   ` Radim Krčmář
@ 2014-08-21 17:03   ` Raghavendra K T
  1 sibling, 0 replies; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 17:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář, kvm, linux-kernel, Gleb Natapov,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 10:00 PM, Paolo Bonzini wrote:
> Il 21/08/2014 18:08, Radim Krčmář ha scritto:
>> v2 -> v3:
>>   * copy&paste frenzy [v3 4/7] (split modify_ple_window)
>>   * commented update_ple_window_actual_max [v3 4/7]
>>   * renamed shrinker to modifier [v3 4/7]
>>   * removed an extraneous max(new, ple_window) [v3 4/7] (should have been in v2)
>>   * changed tracepoint argument type, printing and macro abstractions [v3 5/7]
>>   * renamed ple_t to ple_int [v3 6/7] (visible in modinfo)
>>   * intelligent updates of ple_window [v3 7/7]
>>
>> ---
>> v1 -> v2:
>>   * squashed [v1 4/9] and [v1 5/9] (clamping)
>>   * dropped [v1 7/9] (CPP abstractions)
>>   * merged core of [v1 9/9] into [v1 4/9] (automatic maximum)
>>   * reworked kernel_param_ops: closer to pure int [v2 6/6]
>>   * introduced ple_window_actual_max & reworked clamping [v2 4/6]
>>   * added seqlock for parameter modifications [v2 6/6]
>>
>> ---
>> PLE does not scale in its current form.  When increasing VCPU count
>> above 150, one can hit soft lockups because of runqueue lock contention.
>> (Which says a lot about performance.)
>>
>> The main reason is that kvm_ple_loop cycles through all VCPUs.
>> Replacing it with a scalable solution would be ideal, but it has already
>> been well optimized for various workloads, so this series tries to
>> alleviate one different major problem while minimizing a chance of
>> regressions: we have too many useless PLE exits.
>>
>> Just increasing PLE window would help some cases, but it still spirals
>> out of control.  By increasing the window after every PLE exit, we can
>> limit the amount of useless ones, so we don't reach the state where CPUs
>> spend 99% of the time waiting for a lock.
>>
>> HP confirmed that this series prevents soft lockups and TSC sync errors
>> on large guests.
>
> Hi,
>
> I'm not sure of the usefulness of patch 6, so I'm going to drop it.
> I'll keep it in my local junkyard branch in case it's going to be useful
> in some scenario I didn't think of.

I think grow knob may be helpful to some extent considering number of 
vcpus can vary from few to hundreds, which in turn helps in fast
convergence of ple_window value in non overcommit scenarios.

I will try to experiment with shrink knob. One argument favouring
shrink knob may be the fact that we rudely reset vmx->ple_window
back to default 4k. Ofcourse danger on the other side is slow 
convergence during overcommit/sudden burst of load.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
                   ` (7 preceding siblings ...)
  2014-08-21 16:30 ` [PATCH v3 0/7] Dynamic Pause Loop Exiting window Paolo Bonzini
@ 2014-08-21 18:40 ` Raghavendra K T
  8 siblings, 0 replies; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 18:40 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> v2 -> v3:
>   * copy&paste frenzy [v3 4/7] (split modify_ple_window)
>   * commented update_ple_window_actual_max [v3 4/7]
>   * renamed shrinker to modifier [v3 4/7]
>   * removed an extraneous max(new, ple_window) [v3 4/7] (should have been in v2)
>   * changed tracepoint argument type, printing and macro abstractions [v3 5/7]
>   * renamed ple_t to ple_int [v3 6/7] (visible in modinfo)
>   * intelligent updates of ple_window [v3 7/7]
>
> ---
> v1 -> v2:
>   * squashed [v1 4/9] and [v1 5/9] (clamping)
>   * dropped [v1 7/9] (CPP abstractions)
>   * merged core of [v1 9/9] into [v1 4/9] (automatic maximum)
>   * reworked kernel_param_ops: closer to pure int [v2 6/6]
>   * introduced ple_window_actual_max & reworked clamping [v2 4/6]
>   * added seqlock for parameter modifications [v2 6/6]
>
> ---

Was able to test, both V1 and V2. and trace showed good behaviour of 
ple_window in undercommit and overcommit.
Considering V3 does not have any change w.r.t functionality except 
intelligent update with dirty field, Please feel free to add

Tested-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

I do have some observations and comments though.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/7] KVM: add kvm_arch_sched_in
  2014-08-21 16:08 ` [PATCH v3 1/7] KVM: add kvm_arch_sched_in Radim Krčmář
@ 2014-08-21 18:49   ` Raghavendra K T
  2014-08-21 20:31     ` Radim Krčmář
  0 siblings, 1 reply; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 18:49 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> Introduce preempt notifiers for architecture specific code.
> Advantage over creating a new notifier in every arch is slightly simpler
> code and guaranteed call order with respect to kvm_sched_in.
>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---

Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

No surprise that ia64 doesn't show here :). and I also would have liked
static inlines (as indicated by Paolo).

>   arch/arm/kvm/arm.c         | 4 ++++
>   arch/mips/kvm/mips.c       | 4 ++++
>   arch/powerpc/kvm/powerpc.c | 4 ++++
>   arch/s390/kvm/kvm-s390.c   | 4 ++++
>   arch/x86/kvm/x86.c         | 4 ++++
>   include/linux/kvm_host.h   | 2 ++
>   virt/kvm/kvm_main.c        | 2 ++
>   7 files changed, 24 insertions(+)
>
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index a99e0cd..9f788eb 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>   {
>   }
>
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +}
> +
>   void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   {
>   	vcpu->cpu = cpu;
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index cd71141..2362df2 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>   {
>   }
>
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +}
> +
>   int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
>   				  struct kvm_translation *tr)
>   {
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 4c79284..cbc432f 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>   	kvmppc_subarch_vcpu_uninit(vcpu);
>   }
>
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +}
> +
>   void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   {
>   #ifdef CONFIG_BOOKE
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index ce81eb2..a3c324e 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>   	/* Nothing todo */
>   }
>
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +}
> +
>   void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   {
>   	save_fp_ctl(&vcpu->arch.host_fpregs.fpc);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8f1e22d..d7c214f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>   		static_key_slow_dec(&kvm_no_apic_vcpu);
>   }
>
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +}
> +
>   int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   {
>   	if (type)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a4c33b3..ebd7236 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -624,6 +624,8 @@ void kvm_arch_exit(void);
>   int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
>   void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
>
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
> +
>   void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
>   void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 33712fb..d3c3ed0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
>   	if (vcpu->preempted)
>   		vcpu->preempted = false;
>
> +	kvm_arch_sched_in(vcpu, cpu);
> +
>   	kvm_arch_vcpu_load(vcpu, cpu);
>   }
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops
  2014-08-21 16:08 ` [PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops Radim Krčmář
@ 2014-08-21 18:50   ` Raghavendra K T
  0 siblings, 0 replies; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 18:50 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> sched_in preempt notifier is available for x86, allow its use in
> specific virtualization technlogies as well.
>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU
  2014-08-21 16:08 ` [PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU Radim Krčmář
@ 2014-08-21 18:52   ` Raghavendra K T
  0 siblings, 0 replies; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 18:52 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> Change PLE window into per-VCPU variable, seeded from module parameter,
> to allow greater flexibility.
>
> Brings in a small overhead on every vmentry.
>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

with intelligent update in patch 7
Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/7] KVM: VMX: dynamise PLE window
  2014-08-21 16:08 ` [PATCH v3 4/7] KVM: VMX: dynamise PLE window Radim Krčmář
@ 2014-08-21 19:10   ` Raghavendra K T
  2014-08-21 20:31     ` Paolo Bonzini
  2014-08-21 20:59     ` Radim Krčmář
  0 siblings, 2 replies; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 19:10 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> Window is increased on every PLE exit and decreased on every sched_in.
> The idea is that we don't want to PLE exit if there is no preemption
> going on.
> We do this with sched_in() because it does not hold rq lock.
>
> There are two new kernel parameters for changing the window:
>   ple_window_grow and ple_window_shrink
> ple_window_grow affects the window on PLE exit and ple_window_shrink
> does it on sched_in;  depending on their value, the window is modifier
> like this: (ple_window is kvm_intel's global)
>
>    ple_window_shrink/ |
>    ple_window_grow    | PLE exit           | sched_in
>    -------------------+--------------------+---------------------
>    < 1                |  = ple_window      |  = ple_window
>    < ple_window       | *= ple_window_grow | /= ple_window_shrink
>    otherwise          | += ple_window_grow | -= ple_window_shrink
>
> A third new parameter, ple_window_max, controls the maximal ple_window;
> it is internally rounded down to a closest multiple of ple_window_grow.
>
> VCPU's PLE window is never allowed below ple_window.
>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>   arch/x86/kvm/vmx.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 85 insertions(+), 2 deletions(-)
>

Thanks for the nice patch.
default of grow = 2 and shrink = 0 is very good, which aids fast
clamping in overcommit and less ple_exits in undercommit.
with a small concern over modifier (shrinker) value in patch 6,

Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 18e0e52..a2fa6ba 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO);
>    * Time is measured based on a counter that runs at the same rate as the TSC,
>    * refer SDM volume 3b section 21.6.13 & 22.1.3.
>    */
> -#define KVM_VMX_DEFAULT_PLE_GAP    128
> -#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> +#define KVM_VMX_DEFAULT_PLE_GAP           128
> +#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
> +#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
> +#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
> +#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
> +		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
> +
>   static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>   module_param(ple_gap, int, S_IRUGO);
>
>   static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>   module_param(ple_window, int, S_IRUGO);
>
> +/* Default doubles per-vcpu window every exit. */
> +static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
> +module_param(ple_window_grow, int, S_IRUGO);
> +
> +/* Default resets per-vcpu window every exit to ple_window. */
> +static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
> +module_param(ple_window_shrink, int, S_IRUGO);
> +
> +/* Default is to compute the maximum so we can never overflow. */
> +static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
> +static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
> +module_param(ple_window_max, int, S_IRUGO);

Some observations:

1. I think a large ple_window for e.g.,  16MB itself
would be very large value after which the ple_window change would not
have much effect.

So could KVM_VMX_DEFAULT_PLE_WINDOW_MAX be just
KVM_VMX_DEFAULT_PLE_WINDOW^2 or something ? or otherwise..

How about ..
define KVM_VMX_DEFAULT_PLE_WINDOW_MAX  = INT_MAX 
/KVM_VMX_DEFAULT_PLE_WINDOW.

(considering we don't allow default grow to be greater than default ple
window).

2. can we consider min_ple_window value of 2k. tracing showed that in
overcommit there were several occations of 4k <- 4k situations.

3. Do you think using ( << and >> ) instead of (*, /) saves some cycles
considering we could have potentially have grow,shrink numbers != power of 2

> +
>   extern const ulong vmx_return;
>
>   #define NR_AUTOLOAD_MSRS 8
> @@ -5679,12 +5697,73 @@ out:
>   	return ret;
>   }
>
> +static int __grow_ple_window(int val)
> +{
> +	if (ple_window_grow < 1)

why not ple_window_grow <= 1 ?


> +		return ple_window;
> +
> +	val = min(val, ple_window_actual_max);
> +
> +	if (ple_window_grow < ple_window)
> +		val *= ple_window_grow;
> +	else
> +		val += ple_window_grow;
> +
> +	return val;
> +}
> +
> +static int __shrink_ple_window(int val, int modifier, int minimum)
> +{
> +	if (modifier < 1)

why not modifier < 1. May be this would address a concern in patch 6.

> +		return ple_window;
> +
> +	if (modifier < ple_window)
> +		val /= modifier;
> +	else
> +		val -= modifier;
> +
> +	return max(val, minimum);
> +}
> +
> +static void grow_ple_window(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->ple_window = __grow_ple_window(vmx->ple_window);
> +}
> +
> +static void shrink_ple_window(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
> +	                                      ple_window_shrink, ple_window);
> +}
> +
> +/*
> + * ple_window_actual_max is computed to be one grow_ple_window() below
> + * ple_window_max. (See __grow_ple_window for the reason.)
> + * This prevents overflows, because ple_window_max is int.
> + * ple_window_max effectively rounded down to a multiple of ple_window_grow in
> + * this process.
> + * ple_window_max is also prevented from setting vmx->ple_window < ple_window.
> + */
> +static void update_ple_window_actual_max(void)
> +{
> +	ple_window_actual_max =
> +			__shrink_ple_window(max(ple_window_max, ple_window),
> +			                    ple_window_grow, INT_MIN);
> +}
> +
>   /*
>    * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
>    * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
>    */
>   static int handle_pause(struct kvm_vcpu *vcpu)
>   {
> +	if (ple_gap)
> +		grow_ple_window(vcpu);
> +
>   	skip_emulated_instruction(vcpu);
>   	kvm_vcpu_on_spin(vcpu);
>
> @@ -8854,6 +8933,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
>
>   void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
>   {
> +	if (ple_gap)
> +		shrink_ple_window(vcpu);
>   }
>
>   static struct kvm_x86_ops vmx_x86_ops = {
> @@ -9077,6 +9158,8 @@ static int __init vmx_init(void)
>   	} else
>   		kvm_disable_tdp();
>
> +	update_ple_window_actual_max();
> +
>   	return 0;
>
>   out7:
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
  2014-08-21 16:08 ` [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window Radim Krčmář
@ 2014-08-21 19:17   ` Raghavendra K T
  2014-08-21 21:03     ` Radim Krčmář
  0 siblings, 1 reply; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 19:17 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> ple_window is updated on every vmentry, so there is no reason to have it
> read-only anymore.
> ple_window* weren't writable to prevent runtime overflow races;
> they are prevented by a seqlock.
>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>   arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 37 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 273cbd5..1318232 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -132,24 +132,29 @@ module_param(nested, bool, S_IRUGO);
>   #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
>   		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
>
> +static struct kernel_param_ops param_ops_ple_int;
> +#define param_check_ple_int(name, p) __param_check(name, p, int)
> +
> +static DEFINE_SEQLOCK(ple_window_seqlock);
> +
>   static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>   module_param(ple_gap, int, S_IRUGO);
>
>   static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
> -module_param(ple_window, int, S_IRUGO);
> +module_param(ple_window, ple_int, S_IRUGO | S_IWUSR);
>
>   /* Default doubles per-vcpu window every exit. */
>   static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
> -module_param(ple_window_grow, int, S_IRUGO);
> +module_param(ple_window_grow, ple_int, S_IRUGO | S_IWUSR);
>
>   /* Default resets per-vcpu window every exit to ple_window. */
>   static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
> -module_param(ple_window_shrink, int, S_IRUGO);
> +module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR);
>
>   /* Default is to compute the maximum so we can never overflow. */
>   static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
>   static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
> -module_param(ple_window_max, int, S_IRUGO);
> +module_param(ple_window_max, ple_int, S_IRUGO | S_IWUSR);
>
>

Positive thing about able to change default grow/shrink value is the
flexibility of tuning ple window to different workloads and different
number of cpus.

But is it that a value of shrink = 1 and grow > 1 is problematic ?
(running a undercommit workload and then overcommit workload)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS
  2014-08-21 16:08 ` [PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS Radim Krčmář
@ 2014-08-21 19:18   ` Raghavendra K T
  0 siblings, 0 replies; 29+ messages in thread
From: Raghavendra K T @ 2014-08-21 19:18 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> ple_window is preserved in VMCS, so can write it only after a change.
> Do this by keeping a dirty bit.
>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/7] KVM: add kvm_arch_sched_in
  2014-08-21 18:49   ` Raghavendra K T
@ 2014-08-21 20:31     ` Radim Krčmář
  2014-08-21 21:12       ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 20:31 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

2014-08-22 00:19+0530, Raghavendra K T:
> On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> >Introduce preempt notifiers for architecture specific code.
> >Advantage over creating a new notifier in every arch is slightly simpler
> >code and guaranteed call order with respect to kvm_sched_in.
> >
> >Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> >---
> 
> Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>
> 
> No surprise that ia64 doesn't show here :).

Oh, totally forgot about its predicted comeback ...
Paolo, do you want to fixup this?  (It removes double newline.)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 0729ba6..1630624 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1468,6 +1468,9 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	kfree(vcpu->arch.apic);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu)
+{
+}
 
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)

>                                             and I also would have liked
> static inlines (as indicated by Paolo).

I'll send a patch that converts empty functions into static inlines,
maybe return-0s too, soon(ish).

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/7] KVM: VMX: dynamise PLE window
  2014-08-21 19:10   ` Raghavendra K T
@ 2014-08-21 20:31     ` Paolo Bonzini
  2014-08-21 20:59     ` Radim Krčmář
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-08-21 20:31 UTC (permalink / raw)
  To: Raghavendra K T, Radim Krčmář
  Cc: kvm, linux-kernel, Gleb Natapov, Vinod Chegu, Hui-Zhi Zhao,
	Christian Borntraeger, Lisa Mitchell

Il 21/08/2014 21:10, Raghavendra K T ha scritto:
> 2. can we consider min_ple_window value of 2k. tracing showed that in
> overcommit there were several occations of 4k <- 4k situations.

Yes, that could be good now that we have adaptive exits.  Perhaps even 1k.

> 3. Do you think using ( << and >> ) instead of (*, /) saves some cycles
> considering we could have potentially have grow,shrink numbers != power
> of 2

* is very fast. / a bit less so but it never occurs with the defaults.
So I think it's okay.  I'm not sure why one would use a multiplier
bigger than 2.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/7] KVM: VMX: dynamise PLE window
  2014-08-21 19:10   ` Raghavendra K T
  2014-08-21 20:31     ` Paolo Bonzini
@ 2014-08-21 20:59     ` Radim Krčmář
  1 sibling, 0 replies; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 20:59 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

2014-08-22 00:40+0530, Raghavendra K T:
> On 08/21/2014 09:38 PM, Radim Krčmář wrote:
> Thanks for the nice patch.
> default of grow = 2 and shrink = 0 is very good, which aids fast
> clamping in overcommit and less ple_exits in undercommit.
> with a small concern over modifier (shrinker) value in patch 6,
> 
> Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

Thank you for the review and testing.

> >-#define KVM_VMX_DEFAULT_PLE_GAP    128
> >-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> >+#define KVM_VMX_DEFAULT_PLE_GAP           128
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
> >+		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
> 
> Some observations:
> 
> 1. I think a large ple_window for e.g.,  16MB itself
> would be very large value after which the ple_window change would not
> have much effect.

Agreed, 16M is around 4ms@4GHz, holding a spinlock for that long is
closer to a bug.

> So could KVM_VMX_DEFAULT_PLE_WINDOW_MAX be just
> KVM_VMX_DEFAULT_PLE_WINDOW^2 or something ? or otherwise..

I'd be perfectly content with a high default maximum like this, yet
there isn't much point in having that default at all if we can't reach
it in practice, so with the current default, we are at least ready for
THz+ x86 intels :)

I though about deafaulting it to some guessed fraction of the scheduler
tick, but then, I wanted to merge at least something :)

> How about ..
> define KVM_VMX_DEFAULT_PLE_WINDOW_MAX  = INT_MAX
> /KVM_VMX_DEFAULT_PLE_WINDOW.

= 524288 (2^19), too low IMO,
no-overcommit scenarios seem to go higher quite often.

> (considering we don't allow default grow to be greater than default ple
> window).
> 
> 2. can we consider min_ple_window value of 2k. tracing showed that in
> overcommit there were several occations of 4k <- 4k situations.

Low values are more affected by KVM's overhead on every PLE exit, but
benchmarks really decide this ...

Having a separate ple_window_min would allow more tuning, so I can do
that if there are benefits of lower windows.
On the other hand, I thought that increasing the default window would be
a good default, which would make separate minimum even better.

> 3. Do you think using ( << and >> ) instead of (*, /) saves some cycles
> considering we could have potentially have grow,shrink numbers != power of 2

* takes one or two cycles more than <<, so I wouldn't replace it,
because it limits available grows a lot.
(I don't expect we would set values above 5.)

/ is slow (around ten times slower than *), which the reason why we
avoid it by default, but I still prefer more options.

> >+static int __grow_ple_window(int val)
> >+{
> >+	if (ple_window_grow < 1)
> 
> why not ple_window_grow <= 1 ?

Emergent mini-feature: have different static levels of ple_window, in
combination with dynamic knobs.
(set grow and shrink to 1, choose ple_window before starting a guest)

And because you have to willingly set 1, I'm not aware of any advantage
of '<= 1'.

> >+static int __shrink_ple_window(int val, int modifier, int minimum)
> >+{
> >+	if (modifier < 1)
> 
> why not modifier < 1. May be this would address a concern in patch 6.

"/= 1" gives no shrinking, which I considered as a feature -- you can
easily search for the maximal achievable ple_window that way.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
  2014-08-21 19:17   ` Raghavendra K T
@ 2014-08-21 21:03     ` Radim Krčmář
  0 siblings, 0 replies; 29+ messages in thread
From: Radim Krčmář @ 2014-08-21 21:03 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

2014-08-22 00:47+0530, Raghavendra K T:
> Positive thing about able to change default grow/shrink value is the
> flexibility of tuning ple window to different workloads and different
> number of cpus.
> 
> But is it that a value of shrink = 1 and grow > 1 is problematic ?
> (running a undercommit workload and then overcommit workload)

Yeah, could have mentioned it in commit message, but I hope that my
previous email explained it.  (1 = don't shrink, design decision :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/7] KVM: add kvm_arch_sched_in
  2014-08-21 20:31     ` Radim Krčmář
@ 2014-08-21 21:12       ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-08-21 21:12 UTC (permalink / raw)
  To: Radim Krčmář, Raghavendra K T
  Cc: kvm, linux-kernel, Gleb Natapov, Vinod Chegu, Hui-Zhi Zhao,
	Christian Borntraeger, Lisa Mitchell

Il 21/08/2014 22:31, Radim Krčmář ha scritto:
>> > No surprise that ia64 doesn't show here :).
> Oh, totally forgot about its predicted comeback ...
> Paolo, do you want to fixup this?  (It removes double newline.)

Bah, me too.  I'll get to killing ia64 one of these days.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-21 16:50   ` Radim Krčmář
  2014-08-21 16:53     ` Paolo Bonzini
@ 2014-08-22  4:45     ` Wanpeng Li
  2014-08-25 15:11       ` Radim Krčmář
  1 sibling, 1 reply; 29+ messages in thread
From: Wanpeng Li @ 2014-08-22  4:45 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Paolo Bonzini, kvm, linux-kernel, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Hi Radim,
On Thu, Aug 21, 2014 at 06:50:03PM +0200, Radim Krčmář wrote:
>2014-08-21 18:30+0200, Paolo Bonzini:
>> Il 21/08/2014 18:08, Radim Krčmář ha scritto:
>> I'm not sure of the usefulness of patch 6, so I'm going to drop it.
>> I'll keep it in my local junkyard branch in case it's going to be useful
>> in some scenario I didn't think of.
>
>I've been using it to benchmark different values, because it is more

Is there any benchmark data for this patchset?

Regards,
Wanpeng Li 

>convenient than reloading the module after shutting down guests.
>(And easier to sell than writing to kernel memory.)
>
>I don't think the additional code is worth it though.
>
>> Patch 7 can be easily rebased, so no need to repost (and I might even
>> squash it into patch 3, what do you think?).
>
>Yeah, the core is already a huge patch, so it does look weird without
>squashing.  (No-one wants to rebase to that point anyway.)
>
>Thanks.
>--
>To unsubscribe from this list: send the line "unsubscribe kvm" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
  2014-08-21 16:08 ` [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink Radim Krčmář
@ 2014-08-25 13:53   ` Sabrina Dubroca
  2014-08-25 14:32     ` Radim Krčmář
  0 siblings, 1 reply; 29+ messages in thread
From: Sabrina Dubroca @ 2014-08-25 13:53 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Hello,

2014-08-21, 18:08:09 +0200, Radim Krčmář wrote:
> Tracepoint for dynamic PLE window, fired on every potential change.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  arch/x86/kvm/trace.h | 30 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx.c   | 10 ++++++++--
>  arch/x86/kvm/x86.c   |  1 +
>  3 files changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index e850a7d..1742dfb 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -848,6 +848,36 @@ TRACE_EVENT(kvm_track_tsc,
>  		  __print_symbolic(__entry->host_clock, host_clocks))
>  );
>  
> +TRACE_EVENT(kvm_ple_window,
> +	TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
> +	TP_ARGS(grow, vcpu_id, new, old),
> +
> +	TP_STRUCT__entry(
> +		__field(                bool,      grow         )
> +		__field(        unsigned int,   vcpu_id         )
> +		__field(                 int,       new         )
> +		__field(                 int,       old         )
> +	),
> +
> +	TP_fast_assign(
> +		__entry->grow           = grow;
> +		__entry->vcpu_id        = vcpu_id;
> +		__entry->new            = new;
> +		__entry->old            = old;
> +	),
> +
> +	TP_printk("vcpu %u: ple_window %d (%s %d)",
> +	          __entry->vcpu_id,
> +	          __entry->new,
> +	          __entry->grow ? "grow" : "shrink",
> +	          __entry->old)
> +);
> +
> +#define trace_kvm_ple_window_grow(vcpu_id, new, old) \
> +	trace_kvm_ple_window(true, vcpu_id, new, old)
> +#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
> +	trace_kvm_ple_window(false, vcpu_id, new, old)
> +
>  #endif /* CONFIG_X86_64 */

Looks like these are needed on 32-bit as well.
Today's linux-next doesn't build:

  CC [M]  arch/x86/kvm/x86.o
In file included from include/linux/linkage.h:6:0,
                 from include/linux/preempt.h:9,
                 from include/linux/preempt_mask.h:4,
                 from include/linux/hardirq.h:4,
                 from include/linux/kvm_host.h:10,
                 from arch/x86/kvm/x86.c:22:
include/linux/tracepoint.h:214:20: error: ‘__tracepoint_kvm_ple_window’ undeclared here (not in a function)
  EXPORT_SYMBOL_GPL(__tracepoint_##name)
                    ^
include/linux/export.h:57:16: note: in definition of macro ‘__EXPORT_SYMBOL’
  extern typeof(sym) sym;     \
                ^
include/linux/tracepoint.h:214:2: note: in expansion of macro ‘EXPORT_SYMBOL_GPL’
  EXPORT_SYMBOL_GPL(__tracepoint_##name)
  ^
arch/x86/kvm/x86.c:7676:1: note: in expansion of macro ‘EXPORT_TRACEPOINT_SYMBOL_GPL’
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
 ^


and if I comment out the EXPORT_TRACEPOINT_SYMBOL_GPL:

arch/x86/kvm/vmx.c: In function ‘grow_ple_window’:
arch/x86/kvm/vmx.c:5742:2: error: implicit declaration of function ‘trace_kvm_ple_window_grow’ [-Werror=implicit-function-declaration]
  trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
  ^
arch/x86/kvm/vmx.c: In function ‘shrink_ple_window’:
arch/x86/kvm/vmx.c:5756:2: error: implicit declaration of function ‘trace_kvm_ple_window_shrink’ [-Werror=implicit-function-declaration]
  trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
  ^
cc1: some warnings being treated as errors
make[2]: *** [arch/x86/kvm/vmx.o] Error 1



I moved the line

#endif /* CONFIG_X86_64 */

above

TRACE_EVENT(kvm_ple_window,

and it builds.


Thanks,

-- 
Sabrina

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
  2014-08-25 13:53   ` Sabrina Dubroca
@ 2014-08-25 14:32     ` Radim Krčmář
  2014-08-25 14:44       ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Radim Krčmář @ 2014-08-25 14:32 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: kvm, linux-kernel, Paolo Bonzini, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

2014-08-25 15:53+0200, Sabrina Dubroca:
> Hello,
> 
> 2014-08-21, 18:08:09 +0200, Radim Krčmář wrote:
> > Tracepoint for dynamic PLE window, fired on every potential change.
> > +#define trace_kvm_ple_window_grow(vcpu_id, new, old) \
> > +	trace_kvm_ple_window(true, vcpu_id, new, old)
> > +#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
> > +	trace_kvm_ple_window(false, vcpu_id, new, old)
> > +
> >  #endif /* CONFIG_X86_64 */
> 
> Looks like these are needed on 32-bit as well.
> Today's linux-next doesn't build:
> 
> [...]
> 
> I moved the line
> 
> #endif /* CONFIG_X86_64 */
> 
> above
> 
> TRACE_EVENT(kvm_ple_window,
> 
> and it builds.

Thanks!

Paolo, can you still fix this "just" by rebasing?

---
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 1742dfb..4c2868f 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -848,6 +848,8 @@ TRACE_EVENT(kvm_track_tsc,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+#endif /* CONFIG_X86_64 */
+
 TRACE_EVENT(kvm_ple_window,
 	TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
 	TP_ARGS(grow, vcpu_id, new, old),
@@ -878,8 +880,6 @@ TRACE_EVENT(kvm_ple_window,
 #define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
 	trace_kvm_ple_window(false, vcpu_id, new, old)
 
-#endif /* CONFIG_X86_64 */
-
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
  2014-08-25 14:32     ` Radim Krčmář
@ 2014-08-25 14:44       ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-08-25 14:44 UTC (permalink / raw)
  To: Radim Krčmář, Sabrina Dubroca
  Cc: kvm, linux-kernel, Gleb Natapov, Raghavendra KT, Vinod Chegu,
	Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

Il 25/08/2014 16:32, Radim Krčmář ha scritto:
> Paolo, can you still fix this "just" by rebasing?

Maybe I could, but I just pushed the fix to kvm/next as a separate commit.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
  2014-08-22  4:45     ` Wanpeng Li
@ 2014-08-25 15:11       ` Radim Krčmář
  0 siblings, 0 replies; 29+ messages in thread
From: Radim Krčmář @ 2014-08-25 15:11 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, kvm, linux-kernel, Gleb Natapov, Raghavendra KT,
	Vinod Chegu, Hui-Zhi Zhao, Christian Borntraeger, Lisa Mitchell

2014-08-22 12:45+0800, Wanpeng Li:
> Hi Radim,
> On Thu, Aug 21, 2014 at 06:50:03PM +0200, Radim Krčmář wrote:
> >2014-08-21 18:30+0200, Paolo Bonzini:
> >> Il 21/08/2014 18:08, Radim Krčmář ha scritto:
> >> I'm not sure of the usefulness of patch 6, so I'm going to drop it.
> >> I'll keep it in my local junkyard branch in case it's going to be useful
> >> in some scenario I didn't think of.
> >
> >I've been using it to benchmark different values, because it is more
> 
> Is there any benchmark data for this patchset?

Sorry, I already returned the testing machine and it wasn't quality
benchmarking, so I haven't kept the results ...

I used ebizzy and dbench, because ebizzy had large difference between
PLE on/off and dbench minimal (without overcommit), so one was looking
for improvements while the other was checking regressions.
(And they are easy to set up.)

From what I remember, this patch had roughly 5x better performance with
ebizzy on 60 VCPU guests and no obvious difference for dbench.
(And improvement under overcommit was visible for both.)

There was a significant reduction in %sys, which never raised much above
30%, as oposed to original 90%+.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-08-25 15:11 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-21 16:08 [PATCH v3 0/7] Dynamic Pause Loop Exiting window Radim Krčmář
2014-08-21 16:08 ` [PATCH v3 1/7] KVM: add kvm_arch_sched_in Radim Krčmář
2014-08-21 18:49   ` Raghavendra K T
2014-08-21 20:31     ` Radim Krčmář
2014-08-21 21:12       ` Paolo Bonzini
2014-08-21 16:08 ` [PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops Radim Krčmář
2014-08-21 18:50   ` Raghavendra K T
2014-08-21 16:08 ` [PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU Radim Krčmář
2014-08-21 18:52   ` Raghavendra K T
2014-08-21 16:08 ` [PATCH v3 4/7] KVM: VMX: dynamise PLE window Radim Krčmář
2014-08-21 19:10   ` Raghavendra K T
2014-08-21 20:31     ` Paolo Bonzini
2014-08-21 20:59     ` Radim Krčmář
2014-08-21 16:08 ` [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink Radim Krčmář
2014-08-25 13:53   ` Sabrina Dubroca
2014-08-25 14:32     ` Radim Krčmář
2014-08-25 14:44       ` Paolo Bonzini
2014-08-21 16:08 ` [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window Radim Krčmář
2014-08-21 19:17   ` Raghavendra K T
2014-08-21 21:03     ` Radim Krčmář
2014-08-21 16:08 ` [PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS Radim Krčmář
2014-08-21 19:18   ` Raghavendra K T
2014-08-21 16:30 ` [PATCH v3 0/7] Dynamic Pause Loop Exiting window Paolo Bonzini
2014-08-21 16:50   ` Radim Krčmář
2014-08-21 16:53     ` Paolo Bonzini
2014-08-22  4:45     ` Wanpeng Li
2014-08-25 15:11       ` Radim Krčmář
2014-08-21 17:03   ` Raghavendra K T
2014-08-21 18:40 ` Raghavendra K T

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox