* [PATCH v4 0/8] KVM paravirt remote flush tlb
@ 2012-08-21 11:25 Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support Nikunj A. Dadhania
` (9 more replies)
0 siblings, 10 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:25 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
The remote-flush APIs busy-wait, which is fine on bare metal. Within a
guest, however, the target vcpus may have been preempted or blocked; in
that case the initiating vcpu can end up busy-waiting for a long time.
This was discovered in our gang-scheduling tests. One way to solve it
is to paravirtualize flush_tlb_others_ipi (which now shows up as
smp_call_function_many after Alex Shi's TLB optimization).
This patch set implements paravirt TLB flush, making sure the initiator
does not wait for vcpus that are sleeping; the sleeping vcpus instead
flush their TLB on guest entry. The idea was discussed here:
https://lkml.org/lkml/2012/2/20/157
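The scheme can be sketched with a toy userspace model (the state bits,
names, and mask handling here are illustrative, not the series' exact
guest code): the flush initiator atomically flags preempted vcpus
instead of IPIing them.

```c
#include <stdint.h>

/* Illustrative per-vCPU state bits, mirroring the series' naming. */
#define IN_GUEST_MODE	(1ULL << 0)
#define SHOULD_FLUSH	(1ULL << 1)

#define NR_VCPUS 4

uint64_t vcpu_state[NR_VCPUS];

/*
 * For each target vCPU: if it is not in guest mode, atomically set
 * SHOULD_FLUSH (so the hypervisor flushes on the next guest entry)
 * and drop it from the IPI mask.  Returns the mask of vCPUs that
 * still need a flush IPI right now.
 */
unsigned int pv_flush_targets(unsigned int mask)
{
	unsigned int ipi_mask = mask;
	int cpu;

	for (cpu = 0; cpu < NR_VCPUS; cpu++) {
		uint64_t state;

		if (!(mask & (1u << cpu)))
			continue;
		state = vcpu_state[cpu];
		if (!(state & IN_GUEST_MODE) &&
		    __sync_val_compare_and_swap(&vcpu_state[cpu], state,
					state | SHOULD_FLUSH) == state)
			ipi_mask &= ~(1u << cpu);
	}
	return ipi_mask;
}
```

If the cmpxchg races with a vcpu entering guest mode, the vcpu simply
stays in the IPI mask and gets flushed the ordinary way.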
This also adds a dependency for the lockless page walk performed by
get_user_pages_fast (gup_fast). gup_fast disables interrupts and
assumes that pages will not be freed during that window. That was fine
as long as flush_tlb_others_ipi waited for all IPIs to be processed
before returning. With the new approach of not waiting for sleeping
vcpus, that assumption no longer holds, so HAVE_RCU_TABLE_FREE is used
to free the pages. This makes sure all CPUs have at least processed
the smp callback before the pages are freed.
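A minimal userspace sketch of the deferred freeing (hypothetical names;
the "grace period" here is invoked directly, standing in for
call_rcu_sched): page-table pages are batched and only freed once every
CPU is guaranteed to have passed a scheduling point.

```c
#include <stdlib.h>

#define MAX_TABLES 8

/* Toy mmu_table_batch: queue page-table pages instead of freeing them
 * directly, then free the whole batch after an RCU-sched grace period. */
struct table_batch {
	void *tables[MAX_TABLES];
	int nr;
};

int tables_freed;	/* instrumentation for the example */

/* Stand-in for call_rcu_sched(): a real grace period ends only after
 * every CPU has scheduled, which implies pending IPIs were handled. */
void rcu_sched_free_batch(struct table_batch *b)
{
	int i;

	for (i = 0; i < b->nr; i++) {
		free(b->tables[i]);
		tables_freed++;
	}
	b->nr = 0;
}

/* gup_fast() runs with IRQs off; as long as freeing is deferred past a
 * grace period, the tables it walks cannot vanish under it. */
void tlb_remove_table(struct table_batch *b, void *table)
{
	b->tables[b->nr++] = table;	/* defer, never free inline */
	if (b->nr == MAX_TABLES)
		rcu_sched_free_batch(b);
}
```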
Changelog from v3:
• Add helper for cleaning up vcpu_state information (Marcelo)
• Fix code for checking vs_page and leaking page refs (Marcelo)
Changelog from v2:
• Rebase to the 3.5-based Linus tree (commit f7da9cd)
• Port PV-Flush to new TLB-Optimization code by Alex Shi
• Use pinned pages to avoid overhead during guest enter/exit (Marcelo)
• Remove kick, as this is not improving much
• Use bit fields in the state(flush_on_enter and vcpu_running) flag to
avoid smp barriers (Marcelo)
Changelog from v1:
• Race fixes reported by Vatsa
• Address gup_fast dependency using PeterZ's rcu table free patch
• Fix rcu_table_free for hw pagetable walkers
Here are the results from PLE hardware. Setup details:
• 32 CPUs (HT disabled)
• 64-bit VM
• 32vcpus
• 8GB RAM
base = 3.6-rc1 + ple handler optimization patch
pvflushv4 = 3.6-rc1 + ple handler optimization patch + pvflushv4 patch
kernbench (lower is better)
===========================
        base        pvflushv4   %improvement
1VM     48.5800     46.8513     3.55846
2VM     108.1823    104.6410    3.27346
3VM     183.2733    163.3547    10.86825
ebizzy (higher is better)
=========================
        base        pvflushv4   %improvement
1VM     2414.5000   2089.8750   -13.44481
2VM     2167.6250   2371.7500   9.41699
3VM     1600.1111   2102.5556   31.40060
Thanks Raghu for running the tests.
[1] http://article.gmane.org/gmane.linux.kernel/1329752
---
Nikunj A. Dadhania (6):
KVM Guest: Add VCPU running/pre-empted state for guest
KVM-HV: Add VCPU running/pre-empted state for guest
KVM Guest: Add paravirt kvm_flush_tlb_others
KVM-HV: Add flush_on_enter before guest enter
Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled
KVM-doc: Add paravirt tlb flush document
Peter Zijlstra (2):
mm, x86: Add HAVE_RCU_TABLE_FREE support
mm: Add missing TLB invalidate to RCU page-table freeing
Documentation/virtual/kvm/msr.txt | 4 +
Documentation/virtual/kvm/paravirt-tlb-flush.txt | 53 ++++++++++++++
arch/Kconfig | 3 +
arch/powerpc/Kconfig | 1
arch/sparc/Kconfig | 1
arch/x86/Kconfig | 11 +++
arch/x86/include/asm/kvm_host.h | 7 ++
arch/x86/include/asm/kvm_para.h | 13 +++
arch/x86/include/asm/tlb.h | 1
arch/x86/include/asm/tlbflush.h | 11 +++
arch/x86/kernel/kvm.c | 38 ++++++++++
arch/x86/kvm/cpuid.c | 1
arch/x86/kvm/x86.c | 84 +++++++++++++++++++++-
arch/x86/mm/pgtable.c | 6 +-
arch/x86/mm/tlb.c | 36 +++++++++
include/asm-generic/tlb.h | 9 ++
mm/memory.c | 43 ++++++++++-
17 files changed, 311 insertions(+), 11 deletions(-)
create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
@ 2012-08-21 11:26 ` Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 2/8] mm: Add missing TLB invalidate to RCU page-table freeing Nikunj A. Dadhania
` (8 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:26 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Implements optional HAVE_RCU_TABLE_FREE support for x86.
This is useful for things like Xen and KVM where a paravirt TLB flush
means software page-table walkers like GUP-fast cannot rely on IRQ
disabling the way regular x86 can.
Not for inclusion - is part of PeterZ's "Unify TLB gather implementations"
http://mid.gmane.org/20120627211540.459910855@chello.nl
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-r106wg6t7crxxhva55jnacrj@git.kernel.org
---
arch/x86/include/asm/tlb.h | 1 +
arch/x86/mm/pgtable.c | 6 +++---
include/asm-generic/tlb.h | 9 +++++++++
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 4fef207..f5489f0 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -1,6 +1,7 @@
#ifndef _ASM_X86_TLB_H
#define _ASM_X86_TLB_H
+#define __tlb_remove_table(table) free_page_and_swap_cache(table)
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
- tlb_remove_page(tlb, pte);
+ tlb_remove_table(tlb, pte);
}
#if PAGETABLE_LEVELS > 2
void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
{
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
- tlb_remove_page(tlb, virt_to_page(pmd));
+ tlb_remove_table(tlb, virt_to_page(pmd));
}
#if PAGETABLE_LEVELS > 3
void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
{
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
- tlb_remove_page(tlb, virt_to_page(pud));
+ tlb_remove_table(tlb, virt_to_page(pud));
}
#endif /* PAGETABLE_LEVELS > 3 */
#endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ed6642a..d382b22 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
#ifdef CONFIG_HAVE_RCU_TABLE_FREE
/*
* Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
extern void tlb_table_flush(struct mmu_gather *tlb);
extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+ tlb_remove_page(tlb, table);
+}
+
#endif
/*
* [PATCH v4 2/8] mm: Add missing TLB invalidate to RCU page-table freeing
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support Nikunj A. Dadhania
@ 2012-08-21 11:26 ` Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
` (7 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:26 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
For normal systems we need a TLB invalidate before freeing the
page-tables; the generic RCU-based page-table freeing code lacked
this.
This is because this code originally came from ppc where the hardware
never walks the linux page-tables and thus this invalidate is not
required.
Others, notably s390 which ran into this problem in cd94154cc6a
("[S390] fix tlb flushing for page table pages"), do very much need
this TLB invalidation.
Therefore add it, with a Kconfig option to disable it so as not to
unduly slow down PPC and SPARC64, neither of which needs it.
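The ordering the patch enforces can be modelled in a few lines
(illustrative structures and counters, not the kernel's exact code):
architectures whose hardware walks the Linux page-tables get a TLB
invalidate before the batch is handed to RCU, while STRICT_TLB_FILL
archs compile the invalidate away.

```c
struct mmu_gather {
	int pending_tables;	/* queued page-table pages */
};

int tlb_flushes;	/* instrumentation for the example */

static void tlb_flush_mmu(struct mmu_gather *tlb)
{
	tlb_flushes++;		/* models the hardware TLB invalidate */
}

#ifdef CONFIG_STRICT_TLB_FILL
/* ppc/sparc64: hardware never walks the Linux page-tables, so tearing
 * down the table pages themselves needs no TLB invalidate. */
static void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
#else
static void tlb_table_flush_mmu(struct mmu_gather *tlb)
{
	tlb_flush_mmu(tlb);
}
#endif

void tlb_table_flush(struct mmu_gather *tlb)
{
	if (tlb->pending_tables) {
		/* invalidate first, only then let RCU free the pages */
		tlb_table_flush_mmu(tlb);
		tlb->pending_tables = 0;	/* batch handed to RCU here */
	}
}
```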
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-z32nke0csqopykthsk1zjg8f@git.kernel.org
Not for inclusion - is part of PeterZ's "Unify TLB gather implementations"
http://mid.gmane.org/20120627211540.459910855@chello.nl
[Fix to check *batch is not NULL]
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
arch/Kconfig | 3 +++
arch/powerpc/Kconfig | 1 +
arch/sparc/Kconfig | 1 +
mm/memory.c | 43 +++++++++++++++++++++++++++++++++++++------
4 files changed, 42 insertions(+), 6 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 8c3d957..fec1c9b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -231,6 +231,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
config HAVE_RCU_TABLE_FREE
bool
+config STRICT_TLB_FILL
+ bool
+
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9a5d3cd..fb70260 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
select GENERIC_IRQ_SHOW_LEVEL
select IRQ_FORCED_THREADING
select HAVE_RCU_TABLE_FREE if SMP
+ select STRICT_TLB_FILL
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_BPF_JIT if PPC64
select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e74ff13..126e500 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -52,6 +52,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+ select STRICT_TLB_FILL
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_SYSCALL_WRAPPERS
diff --git a/mm/memory.c b/mm/memory.c
index 91f6945..2ef9ce1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -332,12 +332,47 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
free_page((unsigned long)batch);
}
+#ifdef CONFIG_STRICT_TLB_FILL
+/*
+ * Some architectures (sparc64, ppc) cannot refill TLBs after they've removed
+ * the PTE entries from their hash-table. Their hardware never looks at the
+ * linux page-table structures, so they don't need a hardware TLB invalidate
+ * when tearing down the page-table structure itself.
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
+
+/*
+ * When there's less than two users of this mm there cannot be
+ * a concurrent page-table walk.
+ */
+static inline bool tlb_table_fast(struct mmu_gather *tlb)
+{
+ return atomic_read(&tlb->mm->mm_users) < 2;
+}
+#else
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+ tlb_flush_mmu(tlb);
+}
+
+/*
+ * Even if there's only a single user, speculative TLB loads can
+ * wreck stuff.
+ */
+static inline bool tlb_table_fast(struct mmu_gather *tlb)
+{
+ return false;
+}
+#endif /* CONFIG_STRICT_TLB_FILL */
+
void tlb_table_flush(struct mmu_gather *tlb)
{
struct mmu_table_batch **batch = &tlb->batch;
if (*batch) {
- call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+ tlb_table_flush_mmu(tlb);
+ if (*batch)
+ call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
}
@@ -348,11 +383,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
tlb->need_flush = 1;
- /*
- * When there's less then two users of this mm there cannot be a
- * concurrent page-table walk.
- */
- if (atomic_read(&tlb->mm->mm_users) < 2) {
+ if (tlb_table_fast(tlb)) {
__tlb_remove_table(table);
return;
}
* [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 2/8] mm: Add missing TLB invalidate to RCU page-table freeing Nikunj A. Dadhania
@ 2012-08-21 11:26 ` Nikunj A. Dadhania
2012-08-23 9:36 ` Marcelo Tosatti
2012-08-21 11:26 ` [PATCH v4 4/8] KVM-HV: " Nikunj A. Dadhania
` (6 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:26 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
This patch adds the guest code for an MSR shared between guest and
hypervisor. The MSR exports the vcpu running/preempted state from the
host to the guest. This lets the guest send IPIs only to running vcpus
and set a flag for preempted vcpus, preventing waits on vcpus that are
not running.
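The MSR value packs the physical address of the per-cpu vcpu_state area
with an enable bit. A toy encode/decode, matching the shape of the
interface (4K pages assumed; helper names are illustrative):

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define KVM_MSR_ENABLED	1UL

/* Guest side: the value wrmsrl(MSR_KVM_VCPU_STATE, ...) writes. */
uint64_t vcpu_state_msr(uint64_t pa)
{
	return pa | KVM_MSR_ENABLED;
}

/* Host side: split the value back into a page frame number and an
 * in-page offset, as the MSR handler does before pinning the page.
 * The structure is aligned, so masking off the enable bit is safe. */
uint64_t msr_gfn(uint64_t data)    { return data >> PAGE_SHIFT; }
uint64_t msr_offset(uint64_t data) { return data & ~(PAGE_MASK | KVM_MSR_ENABLED); }
```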
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
arch/x86/include/asm/kvm_para.h | 13 +++++++++++++
arch/x86/kernel/kvm.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 2f7712e..5dfb975 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -23,6 +23,7 @@
#define KVM_FEATURE_ASYNC_PF 4
#define KVM_FEATURE_STEAL_TIME 5
#define KVM_FEATURE_PV_EOI 6
+#define KVM_FEATURE_VCPU_STATE 7
/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
#define MSR_KVM_STEAL_TIME 0x4b564d03
#define MSR_KVM_PV_EOI_EN 0x4b564d04
+#define MSR_KVM_VCPU_STATE 0x4b564d05
struct kvm_steal_time {
__u64 steal;
@@ -51,6 +53,17 @@ struct kvm_steal_time {
#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+struct kvm_vcpu_state {
+ __u64 state;
+ __u32 pad[14];
+};
+/* bits in vcpu_state->state */
+#define KVM_VCPU_STATE_IN_GUEST_MODE 0
+#define KVM_VCPU_STATE_SHOULD_FLUSH 1
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
#define KVM_MAX_MMU_OP_BATCH 32
#define KVM_ASYNC_PF_ENABLED (1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..37e6599 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -66,6 +66,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
static int has_steal_clock = 0;
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
/*
* No need for any "IO delay" on KVM
*/
@@ -302,6 +305,22 @@ static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
apic_write(APIC_EOI, APIC_EOI_ACK);
}
+static void kvm_register_vcpu_state(void)
+{
+ int cpu = smp_processor_id();
+ struct kvm_vcpu_state *v_state;
+
+ if (!has_vcpu_state)
+ return;
+
+ v_state = &per_cpu(vcpu_state, cpu);
+ memset(v_state, 0, sizeof(*v_state));
+
+ wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+ printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lx\n",
+ cpu, __pa(v_state));
+}
+
void __cpuinit kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
@@ -330,6 +349,9 @@ void __cpuinit kvm_guest_cpu_init(void)
if (has_steal_clock)
kvm_register_steal_time();
+
+ if (has_vcpu_state)
+ kvm_register_vcpu_state();
}
static void kvm_pv_disable_apf(void)
@@ -393,6 +415,14 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}
+void kvm_disable_vcpu_state(void)
+{
+ if (!has_vcpu_state)
+ return;
+
+ wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
static void kvm_guest_cpu_offline(void *dummy)
{
+ kvm_disable_vcpu_state();
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
@@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
+#ifdef CONFIG_PARAVIRT_TLB_FLUSH
+ if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
+ has_vcpu_state = 1;
+#endif
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);
* [PATCH v4 4/8] KVM-HV: Add VCPU running/pre-empted state for guest
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (2 preceding siblings ...)
2012-08-21 11:26 ` [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
@ 2012-08-21 11:26 ` Nikunj A. Dadhania
2012-08-23 11:46 ` Marcelo Tosatti
2012-08-21 11:27 ` [PATCH v4 5/8] KVM Guest: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
` (5 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:26 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Hypervisor code to indicate the guest running/preempted status through
an MSR. The page is now pinned at MSR-write time, and
kmap_atomic/kunmap_atomic are used to access the shared vcpu_state
area.
Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
arch/x86/include/asm/kvm_host.h | 7 +++
arch/x86/kvm/cpuid.c | 1
arch/x86/kvm/x86.c | 88 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 94 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..441348f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -429,6 +429,13 @@ struct kvm_vcpu_arch {
struct kvm_steal_time steal;
} st;
+ /* indicates vcpu is running or preempted */
+ struct {
+ u64 msr_val;
+ struct page *vs_page;
+ unsigned int vs_offset;
+ } v_state;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..37ab364 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -411,6 +411,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
+ (1 << KVM_FEATURE_VCPU_STATE) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59b5950..43f2c19 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -806,13 +806,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 9
+#define KVM_SAVE_MSRS_BEGIN 10
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
- MSR_KVM_PV_EOI_EN,
+ MSR_KVM_VCPU_STATE, MSR_KVM_PV_EOI_EN,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
#ifdef CONFIG_X86_64
@@ -1557,6 +1557,63 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
}
+static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
+{
+ int loop = 1000000;
+ while (1) {
+ if (cmpxchg(addr, old, new) == old)
+ break;
+ loop--;
+ if (!loop) {
+ pr_info("atomic cur: %lx old: %lx new: %lx\n",
+ *addr, old, new);
+ break;
+ }
+ }
+}
+
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_state *vs;
+ char *kaddr;
+
+ if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
+ vcpu->arch.v_state.vs_page))
+ return;
+
+ kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
+ kaddr += vcpu->arch.v_state.vs_offset;
+ vs = kaddr;
+ kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
+ kunmap_atomic(kaddr);
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_state *vs;
+ char *kaddr;
+
+ if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
+ vcpu->arch.v_state.vs_page))
+ return;
+
+ kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
+ kaddr += vcpu->arch.v_state.vs_offset;
+ vs = kaddr;
+ kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
+ kunmap_atomic(kaddr);
+}
+
+static void kvm_vcpu_state_reset(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.v_state.msr_val = 0;
+ vcpu->arch.v_state.vs_offset = 0;
+ if (vcpu->arch.v_state.vs_page) {
+ kvm_release_page_dirty(vcpu->arch.v_state.vs_page);
+ vcpu->arch.v_state.vs_page = NULL;
+ }
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
bool pr = false;
@@ -1676,6 +1733,24 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return 1;
break;
+ case MSR_KVM_VCPU_STATE:
+ kvm_vcpu_state_reset(vcpu);
+
+ if (!(data & KVM_MSR_ENABLED))
+ break;
+
+ vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+
+ if (is_error_page(vcpu->arch.v_state.vs_page)) {
+ kvm_release_page_clean(vcpu->arch.v_state.vs_page);
+ vcpu->arch.v_state.vs_page = NULL;
+ pr_info("KVM: VCPU_STATE - Unable to pin the page\n");
+ break;
+ }
+ vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
+ vcpu->arch.v_state.msr_val = data;
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1996,6 +2071,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_STEAL_TIME:
data = vcpu->arch.st.msr_val;
break;
+ case MSR_KVM_VCPU_STATE:
+ data = vcpu->arch.v_state.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -5312,6 +5390,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_load_guest_fpu(vcpu);
kvm_load_guest_xcr0(vcpu);
+ kvm_set_vcpu_state(vcpu);
+
vcpu->mode = IN_GUEST_MODE;
/* We should set ->mode before check ->requests,
@@ -5323,6 +5403,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
|| need_resched() || signal_pending(current)) {
+ kvm_clear_vcpu_state(vcpu);
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
@@ -5361,6 +5442,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+ kvm_clear_vcpu_state(vcpu);
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
@@ -5987,6 +6069,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
{
kvmclock_reset(vcpu);
+ kvm_vcpu_state_reset(vcpu);
free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
fx_free(vcpu);
@@ -6045,6 +6128,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
vcpu->arch.st.msr_val = 0;
kvmclock_reset(vcpu);
+ kvm_vcpu_state_reset(vcpu);
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
* [PATCH v4 5/8] KVM Guest: Add paravirt kvm_flush_tlb_others
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (3 preceding siblings ...)
2012-08-21 11:26 ` [PATCH v4 4/8] KVM-HV: " Nikunj A. Dadhania
@ 2012-08-21 11:27 ` Nikunj A. Dadhania
2012-08-21 11:27 ` [PATCH v4 6/8] KVM-HV: Add flush_on_enter before guest enter Nikunj A. Dadhania
` (4 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:27 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt it for
paravirtualization.
Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to preempted vcpus.
* Do not send IPIs to sleeping vcpus; set their flush_on_enter flag instead
* For running vcpus: wait for them to clear the flag
The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
v3:
* use only one state variable for vcpu-running/flush_on_enter
* use cmpxchg to update the state
* adapt to Alex Shi's TLB flush optimization
v2:
* use ACCESS_ONCE so the value is not register cached
* Separate HV and Guest code
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
--
Pseudo Algo:
------------
Hypervisor
==========
guest_exit()
	if (xchg(state, NOT_IN_GUEST) == SHOULD_FLUSH)
		tlb_flush(vcpu);
guest_enter()
	if (xchg(state, IN_GUEST) == SHOULD_FLUSH)
		tlb_flush(vcpu);
Guest
=====
flushmask = cpumask;
for_each_cpu(i, flushmask) {
	state = vs->state;
	if (!test_bit(IN_GUEST_MODE, state)) {
		if (cmpxchg(&vs->state, state,
			    state | (1 << SHOULD_FLUSH)) == SUCCESS)
			cpumask_clear_cpu(i, flushmask);
	}
}
smp_call_function_many(flushmask, flush_tlb_func)
---
arch/x86/include/asm/tlbflush.h | 11 +++++++++++
arch/x86/kernel/kvm.c | 4 +++-
arch/x86/mm/tlb.c | 36 ++++++++++++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 74a4433..0a343a1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,13 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
{
}
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+}
+
static inline void reset_lazy_tlbstate(void)
{
}
@@ -153,6 +160,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm,
unsigned long start, unsigned long end);
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end);
+
#define TLBSTATE_OK 1
#define TLBSTATE_LAZY 2
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 37e6599..b538a31 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -501,8 +501,10 @@ void __init kvm_guest_init(void)
apic_set_eoi_write(kvm_guest_apic_eoi_write);
#ifdef CONFIG_PARAVIRT_TLB_FLUSH
- if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
+ if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE)) {
has_vcpu_state = 1;
+ pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
+ }
#endif
#ifdef CONFIG_SMP
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 613cd83..645df99 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/cpu.h>
+#include <linux/kvm_para.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -119,6 +120,41 @@ static void flush_tlb_func(void *info)
}
+#ifdef CONFIG_KVM_GUEST
+
+DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end)
+{
+ struct flush_tlb_info info;
+ struct kvm_vcpu_state *v_state;
+ u64 state;
+ int cpu;
+ cpumask_t flushmask;
+
+ cpumask_copy(&flushmask, cpumask);
+ info.flush_mm = mm;
+ info.flush_start = start;
+ info.flush_end = end;
+ /*
+ * We have to call flush only on online vCPUs. And
+ * queue flush_on_enter for pre-empted vCPUs
+ */
+ for_each_cpu(cpu, to_cpumask(&flushmask)) {
+ v_state = &per_cpu(vcpu_state, cpu);
+ state = v_state->state;
+ if (!test_bit(KVM_VCPU_STATE_IN_GUEST_MODE, &state)) {
+ if (cmpxchg(&v_state->state, state, state | 1 << KVM_VCPU_STATE_SHOULD_FLUSH))
+ cpumask_clear_cpu(cpu, to_cpumask(&flushmask));
+ }
+ }
+
+ smp_call_function_many(&flushmask, flush_tlb_func, &info, 1);
+}
+#endif /* CONFIG_KVM_GUEST */
+
void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm, unsigned long start,
unsigned long end)
* [PATCH v4 6/8] KVM-HV: Add flush_on_enter before guest enter
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (4 preceding siblings ...)
2012-08-21 11:27 ` [PATCH v4 5/8] KVM Guest: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
@ 2012-08-21 11:27 ` Nikunj A. Dadhania
2012-08-21 11:27 ` [PATCH v4 7/8] Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled Nikunj A. Dadhania
` (3 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:27 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
A PV-Flush guest can indicate flush-on-enter: flush the TLB before
entering the guest, and turn a flush requested at exit time into a
KVM_REQ_TLB_FLUSH.
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
arch/x86/kvm/x86.c | 28 ++++++++++++----------------
1 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43f2c19..07fdb0f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1557,20 +1557,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
}
-static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
-{
- int loop = 1000000;
- while (1) {
- if (cmpxchg(addr, old, new) == old)
- break;
- loop--;
- if (!loop) {
- pr_info("atomic cur: %lx old: %lx new: %lx\n",
- *addr, old, new);
- break;
- }
- }
-}
+#define VS_NOT_IN_GUEST (0)
+#define VS_IN_GUEST (1 << KVM_VCPU_STATE_IN_GUEST_MODE)
+#define VS_SHOULD_FLUSH (1 << KVM_VCPU_STATE_SHOULD_FLUSH)
static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
{
@@ -1584,7 +1573,13 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
kaddr += vcpu->arch.v_state.vs_offset;
vs = kaddr;
- kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
+ if (xchg(&vs->state, VS_IN_GUEST) == VS_SHOULD_FLUSH) {
+ /*
+ * Do TLB_FLUSH before entering the guest; it's past
+ * the stage of request checking
+ */
+ kvm_x86_ops->tlb_flush(vcpu);
+ }
kunmap_atomic(kaddr);
}
@@ -1600,7 +1595,8 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
kaddr += vcpu->arch.v_state.vs_offset;
vs = kaddr;
- kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
+ if (xchg(&vs->state, VS_NOT_IN_GUEST) == VS_SHOULD_FLUSH)
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
kunmap_atomic(kaddr);
}
* [PATCH v4 7/8] Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (5 preceding siblings ...)
2012-08-21 11:27 ` [PATCH v4 6/8] KVM-HV: Add flush_on_enter before guest enter Nikunj A. Dadhania
@ 2012-08-21 11:27 ` Nikunj A. Dadhania
2012-08-21 11:28 ` [PATCH v4 8/8] KVM-doc: Add paravirt tlb flush document Nikunj A. Dadhania
` (2 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:27 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
arch/x86/Kconfig | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c70684f..354160d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -612,6 +612,17 @@ config PARAVIRT_SPINLOCKS
If you are unsure how to answer this question, answer N.
+config PARAVIRT_TLB_FLUSH
+ bool "Paravirtualization layer for TLB Flush"
+ depends on PARAVIRT && SMP && EXPERIMENTAL
+ select HAVE_RCU_TABLE_FREE
+ ---help---
+	  Paravirtualized TLB flush replaces the native implementation
+	  with something virtualization-friendly (for example, set a
+	  flag for a sleeping vcpu and do not wait for it).
+
+ If you are unsure how to answer this question, answer N.
+
config PARAVIRT_CLOCK
bool
* [PATCH v4 8/8] KVM-doc: Add paravirt tlb flush document
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (6 preceding siblings ...)
2012-08-21 11:27 ` [PATCH v4 7/8] Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled Nikunj A. Dadhania
@ 2012-08-21 11:28 ` Nikunj A. Dadhania
2012-08-23 11:48 ` [PATCH v4 0/8] KVM paravirt remote flush tlb Marcelo Tosatti
2012-09-03 14:33 ` Avi Kivity
9 siblings, 0 replies; 23+ messages in thread
From: Nikunj A. Dadhania @ 2012-08-21 11:28 UTC (permalink / raw)
To: mtosatti, avi
Cc: raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
Documentation/virtual/kvm/msr.txt | 4 ++
Documentation/virtual/kvm/paravirt-tlb-flush.txt | 53 ++++++++++++++++++++++
2 files changed, 57 insertions(+), 0 deletions(-)
create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 7304710..92a6af6 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -256,3 +256,7 @@ MSR_KVM_EOI_EN: 0x4b564d04
guest must both read the least significant bit in the memory area and
clear it using a single CPU instruction, such as test and clear, or
compare and exchange.
+
+MSR_KVM_VCPU_STATE: 0x4b564d05
+
+Refer: Documentation/virtual/kvm/paravirt-tlb-flush.txt
diff --git a/Documentation/virtual/kvm/paravirt-tlb-flush.txt b/Documentation/virtual/kvm/paravirt-tlb-flush.txt
new file mode 100644
index 0000000..0eaabd7
--- /dev/null
+++ b/Documentation/virtual/kvm/paravirt-tlb-flush.txt
@@ -0,0 +1,53 @@
+KVM - Paravirt TLB Flush
+Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>, IBM, 2012
+========================================================
+
+The remote flushing APIs do a busy wait, which is fine in a bare-metal
+scenario. But within the guest, the vcpus might have been pre-empted
+or blocked. In this scenario, the initiator vcpu would end up
+busy-waiting for a long time.
+
+Deciding whether to wait requires knowing, from within the guest,
+whether a target vcpu is running or not. The following MSR introduces
+vcpu running-state information.
+
+Using this MSR we have implemented para-virt TLB flush, making sure
+that the initiator does not wait for vcpus that are not running. TLB
+flushing for those vcpus is deferred and done on guest enter.
+
+MSR_KVM_VCPU_STATE: 0x4b564d05
+
+	data: 64-byte aligned physical address of a memory area, which
+	must be in guest RAM, plus an enable bit in bit 0. This memory
+	is expected to hold a copy of the following structure:
+
+	struct kvm_vcpu_state {
+		__u64 state;
+		__u32 pad[14];
+	}
+
+	whose data will be filled in by the hypervisor/guest. Only one
+	write, or registration, is needed for each VCPU. The interval
+	between updates of this structure is arbitrary and
+	implementation-dependent. The hypervisor may update this
+	structure at any time it sees fit until anything with bit0 ==
+	0 is written to it. The guest is required to make sure this
+	structure is initialized to zero.
+
+	This enables a VCPU to know the running status of sibling
+	VCPUs. The information can further be used to decide whether
+	an IPI needs to be sent to a non-running VCPU at all, avoiding
+	an unnecessary wait for it (e.g. in flush_tlb_others_ipi).
+
+ Fields have the following meanings:
+
+	state: has the following bit fields:
+
+		Bit 0 - vcpu running state. The hypervisor sets this
+			to indicate whether the vcpu is running or
+			not. Value 1 means the vcpu is running and
+			value 0 means the vcpu is pre-empted out.
+
+		Bit 1 - flush request. When set, the hypervisor
+			flushes the vcpu's TLB during guest enter/exit.
+
* Re: [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest
2012-08-21 11:26 ` [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
@ 2012-08-23 9:36 ` Marcelo Tosatti
2012-08-24 5:39 ` Nikunj A Dadhania
0 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2012-08-23 9:36 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Tue, Aug 21, 2012 at 04:56:35PM +0530, Nikunj A. Dadhania wrote:
> From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
>
> The patch adds the guest code for the msr shared between guest and
> hypervisor. The msr exports the vcpu running/pre-empted information
> from the host to the guest. This enables the guest to intelligently
> send ipis to running vcpus and set a flag for pre-empted vcpus,
> preventing waits for vcpus that are not running.
>
> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> ---
> arch/x86/include/asm/kvm_para.h | 13 +++++++++++++
> arch/x86/kernel/kvm.c | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 49 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 2f7712e..5dfb975 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -23,6 +23,7 @@
> #define KVM_FEATURE_ASYNC_PF 4
> #define KVM_FEATURE_STEAL_TIME 5
> #define KVM_FEATURE_PV_EOI 6
> +#define KVM_FEATURE_VCPU_STATE 7
>
> /* The last 8 bits are used to indicate how to interpret the flags field
> * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -39,6 +40,7 @@
> #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> #define MSR_KVM_STEAL_TIME 0x4b564d03
> #define MSR_KVM_PV_EOI_EN 0x4b564d04
> +#define MSR_KVM_VCPU_STATE 0x4b564d05
>
> struct kvm_steal_time {
> __u64 steal;
> @@ -51,6 +53,17 @@ struct kvm_steal_time {
> #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
> #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
>
> +struct kvm_vcpu_state {
> + __u64 state;
> + __u32 pad[14];
> +};
> +/* bits in vcpu_state->state */
> +#define KVM_VCPU_STATE_IN_GUEST_MODE 0
> +#define KVM_VCPU_STATE_SHOULD_FLUSH 1
> +
> +#define KVM_VCPU_STATE_ALIGN_BITS 5
> +#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
> +
> #define KVM_MAX_MMU_OP_BATCH 32
>
> #define KVM_ASYNC_PF_ENABLED (1 << 0)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index c1d61ee..37e6599 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -66,6 +66,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
> static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
> static int has_steal_clock = 0;
>
> +DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
> +static int has_vcpu_state;
> +
> /*
> * No need for any "IO delay" on KVM
> */
> @@ -302,6 +305,22 @@ static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
> apic_write(APIC_EOI, APIC_EOI_ACK);
> }
>
> +static void kvm_register_vcpu_state(void)
> +{
> + int cpu = smp_processor_id();
> + struct kvm_vcpu_state *v_state;
> +
> + if (!has_vcpu_state)
> + return;
> +
> + v_state = &per_cpu(vcpu_state, cpu);
> + memset(v_state, 0, sizeof(*v_state));
> +
> + wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
> + printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lx\n",
> + cpu, __pa(v_state));
> +}
> +
> void __cpuinit kvm_guest_cpu_init(void)
> {
> if (!kvm_para_available())
> @@ -330,6 +349,9 @@ void __cpuinit kvm_guest_cpu_init(void)
>
> if (has_steal_clock)
> kvm_register_steal_time();
> +
> + if (has_vcpu_state)
> + kvm_register_vcpu_state();
> }
>
> static void kvm_pv_disable_apf(void)
> @@ -393,6 +415,14 @@ void kvm_disable_steal_time(void)
> wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
> }
>
> +void kvm_disable_vcpu_state(void)
> +{
> + if (!has_vcpu_state)
> + return;
> +
> + wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
wrmsrl (to be consistent).
> +}
> +
> #ifdef CONFIG_SMP
> static void __init kvm_smp_prepare_boot_cpu(void)
> {
> @@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
>
> static void kvm_guest_cpu_offline(void *dummy)
> {
> + kvm_disable_vcpu_state();
Should disable MSR at kvm_pv_guest_cpu_reboot.
> kvm_disable_steal_time();
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> wrmsrl(MSR_KVM_PV_EOI_EN, 0);
> @@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> apic_set_eoi_write(kvm_guest_apic_eoi_write);
>
> +#ifdef CONFIG_PARAVIRT_TLB_FLUSH
> + if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
> + has_vcpu_state = 1;
> +#endif
Why only this hunk guarded by CONFIG_PARAVIRT_TLB_FLUSH and not
the rest of the code?
Is there a switch to enable/disable this feature on the kernel
command line? Grep for early_param in kvm.c.
* Re: [PATCH v4 4/8] KVM-HV: Add VCPU running/pre-empted state for guest
2012-08-21 11:26 ` [PATCH v4 4/8] KVM-HV: " Nikunj A. Dadhania
@ 2012-08-23 11:46 ` Marcelo Tosatti
2012-08-24 5:19 ` Nikunj A Dadhania
0 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2012-08-23 11:46 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Tue, Aug 21, 2012 at 04:56:43PM +0530, Nikunj A. Dadhania wrote:
> From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
>
> Hypervisor code to indicate guest running/pre-empted status through
> an msr. The page is now pinned at MSR write time, and
> kmap_atomic/kunmap_atomic are used to access the shared vcpu_state
> area.
>
> Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> ---
> arch/x86/include/asm/kvm_host.h | 7 +++
> arch/x86/kvm/cpuid.c | 1
> arch/x86/kvm/x86.c | 88 ++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 94 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09155d6..441348f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -429,6 +429,13 @@ struct kvm_vcpu_arch {
> struct kvm_steal_time steal;
> } st;
>
> + /* indicates vcpu is running or preempted */
> + struct {
> + u64 msr_val;
> + struct page *vs_page;
> + unsigned int vs_offset;
> + } v_state;
> +
> u64 last_guest_tsc;
> u64 last_kernel_ns;
> u64 last_host_tsc;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0595f13..37ab364 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -411,6 +411,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> (1 << KVM_FEATURE_CLOCKSOURCE2) |
> (1 << KVM_FEATURE_ASYNC_PF) |
> (1 << KVM_FEATURE_PV_EOI) |
> + (1 << KVM_FEATURE_VCPU_STATE) |
> (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>
> if (sched_info_on())
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 59b5950..43f2c19 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -806,13 +806,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
> * kvm-specific. Those are put in the beginning of the list.
> */
>
> -#define KVM_SAVE_MSRS_BEGIN 9
> +#define KVM_SAVE_MSRS_BEGIN 10
> static u32 msrs_to_save[] = {
> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> - MSR_KVM_PV_EOI_EN,
> + MSR_KVM_VCPU_STATE, MSR_KVM_PV_EOI_EN,
> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> MSR_STAR,
> #ifdef CONFIG_X86_64
> @@ -1557,6 +1557,63 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
> }
>
> +static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
> +{
> + int loop = 1000000;
> + while (1) {
> + if (cmpxchg(addr, old, new) == old)
> + break;
> + loop--;
> + if (!loop) {
> + pr_info("atomic cur: %lx old: %lx new: %lx\n",
> + *addr, old, new);
> + break;
> + }
> + }
> +}
> +
> +static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_vcpu_state *vs;
> + char *kaddr;
> +
> + if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
> + vcpu->arch.v_state.vs_page))
> + return;
It was agreed it was necessary to have valid vs_page only if MSR was
enabled? Or was that a misunderstanding?
Looks good otherwise.
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (7 preceding siblings ...)
2012-08-21 11:28 ` [PATCH v4 8/8] KVM-doc: Add paravirt tlb flush document Nikunj A. Dadhania
@ 2012-08-23 11:48 ` Marcelo Tosatti
2012-09-03 14:33 ` Avi Kivity
9 siblings, 0 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2012-08-23 11:48 UTC (permalink / raw)
To: Nikunj A. Dadhania, Peter Zijlstra, Avi Kivity
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Tue, Aug 21, 2012 at 04:55:52PM +0530, Nikunj A. Dadhania wrote:
> The remote flushing APIs do a busy wait, which is fine in a bare-metal
> scenario. But within the guest, the vcpus might have been pre-empted
> or blocked. In this scenario, the initiator vcpu would end up
> busy-waiting for a long time.
>
> This was discovered in our gang-scheduling tests, and one way to
> solve it is by para-virtualizing flush_tlb_others_ipi (which now
> shows up as smp_call_function_many after Alex Shi's TLB optimization).
>
> This patch set implements para-virt TLB flush, making sure that the
> initiator does not wait for vcpus that are sleeping; the sleeping
> vcpus instead flush the TLB on guest enter. The idea was discussed here:
> https://lkml.org/lkml/2012/2/20/157
>
> This also brings in one more dependency: the lock-less page walk
> performed by get_user_pages_fast (gup_fast). gup_fast disables
> interrupts and assumes that the pages will not be freed during that
> period. That was fine, as flush_tlb_others_ipi would wait for all
> the IPIs to be processed before returning. With the new approach of
> not waiting for the sleeping vcpus, this assumption is no longer
> valid. So HAVE_RCU_TABLE_FREE is now used to free the pages, which
> makes sure that all the cpus process the smp callback at least once
> before the pages are freed.
>
> Changelog from v3:
> • Add helper for cleaning up vcpu_state information (Marcelo)
> • Fix code for checking vs_page and leaking page refs (Marcelo)
>
> Changelog from v2:
> • Rebase to 3.5 based linus(commit - f7da9cd) kernel.
> • Port PV-Flush to new TLB-Optimization code by Alex Shi
> • Use pinned pages to avoid overhead during guest enter/exit (Marcelo)
> • Remove kick, as this is not improving much
> • Use bit fields in the state(flush_on_enter and vcpu_running) flag to
> avoid smp barriers (Marcelo)
>
> Changelog from v1:
> • Race fixes reported by Vatsa
> • Address gup_fast dependency using PeterZ's rcu table free patch
> • Fix rcu_table_free for hw pagetable walkers
>
> Here are the results from PLE hardware. Here is the setup details:
> • 32 CPUs (HT disabled)
> • 64-bit VM
> • 32vcpus
> • 8GB RAM
>
> base = 3.6-rc1 + ple handler optimization patch
> pvflushv4 = 3.6-rc1 + ple handler optimization patch + pvflushv4 patch
>
> kernbench(lower is better)
> ==========================
> base pvflushv4 %improvement
> 1VM 48.5800 46.8513 3.55846
> 2VM 108.1823 104.6410 3.27346
> 3VM 183.2733 163.3547 10.86825
>
> ebizzy(higher is better)
> ========================
> base pvflushv4 %improvement
> 1VM 2414.5000 2089.8750 -13.44481
> 2VM 2167.6250 2371.7500 9.41699
> 3VM 1600.1111 2102.5556 31.40060
>
> Thanks Raghu for running the tests.
>
> [1] http://article.gmane.org/gmane.linux.kernel/1329752
>
> ---
>
> Nikunj A. Dadhania (6):
> KVM Guest: Add VCPU running/pre-empted state for guest
> KVM-HV: Add VCPU running/pre-empted state for guest
> KVM Guest: Add paravirt kvm_flush_tlb_others
> KVM-HV: Add flush_on_enter before guest enter
> Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled
> KVM-doc: Add paravirt tlb flush document
>
> Peter Zijlstra (2):
> mm, x86: Add HAVE_RCU_TABLE_FREE support
> mm: Add missing TLB invalidate to RCU page-table freeing
>
>
> Documentation/virtual/kvm/msr.txt | 4 +
> Documentation/virtual/kvm/paravirt-tlb-flush.txt | 53 ++++++++++++++
> arch/Kconfig | 3 +
> arch/powerpc/Kconfig | 1
> arch/sparc/Kconfig | 1
> arch/x86/Kconfig | 11 +++
> arch/x86/include/asm/kvm_host.h | 7 ++
> arch/x86/include/asm/kvm_para.h | 13 +++
> arch/x86/include/asm/tlb.h | 1
> arch/x86/include/asm/tlbflush.h | 11 +++
> arch/x86/kernel/kvm.c | 38 ++++++++++
> arch/x86/kvm/cpuid.c | 1
> arch/x86/kvm/x86.c | 84 +++++++++++++++++++++-
> arch/x86/mm/pgtable.c | 6 +-
> arch/x86/mm/tlb.c | 36 +++++++++
> include/asm-generic/tlb.h | 9 ++
> mm/memory.c | 43 ++++++++++-
> 17 files changed, 311 insertions(+), 11 deletions(-)
> create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt
Avi, PeterZ can you please review?
* Re: [PATCH v4 4/8] KVM-HV: Add VCPU running/pre-empted state for guest
2012-08-23 11:46 ` Marcelo Tosatti
@ 2012-08-24 5:19 ` Nikunj A Dadhania
2012-08-24 15:02 ` Marcelo Tosatti
0 siblings, 1 reply; 23+ messages in thread
From: Nikunj A Dadhania @ 2012-08-24 5:19 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Thu, 23 Aug 2012 08:46:22 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Aug 21, 2012 at 04:56:43PM +0530, Nikunj A. Dadhania wrote:
> > From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> >
> > Hypervisor code to indicate guest running/pre-empteded status through
> > msr. The page is now pinned during MSR write time and use
> > kmap_atomic/kunmap_atomic to access the shared area vcpu_state area.
> >
> > Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
> > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > ---
[...]
> > +
> > +static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_vcpu_state *vs;
> > + char *kaddr;
> > +
> > + if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
> > + vcpu->arch.v_state.vs_page))
> > + return;
>
> It was agreed it was necessary to have valid vs_page only if MSR was
> enabled? Or was that a misunderstanding?
>
There is a case where the MSR is enabled but vs_page is NULL; this is
guarding against that case. The check is now:
if (!(msr_enabled && vs_page))
return;
I had proposed that here:
http://www.spinics.net/lists/kvm/msg77147.html
Regards
Nikunj
* Re: [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest
2012-08-23 9:36 ` Marcelo Tosatti
@ 2012-08-24 5:39 ` Nikunj A Dadhania
2012-08-24 15:02 ` Marcelo Tosatti
0 siblings, 1 reply; 23+ messages in thread
From: Nikunj A Dadhania @ 2012-08-24 5:39 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Thu, 23 Aug 2012 06:36:43 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Aug 21, 2012 at 04:56:35PM +0530, Nikunj A. Dadhania wrote:
> >
> > +void kvm_disable_vcpu_state(void)
> > +{
> > + if (!has_vcpu_state)
> > + return;
> > +
> > + wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
>
> wrmsrl (to be consistent).
>
Sure, will change
> > +}
> > +
> > #ifdef CONFIG_SMP
> > static void __init kvm_smp_prepare_boot_cpu(void)
> > {
> > @@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
> >
> > static void kvm_guest_cpu_offline(void *dummy)
> > {
> > + kvm_disable_vcpu_state();
>
> Should disable MSR at kvm_pv_guest_cpu_reboot.
>
Sure, can you explain the difference for my understanding?
> > kvm_disable_steal_time();
> > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > wrmsrl(MSR_KVM_PV_EOI_EN, 0);
> > @@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
> > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> >
> > +#ifdef CONFIG_PARAVIRT_TLB_FLUSH
> > + if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
> > + has_vcpu_state = 1;
> > +#endif
>
> Why only this hunk guarded by CONFIG_PARAVIRT_TLB_FLUSH and not
> the rest of the code?
>
The guest should have been compiled with CONFIG_PARAVIRT_TLB_FLUSH, as
the config also brings the HAVE_RCU_TABLE_FREE code into the picture.
We should not enable this code without HAVE_RCU_TABLE_FREE.
I did not want to spray this across all the code; the compiler will
take care of throwing out kvm_tlb_flush_others.
> Is there a switch to enable/disable this feature on the kernel
> command line?
>
No, I haven't added it.
> Grep for early_param in kvm.c.
>
Let me know if that is required.
Regards
Nikunj
* Re: [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest
2012-08-24 5:39 ` Nikunj A Dadhania
@ 2012-08-24 15:02 ` Marcelo Tosatti
2012-08-27 4:24 ` Nikunj A Dadhania
0 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2012-08-24 15:02 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Fri, Aug 24, 2012 at 11:09:39AM +0530, Nikunj A Dadhania wrote:
> On Thu, 23 Aug 2012 06:36:43 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Tue, Aug 21, 2012 at 04:56:35PM +0530, Nikunj A. Dadhania wrote:
> > >
> > > +void kvm_disable_vcpu_state(void)
> > > +{
> > > + if (!has_vcpu_state)
> > > + return;
> > > +
> > > + wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
> >
> > wrmsrl (to be consistent).
> >
> Sure, will change
>
> > > +}
> > > +
> > > #ifdef CONFIG_SMP
> > > static void __init kvm_smp_prepare_boot_cpu(void)
> > > {
> > > @@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
> > >
> > > static void kvm_guest_cpu_offline(void *dummy)
> > > {
> > > + kvm_disable_vcpu_state();
> >
> > Should disable MSR at kvm_pv_guest_cpu_reboot.
> >
> Sure, can you explain the difference for my understanding?
These are different callbacks. One is used for CPU offline, the
other during reboot.
> > > kvm_disable_steal_time();
> > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > > wrmsrl(MSR_KVM_PV_EOI_EN, 0);
> > > @@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
> > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> > >
> > > +#ifdef CONFIG_PARAVIRT_TLB_FLUSH
> > > + if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
> > > + has_vcpu_state = 1;
> > > +#endif
> >
> > Why only this hunk guarded by CONFIG_PARAVIRT_TLB_FLUSH and not
> > the rest of the code?
> >
> The guest should have been compiled with CONFIG_PARAVIRT_TLB_FLUSH, as
> the config also brings in HAVE_RCU_TABLE_FREE code into picture. We
> should not enable this code without HAVE_RCU_TABLE_FREE.
>
> Did not want to spray this across all the code, as the compiler will
> take care of throwing out the kvm_tlb_flush_others
>
> > Is there a switch to enable/disable this feature on the kernel
> > command line?
> >
> No, havent added it.
>
> > Grep for early_param in kvm.c.
> >
> Let me know if that is required.
Yes, please add it. It's useful.
* Re: [PATCH v4 4/8] KVM-HV: Add VCPU running/pre-empted state for guest
2012-08-24 5:19 ` Nikunj A Dadhania
@ 2012-08-24 15:02 ` Marcelo Tosatti
0 siblings, 0 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2012-08-24 15:02 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Fri, Aug 24, 2012 at 10:49:26AM +0530, Nikunj A Dadhania wrote:
> On Thu, 23 Aug 2012 08:46:22 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Tue, Aug 21, 2012 at 04:56:43PM +0530, Nikunj A. Dadhania wrote:
> > > From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > >
> > > Hypervisor code to indicate guest running/pre-empteded status through
> > > msr. The page is now pinned during MSR write time and use
> > > kmap_atomic/kunmap_atomic to access the shared area vcpu_state area.
> > >
> > > Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > > ---
>
> [...]
>
> > > +
> > > +static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm_vcpu_state *vs;
> > > + char *kaddr;
> > > +
> > > + if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
> > > + vcpu->arch.v_state.vs_page))
> > > + return;
> >
> > It was agreed it was necessary to have valid vs_page only if MSR was
> > enabled? Or was that a misunderstanding?
> >
> There is a case where MSR is enabled but vs_page is NULL, this is
> gaurding that case. The check is now:
>
> if (!(msr_enabled && vs_page))
> return;
>
> I had proposed that here:
> http://www.spinics.net/lists/kvm/msg77147.html
OK, it's fine.
* Re: [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest
2012-08-24 15:02 ` Marcelo Tosatti
@ 2012-08-27 4:24 ` Nikunj A Dadhania
0 siblings, 0 replies; 23+ messages in thread
From: Nikunj A Dadhania @ 2012-08-27 4:24 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: avi, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Fri, 24 Aug 2012 12:02:27 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Fri, Aug 24, 2012 at 11:09:39AM +0530, Nikunj A Dadhania wrote:
> > On Thu, 23 Aug 2012 06:36:43 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > On Tue, Aug 21, 2012 at 04:56:35PM +0530, Nikunj A. Dadhania wrote:
[...]
> > > > @@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
> > > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > > > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> > > >
> > > > +#ifdef CONFIG_PARAVIRT_TLB_FLUSH
> > > > + if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
> > > > + has_vcpu_state = 1;
> > > > +#endif
> > >
> > > Why only this hunk guarded by CONFIG_PARAVIRT_TLB_FLUSH and not
> > > the rest of the code?
> > >
> > The guest should have been compiled with CONFIG_PARAVIRT_TLB_FLUSH, as
> > the config also brings in HAVE_RCU_TABLE_FREE code into picture. We
> > should not enable this code without HAVE_RCU_TABLE_FREE.
> >
> > Did not want to spray this across all the code, as the compiler will
> > take care of throwing out the kvm_tlb_flush_others
> >
> > > Is there a switch to enable/disable this feature on the kernel
> > > command line?
> > >
> > No, havent added it.
> >
> > > Grep for early_param in kvm.c.
> > >
> > Let me know if that is required.
>
> Yes, please add it. Its useful.
>
Done, will send it in my next version.
Nikunj
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
` (8 preceding siblings ...)
2012-08-23 11:48 ` [PATCH v4 0/8] KVM paravirt remote flush tlb Marcelo Tosatti
@ 2012-09-03 14:33 ` Avi Kivity
2012-09-03 15:27 ` Avi Kivity
2012-09-04 1:30 ` Nikunj A Dadhania
9 siblings, 2 replies; 23+ messages in thread
From: Avi Kivity @ 2012-09-03 14:33 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: mtosatti, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
>
> kernbench(lower is better)
> ==========================
> base pvflushv4 %improvement
> 1VM 48.5800 46.8513 3.55846
> 2VM 108.1823 104.6410 3.27346
> 3VM 183.2733 163.3547 10.86825
>
> ebizzy(higher is better)
> ========================
> base pvflushv4 %improvement
> 1VM 2414.5000 2089.8750 -13.44481
> 2VM 2167.6250 2371.7500 9.41699
> 3VM 1600.1111 2102.5556 31.40060
>
The regression is worrying. We're improving the contended case at the
cost of the non-contended case, this is usually the wrong thing to do.
Do we have any clear idea of the cause of the regression?
--
error compiling committee.c: too many arguments to function
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-09-03 14:33 ` Avi Kivity
@ 2012-09-03 15:27 ` Avi Kivity
2012-09-04 1:30 ` Nikunj A Dadhania
1 sibling, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2012-09-03 15:27 UTC (permalink / raw)
To: Nikunj A. Dadhania
Cc: mtosatti, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On 09/03/2012 05:33 PM, Avi Kivity wrote:
> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
>>
>> kernbench(lower is better)
>> ==========================
>> base pvflushv4 %improvement
>> 1VM 48.5800 46.8513 3.55846
>> 2VM 108.1823 104.6410 3.27346
>> 3VM 183.2733 163.3547 10.86825
>>
>> ebizzy(higher is better)
>> ========================
>> base pvflushv4 %improvement
>> 1VM 2414.5000 2089.8750 -13.44481
>> 2VM 2167.6250 2371.7500 9.41699
>> 3VM 1600.1111 2102.5556 31.40060
>>
>
> The regression is worrying. We're improving the contended case at the
> cost of the non-contended case, this is usually the wrong thing to do.
> Do we have any clear idea of the cause of the regression?
For that matter, why do we get an improvement in the 1VM kbuild test?
All vcpus should be running most of the time, barring I/O thread activity.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-09-03 14:33 ` Avi Kivity
2012-09-03 15:27 ` Avi Kivity
@ 2012-09-04 1:30 ` Nikunj A Dadhania
2012-09-04 7:51 ` Avi Kivity
1 sibling, 1 reply; 23+ messages in thread
From: Nikunj A Dadhania @ 2012-09-04 1:30 UTC (permalink / raw)
To: Avi Kivity
Cc: mtosatti, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity <avi@redhat.com> wrote:
> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
> >
> > kernbench(lower is better)
> > ==========================
> > base pvflushv4 %improvement
> > 1VM 48.5800 46.8513 3.55846
> > 2VM 108.1823 104.6410 3.27346
> > 3VM 183.2733 163.3547 10.86825
> >
> > ebizzy(higher is better)
> > ========================
> > base pvflushv4 %improvement
> > 1VM 2414.5000 2089.8750 -13.44481
> > 2VM 2167.6250 2371.7500 9.41699
> > 3VM 1600.1111 2102.5556 31.40060
> >
>
> The regression is worrying. We're improving the contended case at the
> cost of the non-contended case, this is usually the wrong thing to do.
> Do we have any clear idea of the cause of the regression?
>
Previous perf numbers suggest that in the 1VM scenario
flush_tlb_others_ipi is around 2%, while for the contended case it is
around 10%. That is what is helping the contended case.
Regards,
Nikunj
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-09-04 1:30 ` Nikunj A Dadhania
@ 2012-09-04 7:51 ` Avi Kivity
2012-09-04 8:08 ` Nikunj A Dadhania
0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2012-09-04 7:51 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: mtosatti, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
> On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity <avi@redhat.com> wrote:
>> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
>> >
>> > kernbench(lower is better)
>> > ==========================
>> > base pvflushv4 %improvement
>> > 1VM 48.5800 46.8513 3.55846
>> > 2VM 108.1823 104.6410 3.27346
>> > 3VM 183.2733 163.3547 10.86825
>> >
>> > ebizzy(higher is better)
>> > ========================
>> > base pvflushv4 %improvement
>> > 1VM 2414.5000 2089.8750 -13.44481
>> > 2VM 2167.6250 2371.7500 9.41699
>> > 3VM 1600.1111 2102.5556 31.40060
>> >
>>
>> The regression is worrying. We're improving the contended case at the
>> cost of the non-contended case, this is usually the wrong thing to do.
>> Do we have any clear idea of the cause of the regression?
>>
> Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
> is around 2%, while for the contended case it is around 10%. That is what
> is helping the contended case.
But what is causing the regression for the uncontended case?
--
error compiling committee.c: too many arguments to function
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-09-04 7:51 ` Avi Kivity
@ 2012-09-04 8:08 ` Nikunj A Dadhania
2012-09-04 10:37 ` Avi Kivity
0 siblings, 1 reply; 23+ messages in thread
From: Nikunj A Dadhania @ 2012-09-04 8:08 UTC (permalink / raw)
To: Avi Kivity
Cc: mtosatti, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On Tue, 04 Sep 2012 10:51:06 +0300, Avi Kivity <avi@redhat.com> wrote:
> On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
> > On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity <avi@redhat.com> wrote:
> >> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
> >> >
> >> > kernbench(lower is better)
> >> > ==========================
> >> > base pvflushv4 %improvement
> >> > 1VM 48.5800 46.8513 3.55846
> >> > 2VM 108.1823 104.6410 3.27346
> >> > 3VM 183.2733 163.3547 10.86825
> >> >
> >> > ebizzy(higher is better)
> >> > ========================
> >> > base pvflushv4 %improvement
> >> > 1VM 2414.5000 2089.8750 -13.44481
> >> > 2VM 2167.6250 2371.7500 9.41699
> >> > 3VM 1600.1111 2102.5556 31.40060
> >> >
> >>
> >> The regression is worrying. We're improving the contended case at the
> >> cost of the non-contended case, this is usually the wrong thing to do.
> >> Do we have any clear idea of the cause of the regression?
> >>
> > Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
> > is around 2%, while for the contended case it is around 10%. That is what
> > is helping the contended case.
>
> But what is causing the regression for the uncontended case?
>
I haven't been able to nail that down; any clues on how to profile it
would help.
Regards
Nikunj
* Re: [PATCH v4 0/8] KVM paravirt remote flush tlb
2012-09-04 8:08 ` Nikunj A Dadhania
@ 2012-09-04 10:37 ` Avi Kivity
0 siblings, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2012-09-04 10:37 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: mtosatti, raghukt, alex.shi, kvm, stefano.stabellini, peterz, hpa,
vsrivatsa, mingo
On 09/04/2012 11:08 AM, Nikunj A Dadhania wrote:
> On Tue, 04 Sep 2012 10:51:06 +0300, Avi Kivity <avi@redhat.com> wrote:
>> On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
>> > On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity <avi@redhat.com> wrote:
>> >> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
>> >> >
>> >> > kernbench(lower is better)
>> >> > ==========================
>> >> > base pvflushv4 %improvement
>> >> > 1VM 48.5800 46.8513 3.55846
>> >> > 2VM 108.1823 104.6410 3.27346
>> >> > 3VM 183.2733 163.3547 10.86825
>> >> >
>> >> > ebizzy(higher is better)
>> >> > ========================
>> >> > base pvflushv4 %improvement
>> >> > 1VM 2414.5000 2089.8750 -13.44481
>> >> > 2VM 2167.6250 2371.7500 9.41699
>> >> > 3VM 1600.1111 2102.5556 31.40060
>> >> >
>> >>
>> >> The regression is worrying. We're improving the contended case at the
>> >> cost of the non-contended case, this is usually the wrong thing to do.
>> >> Do we have any clear idea of the cause of the regression?
>> >>
>> > Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
>> > is around 2%, while for the contended case it is around 10%. That is what
>> > is helping the contended case.
>>
>> But what is causing the regression for the uncontended case?
>>
> I haven't been able to nail that down; any clues on how to profile it
> would help.
perf top, perf kvm top, kvm_stat should help.
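A hedged sketch of how those suggestions might be applied here (exact
options depend on the perf version in use; the guest.kallsyms and
guest.modules paths are illustrative files copied out of the guest):

```shell
# Inside the guest: see where guest CPU time is going
perf top

# On the host: sample guest-mode cycles with guest symbol resolution
# (--guestkallsyms/--guestmodules point at copies of the guest's
#  /proc/kallsyms and /proc/modules)
perf kvm --guest --guestkallsyms=./guest.kallsyms \
         --guestmodules=./guest.modules top

# On the host: watch KVM exit and event counters live
kvm_stat
```

Comparing the base and pvflushv4 profiles for the 1VM run should show
which path is absorbing the extra time in the uncontended case.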
--
error compiling committee.c: too many arguments to function
end of thread, newest message: 2012-09-04 10:38 UTC
Thread overview: 23+ messages
2012-08-21 11:25 [PATCH v4 0/8] KVM paravirt remote flush tlb Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 2/8] mm: Add missing TLB invalidate to RCU page-table freeing Nikunj A. Dadhania
2012-08-21 11:26 ` [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
2012-08-23 9:36 ` Marcelo Tosatti
2012-08-24 5:39 ` Nikunj A Dadhania
2012-08-24 15:02 ` Marcelo Tosatti
2012-08-27 4:24 ` Nikunj A Dadhania
2012-08-21 11:26 ` [PATCH v4 4/8] KVM-HV: " Nikunj A. Dadhania
2012-08-23 11:46 ` Marcelo Tosatti
2012-08-24 5:19 ` Nikunj A Dadhania
2012-08-24 15:02 ` Marcelo Tosatti
2012-08-21 11:27 ` [PATCH v4 5/8] KVM Guest: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
2012-08-21 11:27 ` [PATCH v4 6/8] KVM-HV: Add flush_on_enter before guest enter Nikunj A. Dadhania
2012-08-21 11:27 ` [PATCH v4 7/8] Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled Nikunj A. Dadhania
2012-08-21 11:28 ` [PATCH v4 8/8] KVM-doc: Add paravirt tlb flush document Nikunj A. Dadhania
2012-08-23 11:48 ` [PATCH v4 0/8] KVM paravirt remote flush tlb Marcelo Tosatti
2012-09-03 14:33 ` Avi Kivity
2012-09-03 15:27 ` Avi Kivity
2012-09-04 1:30 ` Nikunj A Dadhania
2012-09-04 7:51 ` Avi Kivity
2012-09-04 8:08 ` Nikunj A Dadhania
2012-09-04 10:37 ` Avi Kivity