public inbox for kvm@vger.kernel.org
* [PATCH 0/11] KVM: x86: optimize for guest page written
@ 2011-07-26 11:24 Xiao Guangrong
  2011-07-26 11:25 ` [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write Xiao Guangrong
                   ` (10 more replies)
  0 siblings, 11 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

To keep the shadow pages consistent, we write-protect a guest page if it is
used as a page-table page. Unfortunately, even after the guest page table is
torn down and the page is reused for another purpose, we still write-protect
it, so writes to it cause page faults. In that case we need to zap the
corresponding shadow page and let the guest page become a normal page again;
that is exactly what kvm_mmu_pte_write does. However, it does not always work
well:
- kvm_mmu_pte_write is unsafe: it needs to allocate pte_list_desc objects
  while sptes are prefetched, but we cannot know in advance how many sptes
  will be prefetched on this path. That means we can run out of free
  pte_list_desc objects in the cache and trigger the BUG_ON(). Also, some
  paths do not fill the cache at all, such as emulation of the INS
  instruction, which does not go through the page fault path.


- We usually clear a page with repeat string instructions. For example,
  memset on a page table uses 'stosb' repeated 1024 times, which means we
  take the mmu lock and walk the shadow page cache 1024 times. That is
  terrible for performance.

- Sometimes we modify only the last byte of a pte to update a status bit.
  For example, clear_bit is used to clear the r/w bit in the Linux kernel,
  and it uses the 'andb' instruction. In this case kvm_mmu_pte_write treats
  the write as a misaligned access and zaps the shadow page.

- Write-flooding detection does not work well. When we handle a page write,
  if the last speculative spte has not been accessed, we treat the page as
  write-flooded. However, sptes are speculated on many paths, such as pte
  prefetch and page sync, so the last speculative spte may not point to the
  written page at all, and the written page may be accessed via other sptes.
  Depending only on the Accessed bit of the last speculative spte is
  therefore not enough.
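The cache-exhaustion problem above can be sketched with a tiny standalone
model of a pre-filled object cache: prefetch is only an optimization, so when
no object is left it is simply skipped instead of allocating under the lock
and tripping the BUG_ON(). This is an illustrative sketch, not KVM code;
CACHE_SIZE, struct obj_cache, topup, and try_prefetch are made-up names (the
real structure is kvm_mmu_memory_cache).

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_SIZE 4	/* illustrative; the real per-vcpu cache is also small */

struct obj_cache {
	int nobjs;	/* free pte_list_desc-like objects remaining */
};

static void topup(struct obj_cache *c)
{
	c->nobjs = CACHE_SIZE;	/* refilled outside the mmu lock, may sleep */
}

/*
 * Prefetching one spte may consume an object.  Since prefetch is optional,
 * refuse it when the cache is empty rather than allocating under the mmu
 * lock -- which is the situation the BUG_ON() guards against.
 */
static bool try_prefetch(struct obj_cache *c)
{
	if (c->nobjs == 0)
		return false;	/* cache exhausted: skip the optional prefetch */
	c->nobjs--;
	return true;
}
```

The point of the model: the consumer must check the free count instead of
assuming the fault path pre-filled the cache, because some emulation paths
(e.g. INS) never fill it.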


In this patchset, we fix/avoid these issues:
- Instead of filling the cache on the page fault path, we fill it in
  kvm_mmu_pte_write, and we do not prefetch an spte if no free
  pte_list_desc object is left in the cache.

- If a repeat string instruction is being emulated and the access is not a
  PIO/MMIO access, we zap all the corresponding shadow pages and return to
  the guest; the mapping can then become writable and the guest writes the
  page directly.

- Do not zap the shadow page if only the last byte of a pte is modified.

- Instead of detecting whether the page was accessed, we detect whether the
  spte was accessed. If the spte has not been accessed but is written
  frequently, we treat the page as not being a page table, or as not having
  been used for a long time.
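The last point can be modeled as a per-gfn counter that only advances while
the written page's own spte stays unaccessed. A simplified standalone sketch
follows; struct flood_state, note_write, and the threshold of 3 are
illustrative names and values, not KVM's actual identifiers (the bit position
matches the x86 PTE Accessed bit):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ACCESSED_BIT	(1ull << 5)	/* x86 PTE Accessed bit */
#define FLOOD_THRESHOLD	3		/* writes before the page is zapped */

struct flood_state {
	uint64_t last_gfn;	/* guest frame last written */
	int count;		/* consecutive unaccessed writes to it */
};

/* Returns true when the page looks write-flooded and should be zapped. */
static bool note_write(struct flood_state *s, uint64_t gfn, uint64_t spte)
{
	if (gfn == s->last_gfn && !(spte & ACCESSED_BIT)) {
		if (++s->count >= FLOOD_THRESHOLD)
			return true;
	} else {
		/* different page, or the guest really used the mapping */
		s->last_gfn = gfn;
		s->count = 1;
	}
	return false;
}
```

The key difference from the old heuristic is that the Accessed bit consulted
here belongs to the spte of the written page itself, not to whichever spte
happened to be speculated last.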


Performance test result:
kernbench shows an obvious improvement:

Before patchset      After patchset
3m0.094s             2m50.177s
3m1.813s             2m52.774s
3m6.239s             2m51.512s





^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
@ 2011-07-26 11:25 ` Xiao Guangrong
  2011-07-27  9:00   ` Avi Kivity
  2011-07-26 11:25 ` [PATCH 02/11] KVM: x86: cleanup pio/pout emulated Xiao Guangrong
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

kvm_mmu_pte_write is unsafe: it needs to allocate pte_list_desc objects
while sptes are prefetched, but we cannot know in advance how many sptes
will be prefetched on this path. That means we can run out of free
pte_list_desc objects in the cache and trigger the BUG_ON(). Also, some
paths do not fill the cache at all, such as emulation of the INS
instruction, which does not go through the page fault path.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9335e1b..587695b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -593,6 +593,11 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 	return 0;
 }
 
+static int mmu_memory_cache_free_object(struct kvm_mmu_memory_cache *cache)
+{
+	return cache->nobjs;
+}
+
 static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc,
 				  struct kmem_cache *cache)
 {
@@ -3526,6 +3531,14 @@ static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 		set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
 }
 
+static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_memory_cache *cache;
+
+	cache = &vcpu->arch.mmu_pte_list_desc_cache;
+	return mmu_memory_cache_free_object(cache);
+}
+
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes,
 		       bool guest_initiated)
@@ -3583,6 +3596,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		break;
 	}
 
+	mmu_topup_memory_caches(vcpu);
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
 		gentry = 0;
@@ -3653,7 +3667,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 			mmu_page_zap_pte(vcpu->kvm, sp, spte);
 			if (gentry &&
 			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
-			      & mask.word))
+			      & mask.word) && get_free_pte_list_desc_nr(vcpu))
 				mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
 			if (!remote_flush && need_remote_flush(entry, *spte))
 				remote_flush = true;
@@ -3714,10 +3728,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
 		goto out;
 	}
 
-	r = mmu_topup_memory_caches(vcpu);
-	if (r)
-		goto out;
-
 	er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
 
 	switch (er) {
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 02/11] KVM: x86: cleanup pio/pout emulated
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
  2011-07-26 11:25 ` [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write Xiao Guangrong
@ 2011-07-26 11:25 ` Xiao Guangrong
  2011-07-26 11:26 ` [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions Xiao Guangrong
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Remove the code duplicated between emulator_pio_in_emulated and
emulator_pio_out_emulated.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/x86.c |   59 ++++++++++++++++++++++-----------------------------
 1 files changed, 26 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e80f0d7..5543f99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4320,32 +4320,24 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd)
 	return r;
 }
 
-
-static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
-				    int size, unsigned short port, void *val,
-				    unsigned int count)
+static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
+			       unsigned short port, void *val,
+			       unsigned int count, bool in)
 {
-	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
-	if (vcpu->arch.pio.count)
-		goto data_avail;
-
-	trace_kvm_pio(0, port, size, count);
+	trace_kvm_pio(!in, port, size, count);
 
 	vcpu->arch.pio.port = port;
-	vcpu->arch.pio.in = 1;
+	vcpu->arch.pio.in = in;
 	vcpu->arch.pio.count  = count;
 	vcpu->arch.pio.size = size;
 
 	if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
-	data_avail:
-		memcpy(val, vcpu->arch.pio_data, size * count);
 		vcpu->arch.pio.count = 0;
 		return 1;
 	}
 
 	vcpu->run->exit_reason = KVM_EXIT_IO;
-	vcpu->run->io.direction = KVM_EXIT_IO_IN;
+	vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
 	vcpu->run->io.size = size;
 	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
 	vcpu->run->io.count = count;
@@ -4354,36 +4346,37 @@ static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
 	return 0;
 }
 
-static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
-				     int size, unsigned short port,
-				     const void *val, unsigned int count)
+static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
+				    int size, unsigned short port, void *val,
+				    unsigned int count)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+	int ret;
 
-	trace_kvm_pio(1, port, size, count);
-
-	vcpu->arch.pio.port = port;
-	vcpu->arch.pio.in = 0;
-	vcpu->arch.pio.count = count;
-	vcpu->arch.pio.size = size;
-
-	memcpy(vcpu->arch.pio_data, val, size * count);
+	if (vcpu->arch.pio.count)
+		goto data_avail;
 
-	if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
+	ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
+	if (ret) {
+data_avail:
+		memcpy(val, vcpu->arch.pio_data, size * count);
 		vcpu->arch.pio.count = 0;
 		return 1;
 	}
 
-	vcpu->run->exit_reason = KVM_EXIT_IO;
-	vcpu->run->io.direction = KVM_EXIT_IO_OUT;
-	vcpu->run->io.size = size;
-	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
-	vcpu->run->io.count = count;
-	vcpu->run->io.port = port;
-
 	return 0;
 }
 
+static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
+				     int size, unsigned short port,
+				     const void *val, unsigned int count)
+{
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+	memcpy(vcpu->arch.pio_data, val, size * count);
+	return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
+}
+
 static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
 	return kvm_x86_ops->get_segment_base(vcpu, seg);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
  2011-07-26 11:25 ` [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write Xiao Guangrong
  2011-07-26 11:25 ` [PATCH 02/11] KVM: x86: cleanup pio/pout emulated Xiao Guangrong
@ 2011-07-26 11:26 ` Xiao Guangrong
  2011-07-26 12:27   ` Gleb Natapov
  2011-07-27  9:04   ` Avi Kivity
  2011-07-26 11:28 ` [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path Xiao Guangrong
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

We usually clear a page with repeat string instructions. For example,
memset on a page table uses 'stosb' repeated 1024 times, which means we
take the mmu lock and walk the shadow page cache 1024 times. That is
terrible for performance.

In fact, if a repeat string instruction is being emulated and the access is
not a PIO/MMIO access, we can zap all the corresponding shadow pages and
return to the guest; the mapping can then become writable and the guest
writes the page directly.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_emulate.h |    1 +
 arch/x86/include/asm/kvm_host.h    |    2 +-
 arch/x86/kvm/emulate.c             |    5 +++--
 arch/x86/kvm/mmu.c                 |    4 ++--
 arch/x86/kvm/paging_tmpl.h         |    3 ++-
 arch/x86/kvm/x86.c                 |    9 +++++++--
 6 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 6040d11..7e5c4d5 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -244,6 +244,7 @@ struct x86_emulate_ctxt {
 	bool guest_mode; /* guest running a nested guest */
 	bool perm_ok; /* do not check permissions if true */
 	bool only_vendor_specific_insn;
+	bool pio_mmio_emulate; /* it is the pio/mmio access if true */
 
 	bool have_exception;
 	struct x86_exception exception;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c00ec28..95f55a9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -752,7 +752,7 @@ int fx_init(struct kvm_vcpu *vcpu);
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes,
-		       bool guest_initiated);
+		       bool guest_initiated, bool repeat_write);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6f08bc9..36fbac6 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4009,8 +4009,9 @@ writeback:
 			 * Re-enter guest when pio read ahead buffer is empty
 			 * or, if it is not used, after each 1024 iteration.
 			 */
-			if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff) &&
-			    (r->end == 0 || r->end != r->pos)) {
+			if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff)
+			      && (r->end == 0 || r->end != r->pos) &&
+			      ctxt->pio_mmio_emulate) {
 				/*
 				 * Reset read cache. Usually happens before
 				 * decode, but since instruction is restarted
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 587695b..5193de8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3541,7 +3541,7 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes,
-		       bool guest_initiated)
+		       bool guest_initiated, bool repeat_write)
 {
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	union kvm_mmu_page_role mask = { .word = 0 };
@@ -3622,7 +3622,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		pte_size = sp->role.cr4_pae ? 8 : 4;
 		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
 		misaligned |= bytes < 4;
-		if (misaligned || flooded) {
+		if (misaligned || flooded || repeat_write) {
 			/*
 			 * Misaligned accesses are too much trouble to fix
 			 * up; also, they usually indicate a page is not used
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..02daac5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -710,7 +710,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 
 	if (mmu_topup_memory_caches(vcpu))
 		return;
-	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t), 0);
+	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t),
+			  false, false);
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5543f99..6fb9001 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4058,7 +4058,8 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
 	if (ret < 0)
 		return 0;
-	kvm_mmu_pte_write(vcpu, gpa, val, bytes, 1);
+	kvm_mmu_pte_write(vcpu, gpa, val, bytes, true,
+			  vcpu->arch.emulate_ctxt.rep_prefix);
 	return 1;
 }
 
@@ -4161,6 +4162,8 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 		return X86EMUL_CONTINUE;
 
 mmio:
+	vcpu->arch.emulate_ctxt.pio_mmio_emulate = true;
+
 	/*
 	 * Is this MMIO handled locally?
 	 */
@@ -4295,7 +4298,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (!exchanged)
 		return X86EMUL_CMPXCHG_FAILED;
 
-	kvm_mmu_pte_write(vcpu, gpa, new, bytes, 1);
+	kvm_mmu_pte_write(vcpu, gpa, new, bytes, true, false);
 
 	return X86EMUL_CONTINUE;
 
@@ -4326,6 +4329,7 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
 {
 	trace_kvm_pio(!in, port, size, count);
 
+	vcpu->arch.emulate_ctxt.pio_mmio_emulate = true;
 	vcpu->arch.pio.port = port;
 	vcpu->arch.pio.in = in;
 	vcpu->arch.pio.count  = count;
@@ -4817,6 +4821,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 		ctxt->interruptibility = 0;
 		ctxt->have_exception = false;
 		ctxt->perm_ok = false;
+		ctxt->pio_mmio_emulate = false;
 
 		ctxt->only_vendor_specific_insn
 			= emulation_type & EMULTYPE_TRAP_UD;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (2 preceding siblings ...)
  2011-07-26 11:26 ` [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions Xiao Guangrong
@ 2011-07-26 11:28 ` Xiao Guangrong
  2011-07-27  9:08   ` Avi Kivity
  2011-07-26 11:28 ` [PATCH 05/11] KVM: MMU: cleanup FNAME(invlpg) Xiao Guangrong
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

In the current code the accessed bit is always set when a page fault
occurs, so there is no need to set it on the pte write path.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    1 -
 arch/x86/kvm/mmu.c              |   22 +---------------------
 2 files changed, 1 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 95f55a9..ad88bfb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -355,7 +355,6 @@ struct kvm_vcpu_arch {
 	gfn_t last_pt_write_gfn;
 	int   last_pt_write_count;
 	u64  *last_pte_updated;
-	gfn_t last_pte_gfn;
 
 	struct fpu guest_fpu;
 	u64 xcr0;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5193de8..992a8a5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2196,11 +2196,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
 		return 0;
 
-	/*
-	 * We don't set the accessed bit, since we sometimes want to see
-	 * whether the guest actually used the pte (in order to detect
-	 * demand paging).
-	 */
 	spte = PT_PRESENT_MASK;
 	if (!speculative)
 		spte |= shadow_accessed_mask;
@@ -2351,10 +2346,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 	kvm_release_pfn_clean(pfn);
-	if (speculative) {
+	if (speculative)
 		vcpu->arch.last_pte_updated = sptep;
-		vcpu->arch.last_pte_gfn = gfn;
-	}
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3519,18 +3512,6 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
 	return !!(spte && (*spte & shadow_accessed_mask));
 }
 
-static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	if (spte
-	    && vcpu->arch.last_pte_gfn == gfn
-	    && shadow_accessed_mask
-	    && !(*spte & shadow_accessed_mask)
-	    && is_shadow_present_pte(*spte))
-		set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
-}
-
 static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -3604,7 +3585,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 	if (guest_initiated) {
-		kvm_mmu_access_page(vcpu, gfn);
 		if (gfn == vcpu->arch.last_pt_write_gfn
 		    && !last_updated_pte_accessed(vcpu)) {
 			++vcpu->arch.last_pt_write_count;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 05/11] KVM: MMU: cleanup FNAME(invlpg)
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (3 preceding siblings ...)
  2011-07-26 11:28 ` [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path Xiao Guangrong
@ 2011-07-26 11:28 ` Xiao Guangrong
  2011-07-26 11:29 ` [PATCH 06/11] KVM: MMU: fast prefetch spte on invlpg path Xiao Guangrong
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Directly use mmu_page_zap_pte to zap the spte in FNAME(invlpg), and remove
the code duplicated between FNAME(invlpg) and FNAME(sync_page).

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |   16 ++++++++++------
 arch/x86/kvm/paging_tmpl.h |   42 +++++++++++++++---------------------------
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 992a8a5..52ef6d7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1801,7 +1801,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	}
 }
 
-static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			     u64 *spte)
 {
 	u64 pte;
@@ -1809,17 +1809,21 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 
 	pte = *spte;
 	if (is_shadow_present_pte(pte)) {
-		if (is_last_spte(pte, sp->role.level))
+		if (is_last_spte(pte, sp->role.level)) {
 			drop_spte(kvm, spte);
-		else {
+			if (is_large_pte(pte))
+				--kvm->stat.lpages;
+		} else {
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, spte);
 		}
-	} else if (is_mmio_spte(pte))
+		return true;
+	}
+
+	if (is_mmio_spte(pte))
 		mmu_spte_clear_no_track(spte);
 
-	if (is_large_pte(pte))
-		--kvm->stat.lpages;
+	return false;
 }
 
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 02daac5..64210a0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -652,6 +652,16 @@ out_unlock:
 	return 0;
 }
 
+static gpa_t FNAME(get_first_pte_gpa)(struct kvm_mmu_page *sp)
+{
+	int offset = 0;
+
+	if (PTTYPE == 32)
+		offset = sp->role.quadrant << PT64_LEVEL_BITS;
+
+	return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+}
+
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -659,7 +669,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	gpa_t pte_gpa = -1;
 	int level;
 	u64 *sptep;
-	int need_flush = 0;
 
 	vcpu_clear_mmio_info(vcpu, gva);
 
@@ -671,36 +680,20 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 
 		sp = page_header(__pa(sptep));
 		if (is_last_spte(*sptep, level)) {
-			int offset, shift;
-
 			if (!sp->unsync)
 				break;
 
-			shift = PAGE_SHIFT -
-				  (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;
-			offset = sp->role.quadrant << shift;
-
-			pte_gpa = (sp->gfn << PAGE_SHIFT) + offset;
+			pte_gpa = FNAME(get_first_pte_gpa)(sp);
 			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
-			if (is_shadow_present_pte(*sptep)) {
-				if (is_large_pte(*sptep))
-					--vcpu->kvm->stat.lpages;
-				drop_spte(vcpu->kvm, sptep);
-				need_flush = 1;
-			} else if (is_mmio_spte(*sptep))
-				mmu_spte_clear_no_track(sptep);
-
-			break;
+			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
+				kvm_flush_remote_tlbs(vcpu->kvm);
 		}
 
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
 
-	if (need_flush)
-		kvm_flush_remote_tlbs(vcpu->kvm);
-
 	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
 
 	spin_unlock(&vcpu->kvm->mmu_lock);
@@ -766,19 +759,14 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
  */
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
-	int i, offset, nr_present;
+	int i, nr_present = 0;
 	bool host_writable;
 	gpa_t first_pte_gpa;
 
-	offset = nr_present = 0;
-
 	/* direct kvm_mmu_page can not be unsync. */
 	BUG_ON(sp->role.direct);
 
-	if (PTTYPE == 32)
-		offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-	first_pte_gpa = gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+	first_pte_gpa = FNAME(get_first_pte_gpa)(sp);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		unsigned pte_access;
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 06/11] KVM: MMU: fast prefetch spte on invlpg path
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (4 preceding siblings ...)
  2011-07-26 11:28 ` [PATCH 05/11] KVM: MMU: cleanup FNAME(invlpg) Xiao Guangrong
@ 2011-07-26 11:29 ` Xiao Guangrong
  2011-07-26 11:29 ` [PATCH 07/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages Xiao Guangrong
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Fast-prefetch the spte for unsync shadow pages on the invlpg path.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +---
 arch/x86/kvm/mmu.c              |   38 +++++++++++++++-----------------------
 arch/x86/kvm/paging_tmpl.h      |   23 ++++++++++-------------
 arch/x86/kvm/x86.c              |    4 ++--
 4 files changed, 28 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ad88bfb..ce17642 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -456,7 +456,6 @@ struct kvm_arch {
 	unsigned int n_requested_mmu_pages;
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
-	atomic_t invlpg_counter;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
@@ -750,8 +749,7 @@ int fx_init(struct kvm_vcpu *vcpu);
 
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-		       const u8 *new, int bytes,
-		       bool guest_initiated, bool repeat_write);
+		       const u8 *new, int bytes, bool repeat_write);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 52ef6d7..dfe8e6b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3525,8 +3525,7 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-		       const u8 *new, int bytes,
-		       bool guest_initiated, bool repeat_write)
+		       const u8 *new, int bytes, bool repeat_write)
 {
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	union kvm_mmu_page_role mask = { .word = 0 };
@@ -3535,7 +3534,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	unsigned pte_size, page_offset, misaligned, quadrant, offset;
-	int level, npte, invlpg_counter, r, flooded = 0;
+	int level, npte, r, flooded = 0;
 	bool remote_flush, local_flush, zap_page;
 
 	/*
@@ -3550,19 +3549,16 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 
-	invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
-
 	/*
 	 * Assume that the pte write on a page table of the same type
 	 * as the current vcpu paging mode since we update the sptes only
 	 * when they have the same mode.
 	 */
-	if ((is_pae(vcpu) && bytes == 4) || !new) {
+	if (is_pae(vcpu) && bytes == 4) {
 		/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
-		if (is_pae(vcpu)) {
-			gpa &= ~(gpa_t)7;
-			bytes = 8;
-		}
+		gpa &= ~(gpa_t)7;
+		bytes = 8;
+
 		r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
 		if (r)
 			gentry = 0;
@@ -3583,22 +3579,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	mmu_topup_memory_caches(vcpu);
 	spin_lock(&vcpu->kvm->mmu_lock);
-	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
-		gentry = 0;
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
-	if (guest_initiated) {
-		if (gfn == vcpu->arch.last_pt_write_gfn
-		    && !last_updated_pte_accessed(vcpu)) {
-			++vcpu->arch.last_pt_write_count;
-			if (vcpu->arch.last_pt_write_count >= 3)
-				flooded = 1;
-		} else {
-			vcpu->arch.last_pt_write_gfn = gfn;
-			vcpu->arch.last_pt_write_count = 1;
-			vcpu->arch.last_pte_updated = NULL;
-		}
+	if (gfn == vcpu->arch.last_pt_write_gfn
+	    && !last_updated_pte_accessed(vcpu)) {
+		++vcpu->arch.last_pt_write_count;
+		if (vcpu->arch.last_pt_write_count >= 3)
+			flooded = 1;
+	} else {
+		vcpu->arch.last_pt_write_gfn = gfn;
+		vcpu->arch.last_pt_write_count = 1;
+		vcpu->arch.last_pte_updated = NULL;
 	}
 
 	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 64210a0..3466229 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -666,20 +666,22 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	gpa_t pte_gpa = -1;
 	int level;
 	u64 *sptep;
 
 	vcpu_clear_mmio_info(vcpu, gva);
+	mmu_topup_memory_caches(vcpu);
 
 	spin_lock(&vcpu->kvm->mmu_lock);
-
 	for_each_shadow_entry(vcpu, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
 
 		sp = page_header(__pa(sptep));
 		if (is_last_spte(*sptep, level)) {
+			pt_element_t gpte;
+			gpa_t pte_gpa;
+
 			if (!sp->unsync)
 				break;
 
@@ -688,23 +690,18 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 
 			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
 				kvm_flush_remote_tlbs(vcpu->kvm);
+
+			if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
+						  sizeof(pt_element_t)))
+				break;
+
+			FNAME(update_pte)(vcpu, sp, sptep, &gpte);
 		}
 
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
-
-	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
-
 	spin_unlock(&vcpu->kvm->mmu_lock);
-
-	if (pte_gpa == -1)
-		return;
-
-	if (mmu_topup_memory_caches(vcpu))
-		return;
-	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t),
-			  false, false);
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6fb9001..c3c8bae 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4058,7 +4058,7 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
 	if (ret < 0)
 		return 0;
-	kvm_mmu_pte_write(vcpu, gpa, val, bytes, true,
+	kvm_mmu_pte_write(vcpu, gpa, val, bytes,
 			  vcpu->arch.emulate_ctxt.rep_prefix);
 	return 1;
 }
@@ -4298,7 +4298,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (!exchanged)
 		return X86EMUL_CMPXCHG_FAILED;
 
-	kvm_mmu_pte_write(vcpu, gpa, new, bytes, true, false);
+	kvm_mmu_pte_write(vcpu, gpa, new, bytes, false);
 
 	return X86EMUL_CONTINUE;
 
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 07/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (5 preceding siblings ...)
  2011-07-26 11:29 ` [PATCH 06/11] KVM: MMU: fast prefetch spte on invlpg path Xiao Guangrong
@ 2011-07-26 11:29 ` Xiao Guangrong
  2011-07-26 11:30 ` [PATCH 08/11] KVM: MMU: split kvm_mmu_pte_write function Xiao Guangrong
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

In kvm_mmu_pte_write, we do not need to allocate shadow pages, so calling
kvm_mmu_free_some_pages is unnecessary

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dfe8e6b..7629802 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3579,7 +3579,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	mmu_topup_memory_caches(vcpu);
 	spin_lock(&vcpu->kvm->mmu_lock);
-	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 	if (gfn == vcpu->arch.last_pt_write_gfn
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 08/11] KVM: MMU: split kvm_mmu_pte_write function
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (6 preceding siblings ...)
  2011-07-26 11:29 ` [PATCH 07/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages Xiao Guangrong
@ 2011-07-26 11:30 ` Xiao Guangrong
  2011-07-26 11:31 ` [PATCH 09/11] KVM: MMU: remove the mismatch shadow page Xiao Guangrong
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

kvm_mmu_pte_write is too long; split it into helpers for better readability

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |  180 ++++++++++++++++++++++++++++++++--------------------
 1 files changed, 111 insertions(+), 69 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7629802..931c23a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3524,48 +3524,29 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 	return mmu_memory_cache_free_object(cache);
 }
 
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-		       const u8 *new, int bytes, bool repeat_write)
+static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
+				    const u8 *new, int *bytes)
 {
-	gfn_t gfn = gpa >> PAGE_SHIFT;
-	union kvm_mmu_page_role mask = { .word = 0 };
-	struct kvm_mmu_page *sp;
-	struct hlist_node *node;
-	LIST_HEAD(invalid_list);
-	u64 entry, gentry, *spte;
-	unsigned pte_size, page_offset, misaligned, quadrant, offset;
-	int level, npte, r, flooded = 0;
-	bool remote_flush, local_flush, zap_page;
-
-	/*
-	 * If we don't have indirect shadow pages, it means no page is
-	 * write-protected, so we can exit simply.
-	 */
-	if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
-		return;
-
-	zap_page = remote_flush = local_flush = false;
-	offset = offset_in_page(gpa);
-
-	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
+	u64 gentry;
+	int r;
 
 	/*
 	 * Assume that the pte write on a page table of the same type
 	 * as the current vcpu paging mode since we update the sptes only
 	 * when they have the same mode.
 	 */
-	if (is_pae(vcpu) && bytes == 4) {
+	if (is_pae(vcpu) && *bytes == 4) {
 		/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
-		gpa &= ~(gpa_t)7;
-		bytes = 8;
+		*gpa &= ~(gpa_t)7;
+		*bytes = 8;
 
-		r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
+		r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, min(*bytes, 8));
 		if (r)
 			gentry = 0;
 		new = (const u8 *)&gentry;
 	}
 
-	switch (bytes) {
+	switch (*bytes) {
 	case 4:
 		gentry = *(const u32 *)new;
 		break;
@@ -3577,66 +3558,127 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		break;
 	}
 
-	mmu_topup_memory_caches(vcpu);
-	spin_lock(&vcpu->kvm->mmu_lock);
-	++vcpu->kvm->stat.mmu_pte_write;
-	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
+	return gentry;
+}
+
+/*
+ * If we're seeing too many writes to a page, it may no longer be a page table,
+ * or we may be forking, in which case it is better to unmap the page.
+ */
+static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	bool flooded = false;
+
 	if (gfn == vcpu->arch.last_pt_write_gfn
 	    && !last_updated_pte_accessed(vcpu)) {
 		++vcpu->arch.last_pt_write_count;
 		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = 1;
+			flooded = true;
 	} else {
 		vcpu->arch.last_pt_write_gfn = gfn;
 		vcpu->arch.last_pt_write_count = 1;
 		vcpu->arch.last_pte_updated = NULL;
 	}
 
+	return flooded;
+}
+
+/*
+ * Misaligned accesses are too much trouble to fix up; also, they usually
+ * indicate a page is not used as a page table.
+ */
+static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
+				    int bytes)
+{
+	unsigned offset, pte_size, misaligned;
+
+	pgprintk("misaligned: gpa %llx bytes %d role %x\n",
+		 gpa, bytes, sp->role.word);
+
+	offset = offset_in_page(gpa);
+	pte_size = sp->role.cr4_pae ? 8 : 4;
+	misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
+	misaligned |= bytes < 4;
+
+	return misaligned;
+}
+
+static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
+{
+	unsigned page_offset, quadrant;
+	u64 *spte;
+	int level;
+
+	page_offset = offset_in_page(gpa);
+	level = sp->role.level;
+	*nspte = 1;
+	if (!sp->role.cr4_pae) {
+		page_offset <<= 1;	/* 32->64 */
+		/*
+		 * A 32-bit pde maps 4MB while the shadow pdes map
+		 * only 2MB.  So we need to double the offset again
+		 * and zap two pdes instead of one.
+		 */
+		if (level == PT32_ROOT_LEVEL) {
+			page_offset &= ~7; /* kill rounding error */
+			page_offset <<= 1;
+			*nspte = 2;
+		}
+		quadrant = page_offset >> PAGE_SHIFT;
+		page_offset &= ~PAGE_MASK;
+		if (quadrant != sp->role.quadrant)
+			return NULL;
+	}
+
+	spte = &sp->spt[page_offset / sizeof(*spte)];
+	return spte;
+}
+
+void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+		       const u8 *new, int bytes, bool repeat_write)
+{
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	union kvm_mmu_page_role mask = { .word = 0 };
+	struct kvm_mmu_page *sp;
+	struct hlist_node *node;
+	LIST_HEAD(invalid_list);
+	u64 entry, gentry, *spte;
+	int npte;
+	bool remote_flush, local_flush, zap_page, flooded, misaligned;
+
+	/*
+	 * If we don't have indirect shadow pages, it means no page is
+	 * write-protected, so we can exit simply.
+	 */
+	if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+		return;
+
+	zap_page = remote_flush = local_flush = false;
+
+	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
+
+	gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, new, &bytes);
+	mmu_topup_memory_caches(vcpu);
+	spin_lock(&vcpu->kvm->mmu_lock);
+	++vcpu->kvm->stat.mmu_pte_write;
+	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
+
+	flooded = detect_write_flooding(vcpu, gfn);
 	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-		pte_size = sp->role.cr4_pae ? 8 : 4;
-		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
-		misaligned |= bytes < 4;
+		misaligned = detect_write_misaligned(sp, gpa, bytes);
 		if (misaligned || flooded || repeat_write) {
-			/*
-			 * Misaligned accesses are too much trouble to fix
-			 * up; also, they usually indicate a page is not used
-			 * as a page table.
-			 *
-			 * If we're seeing too many writes to a page,
-			 * it may no longer be a page table, or we may be
-			 * forking, in which case it is better to unmap the
-			 * page.
-			 */
-			pgprintk("misaligned: gpa %llx bytes %d role %x\n",
-				 gpa, bytes, sp->role.word);
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
 			continue;
 		}
-		page_offset = offset;
-		level = sp->role.level;
-		npte = 1;
-		if (!sp->role.cr4_pae) {
-			page_offset <<= 1;	/* 32->64 */
-			/*
-			 * A 32-bit pde maps 4MB while the shadow pdes map
-			 * only 2MB.  So we need to double the offset again
-			 * and zap two pdes instead of one.
-			 */
-			if (level == PT32_ROOT_LEVEL) {
-				page_offset &= ~7; /* kill rounding error */
-				page_offset <<= 1;
-				npte = 2;
-			}
-			quadrant = page_offset >> PAGE_SHIFT;
-			page_offset &= ~PAGE_MASK;
-			if (quadrant != sp->role.quadrant)
-				continue;
-		}
+
+		spte = get_written_sptes(sp, gpa, &npte);
+		if (!spte)
+			continue;
+
 		local_flush = true;
-		spte = &sp->spt[page_offset / sizeof(*spte)];
 		while (npte--) {
 			entry = *spte;
 			mmu_page_zap_pte(vcpu->kvm, sp, spte);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 09/11] KVM: MMU: remove the mismatch shadow page
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (7 preceding siblings ...)
  2011-07-26 11:30 ` [PATCH 08/11] KVM: MMU: split kvm_mmu_pte_write function Xiao Guangrong
@ 2011-07-26 11:31 ` Xiao Guangrong
  2011-07-27  9:11   ` Avi Kivity
  2011-07-26 11:31 ` [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed Xiao Guangrong
  2011-07-26 11:32 ` [PATCH 11/11] KVM: MMU: improve write flooding detected Xiao Guangrong
  10 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

If a shadow page's cpu mode differs from the current vcpu's, we had better
remove it, since the OS rarely changes cpu mode after boot

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   26 +++++++++++++++++++-------
 1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 931c23a..2328ee6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3603,6 +3603,18 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
 	return misaligned;
 }
 
+/*
+ * The OS hardly changes cpu mode after boot, we can zap the shadow page if
+ * it is mismatched with the current vcpu.
+ */
+static bool detect_mismatch_sp(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	union kvm_mmu_page_role mask;
+
+	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
+	return (sp->role.word ^ vcpu->arch.mmu.base_role.word) & mask.word;
+}
+
 static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
 {
 	unsigned page_offset, quadrant;
@@ -3638,13 +3650,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes, bool repeat_write)
 {
 	gfn_t gfn = gpa >> PAGE_SHIFT;
-	union kvm_mmu_page_role mask = { .word = 0 };
 	struct kvm_mmu_page *sp;
 	struct hlist_node *node;
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	int npte;
-	bool remote_flush, local_flush, zap_page, flooded, misaligned;
+	bool remote_flush, local_flush, zap_page, flooded;
 
 	/*
 	 * If we don't have indirect shadow pages, it means no page is
@@ -3664,10 +3675,13 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
 	flooded = detect_write_flooding(vcpu, gfn);
-	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
+		bool mismatch, misaligned;
+
 		misaligned = detect_write_misaligned(sp, gpa, bytes);
-		if (misaligned || flooded || repeat_write) {
+		mismatch = detect_mismatch_sp(vcpu, sp);
+
+		if (misaligned || mismatch || flooded || repeat_write) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
@@ -3682,9 +3696,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		while (npte--) {
 			entry = *spte;
 			mmu_page_zap_pte(vcpu->kvm, sp, spte);
-			if (gentry &&
-			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
-			      & mask.word) && get_free_pte_list_desc_nr(vcpu))
+			if (gentry && get_free_pte_list_desc_nr(vcpu))
 				mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
 			if (!remote_flush && need_remote_flush(entry, *spte))
 				remote_flush = true;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (8 preceding siblings ...)
  2011-07-26 11:31 ` [PATCH 09/11] KVM: MMU: remove the mismatch shadow page Xiao Guangrong
@ 2011-07-26 11:31 ` Xiao Guangrong
  2011-07-27  9:15   ` Avi Kivity
  2011-07-26 11:32 ` [PATCH 11/11] KVM: MMU: improve write flooding detected Xiao Guangrong
  10 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Sometimes we only modify the last byte of a pte to update a status bit; for
example, clear_bit uses the 'andb' instruction to clear the r/w bit in the
Linux kernel. In this case, kvm_mmu_pte_write treats the write as a
misaligned access and zaps the shadow page

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2328ee6..bb55b15 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3597,6 +3597,14 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
 
 	offset = offset_in_page(gpa);
 	pte_size = sp->role.cr4_pae ? 8 : 4;
+
+	/*
+	 * Sometimes, the OS only writes the last byte of a pte to update status
+	 * bits; for example, in Linux, the andb instruction is used in clear_bit().
+	 */
+	if (sp->role.level == 1 && !(offset & (pte_size - 1)) && bytes == 1)
+		return false;
+
 	misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
 	misaligned |= bytes < 4;
 
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
                   ` (9 preceding siblings ...)
  2011-07-26 11:31 ` [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed Xiao Guangrong
@ 2011-07-26 11:32 ` Xiao Guangrong
  2011-07-27  9:23   ` Avi Kivity
  10 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-26 11:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Detecting write-flooding does not work well: when we handle a page write, if
the last speculative spte is not accessed, we treat the page as
write-flooded. However, speculative sptes are created on many paths, such as
pte prefetch and page sync, so the last speculative spte may not point to
the written page, and the written page may still be accessed via other
sptes. Depending on the Accessed bit of the last speculative spte is
therefore not enough

Instead of detecting whether the page is accessed, we can detect whether the
spte is accessed: if the spte is not accessed but is written frequently, we
treat the page as no longer a page table, or as unused for a long time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    6 +---
 arch/x86/kvm/mmu.c              |   53 +++++++++------------------------------
 arch/x86/kvm/paging_tmpl.h      |    9 +-----
 3 files changed, 16 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ce17642..8c938db 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@ struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
+	int write_flooding_count;
+
 	struct rcu_head rcu;
 };
 
@@ -352,10 +354,6 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-	gfn_t last_pt_write_gfn;
-	int   last_pt_write_count;
-	u64  *last_pte_updated;
-
 	struct fpu guest_fpu;
 	u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bb55b15..8c2885c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1688,6 +1688,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		} else if (sp->unsync)
 			kvm_mmu_mark_parents_unsync(sp);
 
+		sp->write_flooding_count = 0;
 		trace_kvm_mmu_get_page(sp, false);
 		return sp;
 	}
@@ -1840,15 +1841,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 	mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-	int i;
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *parent_pte;
@@ -1908,7 +1900,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	}
 
 	sp->role.invalid = 1;
-	kvm_mmu_reset_last_pte_updated(kvm);
 	return ret;
 }
 
@@ -2350,8 +2341,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 	kvm_release_pfn_clean(pfn);
-	if (speculative)
-		vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3509,13 +3498,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
 		kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -3565,22 +3547,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-	bool flooded = false;
-
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = true;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
+	if (spte && !(*spte & shadow_accessed_mask))
+		sp->write_flooding_count++;
+	else
+		sp->write_flooding_count = 0;
 
-	return flooded;
+	return sp->write_flooding_count >= 3;
 }
 
 /*
@@ -3663,7 +3637,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	int npte;
-	bool remote_flush, local_flush, zap_page, flooded;
+	bool remote_flush, local_flush, zap_page;
 
 	/*
 	 * If we don't have indirect shadow pages, it means no page is
@@ -3682,21 +3656,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
-	flooded = detect_write_flooding(vcpu, gfn);
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-		bool mismatch, misaligned;
-
-		misaligned = detect_write_misaligned(sp, gpa, bytes);
-		mismatch = detect_mismatch_sp(vcpu, sp);
+		spte = get_written_sptes(sp, gpa, &npte);
 
-		if (misaligned || mismatch || flooded || repeat_write) {
+		if (repeat_write || detect_write_misaligned(sp, gpa, bytes) ||
+		      detect_write_flooding(sp, spte) ||
+		      detect_mismatch_sp(vcpu, sp)) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
 			continue;
 		}
 
-		spte = get_written_sptes(sp, gpa, &npte);
 		if (!spte)
 			continue;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3466229..82063b2 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -595,11 +595,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	 */
 	if (!r) {
 		pgprintk("%s: guest page fault\n", __func__);
-		if (!prefault) {
+		if (!prefault)
 			inject_page_fault(vcpu, &walker.fault);
-			/* reset fork detector */
-			vcpu->arch.last_pt_write_count = 0;
-		}
+
 		return 0;
 	}
 
@@ -637,9 +635,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
 		 sptep, *sptep, emulate);
 
-	if (!emulate)
-		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
-
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-26 11:26 ` [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions Xiao Guangrong
@ 2011-07-26 12:27   ` Gleb Natapov
  2011-07-26 13:53     ` Avi Kivity
  2011-07-27  1:47     ` Xiao Guangrong
  2011-07-27  9:04   ` Avi Kivity
  1 sibling, 2 replies; 53+ messages in thread
From: Gleb Natapov @ 2011-07-26 12:27 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
> We usually use repeat string instructions to clear the page, for example,
By "we" do you mean Linux guest?

> we call memset to clear a page table, stosb is used in this function, and 
> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
> and walking shadow page cache for 1024 times, it is terrible
> 
> In fact, if it is the repeat string instructions emulated and it is not a
> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
> guest, then the mapping can became writable and directly write the page
> 
So this patch does two independent things as far as I can see. First it
stops reentering the guest if a rep instruction is done on memory, and second
it drops shadow pages if an access to a shadowed page table is a rep. Why not
separate those into different patches?  BTW, not entering the guest periodically
increases interrupt latencies. Why not zap the shadow, make the page writable
and reenter the guest instead of emulating? It should be much faster (do we
care to optimize for old cpus by complicating the code anyway?).

> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/kvm_emulate.h |    1 +
>  arch/x86/include/asm/kvm_host.h    |    2 +-
>  arch/x86/kvm/emulate.c             |    5 +++--
>  arch/x86/kvm/mmu.c                 |    4 ++--
>  arch/x86/kvm/paging_tmpl.h         |    3 ++-
>  arch/x86/kvm/x86.c                 |    9 +++++++--
>  6 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
> index 6040d11..7e5c4d5 100644
> --- a/arch/x86/include/asm/kvm_emulate.h
> +++ b/arch/x86/include/asm/kvm_emulate.h
> @@ -244,6 +244,7 @@ struct x86_emulate_ctxt {
>  	bool guest_mode; /* guest running a nested guest */
>  	bool perm_ok; /* do not check permissions if true */
>  	bool only_vendor_specific_insn;
> +	bool pio_mmio_emulate; /* it is the pio/mmio access if true */
>  
>  	bool have_exception;
>  	struct x86_exception exception;
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c00ec28..95f55a9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -752,7 +752,7 @@ int fx_init(struct kvm_vcpu *vcpu);
>  void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  		       const u8 *new, int bytes,
> -		       bool guest_initiated);
> +		       bool guest_initiated, bool repeat_write);
>  int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
>  void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
>  int kvm_mmu_load(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 6f08bc9..36fbac6 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4009,8 +4009,9 @@ writeback:
>  			 * Re-enter guest when pio read ahead buffer is empty
>  			 * or, if it is not used, after each 1024 iteration.
>  			 */
> -			if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff) &&
> -			    (r->end == 0 || r->end != r->pos)) {
> +			if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff)
> +			      && (r->end == 0 || r->end != r->pos) &&
> +			      ctxt->pio_mmio_emulate) {
>  				/*
>  				 * Reset read cache. Usually happens before
>  				 * decode, but since instruction is restarted
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 587695b..5193de8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3541,7 +3541,7 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
>  
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  		       const u8 *new, int bytes,
> -		       bool guest_initiated)
> +		       bool guest_initiated, bool repeat_write)
>  {
>  	gfn_t gfn = gpa >> PAGE_SHIFT;
>  	union kvm_mmu_page_role mask = { .word = 0 };
> @@ -3622,7 +3622,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  		pte_size = sp->role.cr4_pae ? 8 : 4;
>  		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
>  		misaligned |= bytes < 4;
> -		if (misaligned || flooded) {
> +		if (misaligned || flooded || repeat_write) {
>  			/*
>  			 * Misaligned accesses are too much trouble to fix
>  			 * up; also, they usually indicate a page is not used
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 507e2b8..02daac5 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -710,7 +710,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  
>  	if (mmu_topup_memory_caches(vcpu))
>  		return;
> -	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t), 0);
> +	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t),
> +			  false, false);
>  }
>  
>  static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5543f99..6fb9001 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4058,7 +4058,8 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
>  	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
>  	if (ret < 0)
>  		return 0;
> -	kvm_mmu_pte_write(vcpu, gpa, val, bytes, 1);
> +	kvm_mmu_pte_write(vcpu, gpa, val, bytes, true,
> +			  vcpu->arch.emulate_ctxt.rep_prefix);
>  	return 1;
>  }
>  
> @@ -4161,6 +4162,8 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
>  		return X86EMUL_CONTINUE;
>  
>  mmio:
> +	vcpu->arch.emulate_ctxt.pio_mmio_emulate = true;
> +
>  	/*
>  	 * Is this MMIO handled locally?
>  	 */
> @@ -4295,7 +4298,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
>  	if (!exchanged)
>  		return X86EMUL_CMPXCHG_FAILED;
>  
> -	kvm_mmu_pte_write(vcpu, gpa, new, bytes, 1);
> +	kvm_mmu_pte_write(vcpu, gpa, new, bytes, true, false);
>  
>  	return X86EMUL_CONTINUE;
>  
> @@ -4326,6 +4329,7 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
>  {
>  	trace_kvm_pio(!in, port, size, count);
>  
> +	vcpu->arch.emulate_ctxt.pio_mmio_emulate = true;
>  	vcpu->arch.pio.port = port;
>  	vcpu->arch.pio.in = in;
>  	vcpu->arch.pio.count  = count;
> @@ -4817,6 +4821,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>  		ctxt->interruptibility = 0;
>  		ctxt->have_exception = false;
>  		ctxt->perm_ok = false;
> +		ctxt->pio_mmio_emulate = false;
>  
>  		ctxt->only_vendor_specific_insn
>  			= emulation_type & EMULTYPE_TRAP_UD;
> -- 
> 1.7.5.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-26 12:27   ` Gleb Natapov
@ 2011-07-26 13:53     ` Avi Kivity
  2011-07-27  1:47     ` Xiao Guangrong
  1 sibling, 0 replies; 53+ messages in thread
From: Avi Kivity @ 2011-07-26 13:53 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 07/26/2011 03:27 PM, Gleb Natapov wrote:
> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
> >  We usually use repeat string instructions to clear the page, for example,
> By "we" do you mean Linux guest?
>
> >  we call memset to clear a page table, stosb is used in this function, and
> >  repeated for 1024 times, that means we should occupy mmu lock for 1024 times
> >  and walking shadow page cache for 1024 times, it is terrible
> >
> >  In fact, if it is the repeat string instructions emulated and it is not a
> >  IO/MMIO access, we can zap all the corresponding shadow pages and return to the
> >  guest, then the mapping can became writable and directly write the page
> >
> So this patch does two independent things as far as I can see. First it
> stops reentering guest if rep instruction is done on memory and second
> it drops shadow pages if access to shadowed page table is rep. Why not
> separate those in different patches?  BTW not entering guest periodically
> increases interrupt latencies. Why not zap shadow, make page writable
> and reenter the guest instead of emulation, it should be much faster (do we
> care to optimize for old cpus by complicating the code anyway?).
>

The second thing is mentioned on the TODO list in a more general way: 
tag instructions that are typically used to modify the page tables, and 
drop shadow if any other instruction is used.  Since MOVS is typically 
not used to update pagetables, it would not be tagged.

The list would include, I'd guess, and, or, bts, btc, mov, xchg, 
cmpxchg, and cmpxchg8b.  Anything else?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-26 12:27   ` Gleb Natapov
  2011-07-26 13:53     ` Avi Kivity
@ 2011-07-27  1:47     ` Xiao Guangrong
  2011-07-27  4:26       ` Gleb Natapov
  1 sibling, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27  1:47 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 07/26/2011 08:27 PM, Gleb Natapov wrote:
> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
>> We usually use repeat string instructions to clear the page, for example,
> By "we" do you mean Linux guest?
> 

I do not know other guests besides Linux, but generally a rep instruction is
not used to update a page table that is currently in use.

>> we call memset to clear a page table, stosb is used in this function, and 
>> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
>> and walking shadow page cache for 1024 times, it is terrible
>>
>> In fact, if it is the repeat string instructions emulated and it is not a
>> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
>> guest, then the mapping can became writable and directly write the page
>>


> So this patch does two independent things as far as I can see. First it
> stops reentering guest if rep instruction is done on memory and second

No.
On the contrary, it enters the guest as soon as possible if a rep instruction
is done on memory ;-)
After this patch, we only need to emulate once for a rep instruction.

> it drops shadow pages if access to shadowed page table is rep. Why not
> separate those in different patches?  

Umm, I will zap the shadow page first, and stop emulating the rep instruction
in a second patch.

> BTW not entering guest periodically
> increases interrupt latencies.

On the contrary, it reduces latencies. :-)

> Why not zap shadow, make page writable
> and reenter the guest instead of emulation, it should be much faster (do we
> care to optimize for old cpus by complicating the code anyway?).
> 

We had better lazily update the mapping to writable; it can be done by a later access.
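The cost argument can be made concrete with a toy model (standalone userspace C, purely illustrative; the function names are hypothetical and no KVM API is used): per-iteration emulation takes the mmu lock once per store, while zapping the shadow page and reentering takes it once in total.

```c
#include <assert.h>

/* Toy model: count how many times the mmu lock must be taken to clear a
 * write-protected (shadowed) page with a repeated string store. */
static unsigned long lock_acquisitions_emulate_each(unsigned long iterations)
{
	unsigned long locks = 0, i;

	for (i = 0; i < iterations; i++)
		locks++;	/* kvm_mmu_pte_write takes mmu_lock per store */
	return locks;
}

static unsigned long lock_acquisitions_zap_once(unsigned long iterations)
{
	(void)iterations;	/* remaining stores hit a writable mapping */
	return 1;		/* one zap of the shadow page under mmu_lock */
}
```

With the 1024 iterations cited in the thread, the difference is 1024 lock round-trips versus one.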

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-27  1:47     ` Xiao Guangrong
@ 2011-07-27  4:26       ` Gleb Natapov
  2011-07-27  6:32         ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Gleb Natapov @ 2011-07-27  4:26 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On Wed, Jul 27, 2011 at 09:47:52AM +0800, Xiao Guangrong wrote:
> On 07/26/2011 08:27 PM, Gleb Natapov wrote:
> > On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
> >> We usually use repeat string instructions to clear the page, for example,
> > By "we" do you mean Linux guest?
> > 
> 
> I do not know other guests except linux, but, generally rep instruction is
> not used to update a page table which is been using.
> 
> >> we call memset to clear a page table, stosb is used in this function, and 
> >> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
> >> and walking shadow page cache for 1024 times, it is terrible
> >>
> >> In fact, if it is the repeat string instructions emulated and it is not a
> >> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
> >> guest, then the mapping can became writable and directly write the page
> >>
> 
> 
> > So this patch does two independent things as far as I can see. First it
> > stops reentering guest if rep instruction is done on memory and second
> 
> No.
> Oppositely, it enters guest as soon as possible if rep instruction is done
> on memory ;-)
Oops. Indeed. I read it the other way around. So why not just return
X86EMUL_UNHANDLEABLE from emulator_write_emulated_onepage() which should
have the same effect?

> After this patch, we only need to emulate one time for a rep instruction.
> 
> > it drops shadow pages if access to shadowed page table is rep. Why not
> > separate those in different patches?  
> 
> Umm, i will zap shadow page firstly and stop emulation rep instruction in
> the second patch.
> 
> > BTW not entering guest periodically
> > increases interrupt latencies.
> 
> Oppositely, It reduces the latencies. :-)
> 
> > Why not zap shadow, make page writable
> > and reenter the guest instead of emulation, it should be much faster (do we
> > care to optimize for old cpus by complicating the code anyway?).
> > 
> 
> We do better lazily update the mapping to writable, it can be done by the later access.

--
			Gleb.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-27  4:26       ` Gleb Natapov
@ 2011-07-27  6:32         ` Xiao Guangrong
  2011-07-27  7:51           ` Gleb Natapov
  0 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27  6:32 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 07/27/2011 12:26 PM, Gleb Natapov wrote:
> On Wed, Jul 27, 2011 at 09:47:52AM +0800, Xiao Guangrong wrote:
>> On 07/26/2011 08:27 PM, Gleb Natapov wrote:
>>> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
>>>> We usually use repeat string instructions to clear the page, for example,
>>> By "we" do you mean Linux guest?
>>>
>>
>> I do not know other guests except linux, but, generally rep instruction is
>> not used to update a page table which is been using.
>>
>>>> we call memset to clear a page table, stosb is used in this function, and 
>>>> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
>>>> and walking shadow page cache for 1024 times, it is terrible
>>>>
>>>> In fact, if it is the repeat string instructions emulated and it is not a
>>>> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
>>>> guest, then the mapping can became writable and directly write the page
>>>>
>>
>>
>>> So this patch does two independent things as far as I can see. First it
>>> stops reentering guest if rep instruction is done on memory and second
>>
>> No.
>> Oppositely, it enters guest as soon as possible if rep instruction is done
>> on memory ;-)
> Oops. Indeed. I read it other way around. So why not just return
> X86EMUL_UNHANDLEABLE from emulator_write_emulated_onepage() which should
> have the same effect?
> 

It seems not: the count register (RCX) is not decreased, and redundant work
needs to be done when handling EMULATION_FAILED.
So, emulator_write_emulated_onepage() is not a good place, I think. :-)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-27  6:32         ` Xiao Guangrong
@ 2011-07-27  7:51           ` Gleb Natapov
  2011-07-27  9:36             ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Gleb Natapov @ 2011-07-27  7:51 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On Wed, Jul 27, 2011 at 02:32:43PM +0800, Xiao Guangrong wrote:
> On 07/27/2011 12:26 PM, Gleb Natapov wrote:
> > On Wed, Jul 27, 2011 at 09:47:52AM +0800, Xiao Guangrong wrote:
> >> On 07/26/2011 08:27 PM, Gleb Natapov wrote:
> >>> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
> >>>> We usually use repeat string instructions to clear the page, for example,
> >>> By "we" do you mean Linux guest?
> >>>
> >>
> >> I do not know other guests except linux, but, generally rep instruction is
> >> not used to update a page table which is been using.
> >>
> >>>> we call memset to clear a page table, stosb is used in this function, and 
> >>>> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
> >>>> and walking shadow page cache for 1024 times, it is terrible
> >>>>
> >>>> In fact, if it is the repeat string instructions emulated and it is not a
> >>>> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
> >>>> guest, then the mapping can became writable and directly write the page
> >>>>
> >>
> >>
> >>> So this patch does two independent things as far as I can see. First it
> >>> stops reentering guest if rep instruction is done on memory and second
> >>
> >> No.
> >> Oppositely, it enters guest as soon as possible if rep instruction is done
> >> on memory ;-)
> > Oops. Indeed. I read it other way around. So why not just return
> > X86EMUL_UNHANDLEABLE from emulator_write_emulated_onepage() which should
> > have the same effect?
> > 
> 
> It seams not, the count register(RCX) is not decreased, and redundant work
> need to be done by handling EMULATION_FAILED.
The only difference is that with your approach one rep iteration is emulated and
then control goes back to the guest. With EMULATION_FAILED, kvm returns to the
guest immediately, so RCX shouldn't be decreased. There shouldn't be a big
difference performance-wise, and if there is, it is likely on the EMULATION_FAILED
side. Last but not least, emulate.c knows nothing about the hack.

> So, emulator_write_emulated_onepage() is not a good place i think. :-)
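The RCX behaviour under discussion can be sketched in a standalone userspace model (hypothetical names; only the architectural rep-prefix semantics are assumed): emulating one iteration decrements the count register before reentering the guest, while failing emulation leaves guest state untouched, so the guest simply re-executes the same rep instruction.

```c
#include <assert.h>

struct toy_vcpu {
	unsigned long rcx;	/* rep iteration count register */
};

/* Emulating a single iteration of a rep string store: the count register
 * is decremented before control returns to the guest. */
static void emulate_one_rep_iteration(struct toy_vcpu *v)
{
	if (v->rcx)
		v->rcx--;
}

/* Model of returning X86EMUL_UNHANDLEABLE instead: no architectural state
 * is modified, and the guest restarts the very same instruction. */
static void fail_emulation(struct toy_vcpu *v)
{
	(void)v;
}
```

Either way no state is lost; the two approaches differ only in whether one iteration completes in the emulator before reentry.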

--
			Gleb.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write
  2011-07-26 11:25 ` [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write Xiao Guangrong
@ 2011-07-27  9:00   ` Avi Kivity
  2011-07-27  9:37     ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:00 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/26/2011 02:25 PM, Xiao Guangrong wrote:
> kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
> function when spte is prefetched, unfortunately, we can not know how many
> spte need to be prefetched on this path, that means we can use out of the
> free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
> path does not fill the cache, such as INS instruction emulated that does not
> trigger page fault
>
>
>   void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>   		       const u8 *new, int bytes,
>   		       bool guest_initiated)
> @@ -3583,6 +3596,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>   		break;
>   	}
>
> +	mmu_topup_memory_caches(vcpu);

Please add a comment here describing why it's okay to ignore the error 
return.

>   	spin_lock(&vcpu->kvm->mmu_lock);
>   	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
>   		gentry = 0;
> @@ -3653,7 +3667,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>   			mmu_page_zap_pte(vcpu->kvm, sp, spte);
>   			if (gentry &&
>   			!((sp->role.word ^ vcpu->arch.mmu.base_role.word)
> -			& mask.word))
> +			& mask.word) && get_free_pte_list_desc_nr(vcpu))
>   				mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);

Wow, this bug was here since 2.6.23.  Good catch.

Please wrap or rename get_free_pte_list_desc_nr() in rmap_can_add(vcpu) 
so it's clearer why we're doing this.
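The suggested wrapper might look like the sketch below (standalone userspace model; struct kvm_mmu_memory_cache is reduced to its object counter, and rmap_can_add is the name proposed in the review):

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced model of struct kvm_mmu_memory_cache: only the number of
 * pre-allocated objects matters here. */
struct toy_memory_cache {
	int nobjs;
};

/* rmap_can_add(): true when at least one pte_list_desc is left in the
 * per-vcpu cache, so adding one more rmap entry for a prefetched spte
 * cannot exhaust the cache and trigger the BUG_ON(). */
static bool rmap_can_add(struct toy_memory_cache *cache)
{
	return cache->nobjs > 0;
}
```

The name makes the call site read as a capability check rather than a raw count query.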


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-26 11:26 ` [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions Xiao Guangrong
  2011-07-26 12:27   ` Gleb Natapov
@ 2011-07-27  9:04   ` Avi Kivity
  2011-07-27  9:37     ` Xiao Guangrong
  1 sibling, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:04 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/26/2011 02:26 PM, Xiao Guangrong wrote:
> We usually use repeat string instructions to clear the page, for example,
> we call memset to clear a page table, stosb is used in this function, and
> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
> and walking shadow page cache for 1024 times, it is terrible
>
> In fact, if it is the repeat string instructions emulated and it is not a
> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
> guest, then the mapping can became writable and directly write the page

Please generalize this to fail emulation on all non-page-table 
instructions when emulating due to a write protected page that we can 
unprotect.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path
  2011-07-26 11:28 ` [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path Xiao Guangrong
@ 2011-07-27  9:08   ` Avi Kivity
  2011-07-27 10:04     ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:08 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/26/2011 02:28 PM, Xiao Guangrong wrote:
> In current code, the accessed bit is always set when page fault occurred,
> do not need to set it on pte write path
>

Is this true? a write with pte.w=pte.a=0 sets pte.a?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/11] KVM: MMU: remove the mismatch shadow page
  2011-07-26 11:31 ` [PATCH 09/11] KVM: MMU: remove the mismatch shadow page Xiao Guangrong
@ 2011-07-27  9:11   ` Avi Kivity
  2011-07-27  9:13     ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:11 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/26/2011 02:31 PM, Xiao Guangrong wrote:
> If the shadow page has different cpu mode with current vcpu, we do better
> remove them since the OS does not changes cpu mode frequently
>

This is true, but is it really needed?  Won't such pages be torn down 
very quickly anyway, and will never be reestablished?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/11] KVM: MMU: remove the mismatch shadow page
  2011-07-27  9:11   ` Avi Kivity
@ 2011-07-27  9:13     ` Avi Kivity
  2011-07-27 10:05       ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:13 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 12:11 PM, Avi Kivity wrote:
> On 07/26/2011 02:31 PM, Xiao Guangrong wrote:
>> If the shadow page has different cpu mode with current vcpu, we do 
>> better
>> remove them since the OS does not changes cpu mode frequently
>>
>
> This is true

It's actually not true.  With nested virtualization the cpu changes 
modes all the time.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed
  2011-07-26 11:31 ` [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed Xiao Guangrong
@ 2011-07-27  9:15   ` Avi Kivity
  2011-07-27 10:10     ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:15 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/26/2011 02:31 PM, Xiao Guangrong wrote:
> Sometimes, we only modify the last one byte of a pte to update status bit,
> for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
> instruction is used in this function, in this case, kvm_mmu_pte_write will
> treat it as misaligned access, and the shadow page table is zapped
>
> @@ -3597,6 +3597,14 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
>
>   	offset = offset_in_page(gpa);
>   	pte_size = sp->role.cr4_pae ? 8 : 4;
> +
> +	/*
> +	 * Sometimes, the OS only writes the last one bytes to update status
> +	 * bits, for example, in linux, andb instruction is used in clear_bit().
> +	 */
> +	if (sp->role.level == 1 && !(offset & (pte_size - 1)) && bytes == 1)
> +		return false;
> +

Could be true for level > 1, no?
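For illustration, the check can be modeled in isolation (standalone userspace sketch; this is a simplified version of the alignment test, not the exact kernel logic, and the level restriction questioned above is left out):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of detect_write_misaligned(): a write is treated as
 * misaligned when it does not start on a pte boundary or does not cover
 * a whole pte -- except for a single aligned byte, which is how
 * clear_bit()'s 'andb' updates status bits in a pte. */
static bool toy_write_misaligned(unsigned offset, unsigned pte_size,
				 unsigned bytes)
{
	bool aligned = !(offset & (pte_size - 1));

	if (aligned && bytes == 1)	/* status-bit update, e.g. andb */
		return false;
	return !aligned || bytes != pte_size;
}
```

A single aligned byte write is tolerated, while partial or unaligned writes still count as misaligned and lead to zapping.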

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-07-26 11:32 ` [PATCH 11/11] KVM: MMU: improve write flooding detected Xiao Guangrong
@ 2011-07-27  9:23   ` Avi Kivity
  2011-07-27 10:20     ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27  9:23 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/26/2011 02:32 PM, Xiao Guangrong wrote:
> Detecting write-flooding does not work well, when we handle page written, if
> the last speculative spte is not accessed, we treat the page is
> write-flooding, however, we can speculative spte on many path, such as pte
> prefetch, page synced, that means the last speculative spte may be not point
> to the written page and the written page can be accessed via other sptes, so
> depends on the Accessed bit of the last speculative spte is not enough
>
> Instead of detected page accessed, we can detect whether the spte is accessed
> or not, if the spte is not accessed but it is written frequently, we treat is
> not a page table or it not used for a long time
>
>   static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_mmu_memory_cache *cache;
> @@ -3565,22 +3547,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
>    * If we're seeing too many writes to a page, it may no longer be a page table,
>    * or we may be forking, in which case it is better to unmap the page.
>    */
> -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
> +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
>   {
> -	bool flooded = false;
> -
> -	if (gfn == vcpu->arch.last_pt_write_gfn
> -	&& !last_updated_pte_accessed(vcpu)) {
> -		++vcpu->arch.last_pt_write_count;
> -		if (vcpu->arch.last_pt_write_count >= 3)
> -			flooded = true;
> -	} else {
> -		vcpu->arch.last_pt_write_gfn = gfn;
> -		vcpu->arch.last_pt_write_count = 1;
> -		vcpu->arch.last_pte_updated = NULL;
> -	}
> +	if (spte && !(*spte & shadow_accessed_mask))
> +		sp->write_flooding_count++;
> +	else
> +		sp->write_flooding_count = 0;
>
> -	return flooded;
> +	return sp->write_flooding_count >= 3;
>   }

I think this is a little dangerous.  A guest kernel may be instantiating 
multiple gptes on a page fault, but guest userspace hits only one of 
them (the one which caused the page fault) - I think Windows does this, 
but I'm not sure.

Maybe we should inspect parent_ptes instead?
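The reworked counter from the quoted hunk can be exercised standalone (userspace sketch; the spte and its Accessed bit are reduced to booleans):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_sp {
	int write_flooding_count;
};

/* Model of the reworked detect_write_flooding(): the counter grows only
 * across consecutive writes through sptes whose Accessed bit is clear;
 * an accessed spte (or a write with no spte at all) resets it. */
static bool toy_detect_write_flooding(struct toy_sp *sp, bool have_spte,
				      bool spte_accessed)
{
	if (have_spte && !spte_accessed)
		sp->write_flooding_count++;
	else
		sp->write_flooding_count = 0;

	return sp->write_flooding_count >= 3;
}
```

Three consecutive unaccessed writes trip the detector; a single accessed write resets the count, which is exactly the behaviour Avi's userspace scenario would exercise.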

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-27  7:51           ` Gleb Natapov
@ 2011-07-27  9:36             ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27  9:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 07/27/2011 03:51 PM, Gleb Natapov wrote:
> On Wed, Jul 27, 2011 at 02:32:43PM +0800, Xiao Guangrong wrote:
>> On 07/27/2011 12:26 PM, Gleb Natapov wrote:
>>> On Wed, Jul 27, 2011 at 09:47:52AM +0800, Xiao Guangrong wrote:
>>>> On 07/26/2011 08:27 PM, Gleb Natapov wrote:
>>>>> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
>>>>>> We usually use repeat string instructions to clear the page, for example,
>>>>> By "we" do you mean Linux guest?
>>>>>
>>>>
>>>> I do not know other guests except linux, but, generally rep instruction is
>>>> not used to update a page table which is been using.
>>>>
>>>>>> we call memset to clear a page table, stosb is used in this function, and 
>>>>>> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
>>>>>> and walking shadow page cache for 1024 times, it is terrible
>>>>>>
>>>>>> In fact, if it is the repeat string instructions emulated and it is not a
>>>>>> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
>>>>>> guest, then the mapping can became writable and directly write the page
>>>>>>
>>>>
>>>>
>>>>> So this patch does two independent things as far as I can see. First it
>>>>> stops reentering guest if rep instruction is done on memory and second
>>>>
>>>> No.
>>>> Oppositely, it enters guest as soon as possible if rep instruction is done
>>>> on memory ;-)
>>> Oops. Indeed. I read it other way around. So why not just return
>>> X86EMUL_UNHANDLEABLE from emulator_write_emulated_onepage() which should
>>> have the same effect?
>>>
>>
>> It seams not, the count register(RCX) is not decreased, and redundant work
>> need to be done by handling EMULATION_FAILED.
> The only difference is that with your approach one rep is emulated and then
> control goes back to a guest. With EMULATION_FAILED kvm returns to a guest
> immediately, so RCX shouldn't be decreased. There shouldn't a be big difference
> performance wise and if there is it is likely on EMULATION_FAILED side.
> Last but not least emulate.c knows nothing about the hack.
> 

Umm, it is reasonable; I'll update this patch. Thanks, Gleb!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write
  2011-07-27  9:00   ` Avi Kivity
@ 2011-07-27  9:37     ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27  9:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 05:00 PM, Avi Kivity wrote:
> On 07/26/2011 02:25 PM, Xiao Guangrong wrote:
>> kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
>> function when spte is prefetched, unfortunately, we can not know how many
>> spte need to be prefetched on this path, that means we can use out of the
>> free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
>> path does not fill the cache, such as INS instruction emulated that does not
>> trigger page fault
>>
>>
>>   void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>>                  const u8 *new, int bytes,
>>                  bool guest_initiated)
>> @@ -3583,6 +3596,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>>           break;
>>       }
>>
>> +    mmu_topup_memory_caches(vcpu);
> 
> Please add a comment here describing why it's okay to ignore the error return.
> 

OK

>>       spin_lock(&vcpu->kvm->mmu_lock);
>>       if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
>>           gentry = 0;
>> @@ -3653,7 +3667,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>>               mmu_page_zap_pte(vcpu->kvm, sp, spte);
>>               if (gentry &&
>>               !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
>> -            & mask.word))
>> +            & mask.word) && get_free_pte_list_desc_nr(vcpu))
>>                   mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
> 
> Wow, this bug was here since 2.6.23.  Good catch.
> 
> Please wrap or rename get_free_pte_list_desc_nr() in rmap_can_add(vcpu) so it's clearer why we're doing this.
> 

OK, will do, thanks!


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions
  2011-07-27  9:04   ` Avi Kivity
@ 2011-07-27  9:37     ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27  9:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 05:04 PM, Avi Kivity wrote:
> On 07/26/2011 02:26 PM, Xiao Guangrong wrote:
>> We usually use repeat string instructions to clear the page, for example,
>> we call memset to clear a page table, stosb is used in this function, and
>> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
>> and walking shadow page cache for 1024 times, it is terrible
>>
>> In fact, if it is the repeat string instructions emulated and it is not a
>> IO/MMIO access, we can zap all the corresponding shadow pages and return to the
>> guest, then the mapping can became writable and directly write the page
> 
> Please generalize this to fail emulation on all non-page-table instructions when emulating due to a write protected page that we can unprotect.
> 

OK, will do it in the next version, thanks!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path
  2011-07-27  9:08   ` Avi Kivity
@ 2011-07-27 10:04     ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27 10:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 05:08 PM, Avi Kivity wrote:
> On 07/26/2011 02:28 PM, Xiao Guangrong wrote:
>> In current code, the accessed bit is always set when page fault occurred,
>> do not need to set it on pte write path
>>
> 
> Is this true? a write with pte.w=pte.a=0 sets pte.a?
> 

Generally, we call set_spte() with speculative = false, so the Accessed bit
is set on the page fault path, but there are two cases:
- if pte.d = 1, everything is ok
- if pte.d = 0, the write can change pte.d to 1, and the access permission of
  the shadow page can change from read-only to writable. So we will find a
  new sp to establish the mapping. In this case, the Accessed bit is not set
  in the original spte.

However, we use the Accessed bit to detect write flooding, so the first write
can set the dirty bit, and a later write cannot cause the sp to be changed.
I think it is not too bad.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/11] KVM: MMU: remove the mismatch shadow page
  2011-07-27  9:13     ` Avi Kivity
@ 2011-07-27 10:05       ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27 10:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 05:13 PM, Avi Kivity wrote:
> On 07/27/2011 12:11 PM, Avi Kivity wrote:
>> On 07/26/2011 02:31 PM, Xiao Guangrong wrote:
>>> If the shadow page has different cpu mode with current vcpu, we do better
>>> remove them since the OS does not changes cpu mode frequently
>>>
>>
>> This is true
> 
> It's actually not true.  With nested virtualization the cpu changes modes all the time.
> 

Oh, yes, I forgot this, thanks for pointing it out!


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed
  2011-07-27  9:15   ` Avi Kivity
@ 2011-07-27 10:10     ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27 10:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 05:15 PM, Avi Kivity wrote:
> On 07/26/2011 02:31 PM, Xiao Guangrong wrote:
>> Sometimes, we only modify the last one byte of a pte to update status bit,
>> for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
>> instruction is used in this function, in this case, kvm_mmu_pte_write will
>> treat it as misaligned access, and the shadow page table is zapped
>>
>> @@ -3597,6 +3597,14 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
>>
>>       offset = offset_in_page(gpa);
>>       pte_size = sp->role.cr4_pae ? 8 : 4;
>> +
>> +    /*
>> +     * Sometimes, the OS only writes the last one bytes to update status
>> +     * bits, for example, in linux, andb instruction is used in clear_bit().
>> +     */
>> +    if (sp->role.level == 1 && !(offset & (pte_size - 1)) && bytes == 1)
>> +        return false;
>> +
> 
> Could be true for level > 1, no?
> 

Originally, I thought a one-byte instruction is usually used to update the last pte,
but we had better remove this restriction. Will fix it in the next version, thanks!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-07-27  9:23   ` Avi Kivity
@ 2011-07-27 10:20     ` Xiao Guangrong
  2011-07-27 11:08       ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-27 10:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 05:23 PM, Avi Kivity wrote:
> On 07/26/2011 02:32 PM, Xiao Guangrong wrote:
>> Detecting write-flooding does not work well, when we handle page written, if
>> the last speculative spte is not accessed, we treat the page is
>> write-flooding, however, we can speculative spte on many path, such as pte
>> prefetch, page synced, that means the last speculative spte may be not point
>> to the written page and the written page can be accessed via other sptes, so
>> depends on the Accessed bit of the last speculative spte is not enough
>>
>> Instead of detected page accessed, we can detect whether the spte is accessed
>> or not, if the spte is not accessed but it is written frequently, we treat is
>> not a page table or it not used for a long time
>>
>>   static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm_mmu_memory_cache *cache;
>> @@ -3565,22 +3547,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
>>    * If we're seeing too many writes to a page, it may no longer be a page table,
>>    * or we may be forking, in which case it is better to unmap the page.
>>    */
>> -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
>>   {
>> -    bool flooded = false;
>> -
>> -    if (gfn == vcpu->arch.last_pt_write_gfn
>> -    && !last_updated_pte_accessed(vcpu)) {
>> -        ++vcpu->arch.last_pt_write_count;
>> -        if (vcpu->arch.last_pt_write_count >= 3)
>> -            flooded = true;
>> -    } else {
>> -        vcpu->arch.last_pt_write_gfn = gfn;
>> -        vcpu->arch.last_pt_write_count = 1;
>> -        vcpu->arch.last_pte_updated = NULL;
>> -    }
>> +    if (spte && !(*spte & shadow_accessed_mask))
>> +        sp->write_flooding_count++;
>> +    else
>> +        sp->write_flooding_count = 0;
>>
>> -    return flooded;
>> +    return sp->write_flooding_count >= 3;
>>   }
> 
> I think this is a little dangerous.  A guest kernel may be instantiating multiple gptes on a page fault, but guest userspace hits only one of them (the one which caused the page fault) - I think Windows does this, but I'm not sure.
> 

I think this case is not bad: if the guest kernel needs to write multiple gptes (>= 3),
it will cause many page faults; we had better zap the shadow page and let it become
writable as soon as possible.
(Also, we have pte prefetch, which can quickly establish the mappings for a new shadow page)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-07-27 10:20     ` Xiao Guangrong
@ 2011-07-27 11:08       ` Avi Kivity
  2011-07-28  2:43         ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-07-27 11:08 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 01:20 PM, Xiao Guangrong wrote:
> >>    }
> >
> >  I think this is a little dangerous.  A guest kernel may be instantiating multiple gptes on a page fault, but guest userspace hits only one of them (the one which caused the page fault) - I think Windows does this, but I'm not sure.
> >
>
> I think this case is not bad: if the guest kernel need to write multiple gptes (>=3),
> it will cause many page fault, we do better zap the shadow page and let it become writable as
> soon as possible.
> (And, we have pte-fetch, it can quickly establish the mapping for a new shadow page)

Actually, what should save us is unsync pages.  Why are we hitting this 
path at all?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-07-27 11:08       ` Avi Kivity
@ 2011-07-28  2:43         ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-07-28  2:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 07/27/2011 07:08 PM, Avi Kivity wrote:
> On 07/27/2011 01:20 PM, Xiao Guangrong wrote:
>> >>    }
>> >
>> >  I think this is a little dangerous.  A guest kernel may be instantiating multiple gptes on a page fault, but guest userspace hits only one of them (the one which caused the page fault) - I think Windows does this, but I'm not sure.
>> >
>>
>> I think this case is not bad: if the guest kernel need to write multiple gptes (>=3),
>> it will cause many page fault, we do better zap the shadow page and let it become writable as
>> soon as possible.
>> (And, we have pte-fetch, it can quickly establish the mapping for a new shadow page)
> 
> Actually, what should save us is unsync pages.  Why are we hitting this path at all?
> 

Avi,

The shadow page cannot become unsync if another sp exists with sp.gfn == gfn && sp.role.level != 1,
for example:
- the gfn is not only used for the last-level page structure (PTE page)
or
- the gfn was used for an upper-level page structure before, but we did not zap the old shadow pages

So, if this gfn is written, a #PF is generated; we hope that these sp can be zapped earlier,
so that a later #PF can detect that this gfn has no shadow pages and the mapping can become writable.
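The condition can be stated as a small predicate (standalone userspace sketch; a shadow page is reduced to its gfn and role.level, with level 1 denoting the last-level PTE page):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_shadow_page {
	unsigned long gfn;
	int level;	/* 1 == last-level (PTE) page */
};

/* A gfn's shadow pages may go unsync only when no shadow page maps that
 * gfn as an upper-level (level != 1) page structure -- otherwise writes
 * to the gfn must keep faulting so the upper-level sp can be zapped. */
static bool toy_can_unsync(const struct toy_shadow_page *pages, size_t n,
			   unsigned long gfn)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pages[i].gfn == gfn && pages[i].level != 1)
			return false;
	return true;
}
```

This is why zapping stale upper-level shadow pages early lets the later #PF take the unsync path instead of repeated write emulation.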

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-16  6:40 [PATCH 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write Xiao Guangrong
@ 2011-08-16  6:46 ` Xiao Guangrong
  2011-08-23  8:00   ` Marcelo Tosatti
  0 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-16  6:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Detecting write-flooding does not work well currently: when we handle a page write, if
the last speculative spte is not accessed, we treat the page as write-flooded.
However, we speculate sptes on many paths, such as pte prefetch and page sync, so
the last speculative spte may not point to the written page, and the written page
may still be accessed via other sptes; depending on the Accessed bit of the last
speculative spte alone is therefore not enough.

Instead of detecting whether the page is accessed, we can detect whether the written
spte itself is accessed: if the spte is not accessed but is written frequently, we
treat the page as no longer being a page table, or as not having been used for a long time.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    6 +---
 arch/x86/kvm/mmu.c              |   48 +++++++++------------------------------
 arch/x86/kvm/paging_tmpl.h      |    9 +-----
 3 files changed, 15 insertions(+), 48 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 927ba73..9d17238 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@ struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
+	int write_flooding_count;
+
 	struct rcu_head rcu;
 };
 
@@ -353,10 +355,6 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-	gfn_t last_pt_write_gfn;
-	int   last_pt_write_count;
-	u64  *last_pte_updated;
-
 	struct fpu guest_fpu;
 	u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index adaa160..3230f84 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1695,6 +1695,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		} else if (sp->unsync)
 			kvm_mmu_mark_parents_unsync(sp);
 
+		sp->write_flooding_count = 0;
 		trace_kvm_mmu_get_page(sp, false);
 		return sp;
 	}
@@ -1847,15 +1848,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 	mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-	int i;
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *parent_pte;
@@ -1915,7 +1907,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	}
 
 	sp->role.invalid = 1;
-	kvm_mmu_reset_last_pte_updated(kvm);
 	return ret;
 }
 
@@ -2360,8 +2351,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 	kvm_release_pfn_clean(pfn);
-	if (speculative)
-		vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3522,13 +3511,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
 		kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
 				    const u8 *new, int *bytes)
 {
@@ -3569,22 +3551,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-	bool flooded = false;
-
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = true;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
+	if (spte && !(*spte & shadow_accessed_mask))
+		sp->write_flooding_count++;
+	else
+		sp->write_flooding_count = 0;
 
-	return flooded;
+	return sp->write_flooding_count >= 3;
 }
 
 /*
@@ -3656,7 +3630,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	int npte;
-	bool remote_flush, local_flush, zap_page, flooded, misaligned;
+	bool remote_flush, local_flush, zap_page;
 
 	/*
 	 * If we don't have indirect shadow pages, it means no page is
@@ -3675,12 +3649,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
-	flooded = detect_write_flooding(vcpu, gfn);
 	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-		misaligned = detect_write_misaligned(sp, gpa, bytes);
+		spte = get_written_sptes(sp, gpa, &npte);
 
-		if (misaligned || flooded) {
+		if (detect_write_misaligned(sp, gpa, bytes) ||
+		      detect_write_flooding(sp, spte)) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index bdc2241..ec5c1b4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -599,11 +599,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	 */
 	if (!r) {
 		pgprintk("%s: guest page fault\n", __func__);
-		if (!prefault) {
+		if (!prefault)
 			inject_page_fault(vcpu, &walker.fault);
-			/* reset fork detector */
-			vcpu->arch.last_pt_write_count = 0;
-		}
+
 		return 0;
 	}
 
@@ -641,9 +639,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
 		 sptep, *sptep, emulate);
 
-	if (!emulate)
-		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
-
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.7.5.4


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-16  6:46 ` [PATCH 11/11] KVM: MMU: improve write flooding detected Xiao Guangrong
@ 2011-08-23  8:00   ` Marcelo Tosatti
  2011-08-23 10:55     ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-23  8:00 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Tue, Aug 16, 2011 at 02:46:47PM +0800, Xiao Guangrong wrote:
> Detecting write-flooding does not work well, when we handle page written, if
> the last speculative spte is not accessed, we treat the page is
> write-flooding, however, we can speculative spte on many path, such as pte
> prefetch, page synced, that means the last speculative spte may be not point
> to the written page and the written page can be accessed via other sptes, so
> depends on the Accessed bit of the last speculative spte is not enough

Yes, a stale last_speculative_spte is possible, but is this fact a
noticeable problem in practice?

Was this detected by code inspection?

> Instead of detected page accessed, we can detect whether the spte is accessed
> or not, if the spte is not accessed but it is written frequently, we treat is
> not a page table or it not used for a long time
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    6 +---
>  arch/x86/kvm/mmu.c              |   48 +++++++++------------------------------
>  arch/x86/kvm/paging_tmpl.h      |    9 +-----
>  3 files changed, 15 insertions(+), 48 deletions(-)
>
 
> -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
> +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
>  {
> -	bool flooded = false;
> -
> -	if (gfn == vcpu->arch.last_pt_write_gfn
> -	    && !last_updated_pte_accessed(vcpu)) {
> -		++vcpu->arch.last_pt_write_count;
> -		if (vcpu->arch.last_pt_write_count >= 3)
> -			flooded = true;
> -	} else {
> -		vcpu->arch.last_pt_write_gfn = gfn;
> -		vcpu->arch.last_pt_write_count = 1;
> -		vcpu->arch.last_pte_updated = NULL;
> -	}
> +	if (spte && !(*spte & shadow_accessed_mask))
> +		sp->write_flooding_count++;
> +	else
> +		sp->write_flooding_count = 0;

This relies on the sptes being created by speculative means
or by pressure on the host clearing the accessed bit for the
shadow page to be zapped. 

There is no guarantee that either of these is true for a given
spte.

And if the sptes do not have accessed bit set, any nonconsecutive 3 pte
updates will zap the page.

Back to the first question, what is the motivation for this heuristic
change? Do you have any numbers?

If its a significant problem, perhaps getting rid of the
'last_spte_accessed' part is enough.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23  8:00   ` Marcelo Tosatti
@ 2011-08-23 10:55     ` Xiao Guangrong
  2011-08-23 12:38       ` Marcelo Tosatti
  0 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-23 10:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

Hi Marcelo,

On 08/23/2011 04:00 PM, Marcelo Tosatti wrote:
> On Tue, Aug 16, 2011 at 02:46:47PM +0800, Xiao Guangrong wrote:
>> Detecting write-flooding does not work well, when we handle page written, if
>> the last speculative spte is not accessed, we treat the page is
>> write-flooding, however, we can speculative spte on many path, such as pte
>> prefetch, page synced, that means the last speculative spte may be not point
>> to the written page and the written page can be accessed via other sptes, so
>> depends on the Accessed bit of the last speculative spte is not enough
> 
> Yes, a stale last_speculative_spte is possible, but is this fact a
> noticeable problem in practice?
> 
> Was this detected by code inspection?
> 

I detected this because I noticed that some shadow pages were zapped by
write-flooding detection but were accessed again soon after, which caused the
shadow pages to be zapped and allocated again and again (very frequently).

Another reason is that, in the current code, write-flooding detection is a little
complex and its code is stuffed into many places; actually, write-flooding detection
is only needed for shadow paging / nested guests, so I want to simplify it and wrap
its code up.

>> -	}
>> +	if (spte && !(*spte & shadow_accessed_mask))
>> +		sp->write_flooding_count++;
>> +	else
>> +		sp->write_flooding_count = 0;
> 
> This relies on the sptes being created by speculative means
> or by pressure on the host clearing the accessed bit for the
> shadow page to be zapped. 
> 
> There is no guarantee that either of these is true for a given
> spte.
> 
> And if the sptes do not have accessed bit set, any nonconsecutive 3 pte
> updates will zap the page.
> 

Please note that we clear 'sp->write_flooding_count' when the sp is fetched from the
shadow page cache (in kvm_mmu_get_page); it means that if any spte of the sp
generates a #PF, the flooding count is reset.

And I think there is no problem here: if an spte without the Accessed bit set is
written frequently, it means the guest page table is accessed infrequently, or is
not accessed during the writes; in that case, zapping this shadow page is not bad.

Compared with the old way, the advantage is that it is better at zapping upper-level
shadow pages. For example, in the old way: if a gfn is used as a PDE for one task and
is later freed and reused as a PTE for a new task, we have two shadow pages in the
host, sp1 with level = 2 and sp2 with level = 1. When we detect write-flooding,
vcpu->last_pte_updated always points to a pte of sp2. Since sp2 is used by the new
task, we always conclude that both shadow pages are being used, but actually sp1 is
no longer used by the guest.

> Back to the first question, what is the motivation for this heuristic
> change? Do you have any numbers?
> 

Yes, I have done a quick test:

before this patch:
2m56.561
2m50.651
2m51.220
2m52.199
2m48.066

After this patch:
2m51.194
2m55.980
2m50.755
2m47.396
2m46.807

It shows the new way is a little better than the old way.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23 10:55     ` Xiao Guangrong
@ 2011-08-23 12:38       ` Marcelo Tosatti
  2011-08-23 16:32         ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-23 12:38 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Tue, Aug 23, 2011 at 06:55:39PM +0800, Xiao Guangrong wrote:
> Hi Marcelo,
> 
> On 08/23/2011 04:00 PM, Marcelo Tosatti wrote:
> > On Tue, Aug 16, 2011 at 02:46:47PM +0800, Xiao Guangrong wrote:
> >> Detecting write-flooding does not work well, when we handle page written, if
> >> the last speculative spte is not accessed, we treat the page is
> >> write-flooding, however, we can speculative spte on many path, such as pte
> >> prefetch, page synced, that means the last speculative spte may be not point
> >> to the written page and the written page can be accessed via other sptes, so
> >> depends on the Accessed bit of the last speculative spte is not enough
> > 
> > Yes, a stale last_speculative_spte is possible, but is this fact a
> > noticeable problem in practice?
> > 
> > Was this detected by code inspection?
> > 
> 
> I detected this because: i noticed some shadow page is zapped by
> write-flooding but it is accessed soon, it causes the shadow page zapped
> and alloced again and again(very frequently).
> 
> Another reason is that: in current code, write-flooding is little complex
> and it stuffs code in many places, actually, write-flooding is only needed for
> shadow page/nested guest, so i want to simplify it and wrap its code up.
> 
> >> -	}
> >> +	if (spte && !(*spte & shadow_accessed_mask))
> >> +		sp->write_flooding_count++;
> >> +	else
> >> +		sp->write_flooding_count = 0;
> > 
> > This relies on the sptes being created by speculative means
> > or by pressure on the host clearing the accessed bit for the
> > shadow page to be zapped. 
> > 
> > There is no guarantee that either of these is true for a given
> > spte.
> > 
> > And if the sptes do not have accessed bit set, any nonconsecutive 3 pte
> > updates will zap the page.
> > 
> 
> Please note we clear 'sp->write_flooding_count' when it is accessed from
> shadow page cache (in kvm_mmu_get_page), it means if any spte of sp generates
> #PF, the fooding count can be reset.

OK.

> And, i think there are not problems since: if the spte without accssed bit is
> written frequently, it means the guest page table is accessed infrequently or
> during the writing, the guest page table is not accessed, in this time, zapping
> this shadow page is not bad.

Think of the following scenario:

1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
is not increased.
3) repeat

So you cannot rely on the accessed bit being cleared to zap the shadow
page, because it might not be cleared in certain scenarios.

> Comparing the old way, the advantage of it is good for zapping upper shadow page,
> for example, in the old way:
> if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
> the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
> other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
> always point to sp2.pte. As sp2 is used for the new task, we always detected both
> shadow pages are bing used, but actually, sp1 is not used by guest anymore.

Makes sense.

> > Back to the first question, what is the motivation for this heuristic
> > change? Do you have any numbers?
> > 
> 
> Yes, i have done the quick test:
> 
> before this patch:
> 2m56.561
> 2m50.651
> 2m51.220
> 2m52.199
> 2m48.066
> 
> After this patch:
> 2m51.194
> 2m55.980
> 2m50.755
> 2m47.396
> 2m46.807
> 
> It shows the new way is little better than the old way.

What test is this?


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23 12:38       ` Marcelo Tosatti
@ 2011-08-23 16:32         ` Xiao Guangrong
  2011-08-23 19:09           ` Marcelo Tosatti
  0 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-23 16:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:

>> And, i think there are not problems since: if the spte without accssed bit is
>> written frequently, it means the guest page table is accessed infrequently or
>> during the writing, the guest page table is not accessed, in this time, zapping
>> this shadow page is not bad.
> 
> Think of the following scenario:
> 
> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> is not increased.
> 3) repeat
> 

I think the result is just what we hoped for: we do not want to zap the shadow page
because the spte is currently used by the guest, and it will also be used in the
next repetition. So not increasing 'write_flooding_count' is a good choice.

Let's consider what would happen if we did increase 'write_flooding_count':
1: after three repetitions, the shadow page is zapped
2: in scenario step 1), a new shadow page is allocated for the gpte at gfnA+indexA
3: in scenario step 2), the flooding count is increased again, so after 3 repetitions
   the shadow page can be zapped again; repeat 1 to 3.

The result is that the shadow page for gfnA is allocated and zapped again and again,
yes?

> So you cannot rely on the accessed bit being cleared to zap the shadow
> page, because it might not be cleared in certain scenarios.
> 
>> Comparing the old way, the advantage of it is good for zapping upper shadow page,
>> for example, in the old way:
>> if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
>> the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
>> other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
>> always point to sp2.pte. As sp2 is used for the new task, we always detected both
>> shadow pages are bing used, but actually, sp1 is not used by guest anymore.
> 
> Makes sense.
> 
>>> Back to the first question, what is the motivation for this heuristic
>>> change? Do you have any numbers?
>>>
>>
>> Yes, i have done the quick test:
>>
>> before this patch:
>> 2m56.561
>> 2m50.651
>> 2m51.220
>> 2m52.199
>> 2m48.066
>>
>> After this patch:
>> 2m51.194
>> 2m55.980
>> 2m50.755
>> 2m47.396
>> 2m46.807
>>
>> It shows the new way is little better than the old way.
> 
> What test is this?
> 

Sorry, I forgot to mention it: the test case is kernbench. :-)
 


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23 16:32         ` Xiao Guangrong
@ 2011-08-23 19:09           ` Marcelo Tosatti
  2011-08-23 20:16             ` Xiao Guangrong
  2011-08-25  7:57             ` Xiao Guangrong
  0 siblings, 2 replies; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-23 19:09 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> 
> >> And, i think there are not problems since: if the spte without accssed bit is
> >> written frequently, it means the guest page table is accessed infrequently or
> >> during the writing, the guest page table is not accessed, in this time, zapping
> >> this shadow page is not bad.
> > 
> > Think of the following scenario:
> > 
> > 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> > 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> > is not increased.
> > 3) repeat
> > 
> 
> I think the result is just we hoped, we do not want to zap the shadow page
> because the spte is currently used by the guest, it also will be used in the
> next repetition. So do not increase 'write_flooding_count' is a good choice.

It's not used: step 2) is a write to the write-protected shadow page at
gfnA.

> Let's consider what will happen if we increase 'write_flooding_count':
> 1: after three repetitions, zap the shadow page
> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
>    shadow page can be zapped again, repeat 1 to 3.

The shadow page will not be zapped because the spte created from
gfnA+indexA has the accessed bit set:

       if (spte && !(*spte & shadow_accessed_mask))
               sp->write_flooding_count++;
       else
               sp->write_flooding_count = 0;

> The result is the shadow page for gfnA is alloced and zapped again and again,
> yes?

The point is you cannot rely on the accessed bit of sptes that were instantiated
with the accessed bit set to decide whether or not to zap, because the accessed
bit will only be cleared under host memory pressure.

> > So you cannot rely on the accessed bit being cleared to zap the shadow
> > page, because it might not be cleared in certain scenarios.
> > 
> >> Comparing the old way, the advantage of it is good for zapping upper shadow page,
> >> for example, in the old way:
> >> if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
> >> the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
> >> other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
> >> always point to sp2.pte. As sp2 is used for the new task, we always detected both
> >> shadow pages are bing used, but actually, sp1 is not used by guest anymore.
> > 
> > Makes sense.
> > 
> >>> Back to the first question, what is the motivation for this heuristic
> >>> change? Do you have any numbers?
> >>>
> >>
> >> Yes, i have done the quick test:
> >>
> >> before this patch:
> >> 2m56.561
> >> 2m50.651
> >> 2m51.220
> >> 2m52.199
> >> 2m48.066
> >>
> >> After this patch:
> >> 2m51.194
> >> 2m55.980
> >> 2m50.755
> >> 2m47.396
> >> 2m46.807
> >>
> >> It shows the new way is little better than the old way.
> > 
> > What test is this?
> > 
> 
> Sorry, i forgot to mention it, the test case is kerbench. :-)
>  


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23 19:09           ` Marcelo Tosatti
@ 2011-08-23 20:16             ` Xiao Guangrong
  2011-08-24 20:05               ` Marcelo Tosatti
  2011-08-25  7:57             ` Xiao Guangrong
  1 sibling, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-23 20:16 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
>> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
>>
>>>> And, i think there are not problems since: if the spte without accssed bit is
>>>> written frequently, it means the guest page table is accessed infrequently or
>>>> during the writing, the guest page table is not accessed, in this time, zapping
>>>> this shadow page is not bad.
>>>
>>> Think of the following scenario:
>>>
>>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
>>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
>>> is not increased.
>>> 3) repeat
>>>
>>
>> I think the result is just we hoped, we do not want to zap the shadow page
>> because the spte is currently used by the guest, it also will be used in the
>> next repetition. So do not increase 'write_flooding_count' is a good choice.
> 
> Its not used. Step 2) is write to write protected shadow page at
> gfnA.
> 
>> Let's consider what will happen if we increase 'write_flooding_count':
>> 1: after three repetitions, zap the shadow page
>> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
>> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
>>    shadow page can be zapped again, repeat 1 to 3.
> 
> The shadow page will not be zapped because the spte created from
> gfnA+indexA has the accessed bit set:
> 
>        if (spte && !(*spte & shadow_accessed_mask))
>                sp->write_flooding_count++;
>        else
>                sp->write_flooding_count = 0;
> 

Ah, I see; I thought the "repeat" was on the same spte — my mistake.

Yes, in this case the sp is not zapped, but it is hard to know that the gfn
is no longer used as a gpte just from the writes; for example, the guest can
change the mapping address or the status bits, and so on. The sp can still be
zapped if the guest writes it again (at the same address); I think that is
acceptable. After all, this is just a speculative way to zap unused
shadow pages... your opinion?

>> The result is the shadow page for gfnA is alloced and zapped again and again,
>> yes?
> 
> The point is you cannot rely on the accessed bit of sptes that have been
> instantiated with the accessed bit set to decide whether or not to zap.
> Because the accessed bit will only be cleared on host memory pressure.
> 

Yes, the Accessed bit is a coarse way to track gpte accesses; however,
at least it can indicate whether the gfn was accessed over a period of
time in most cases, for example from when the spte was speculated to
when the gpte was written, or from when it was zapped to when it was written.
I think it is not too bad.

Do you have ideas to improve this?


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23 20:16             ` Xiao Guangrong
@ 2011-08-24 20:05               ` Marcelo Tosatti
  2011-08-25  2:04                 ` Marcelo Tosatti
  0 siblings, 1 reply; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-24 20:05 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Wed, Aug 24, 2011 at 04:16:52AM +0800, Xiao Guangrong wrote:
> On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> >>
> >>>> And, i think there are not problems since: if the spte without accssed bit is
> >>>> written frequently, it means the guest page table is accessed infrequently or
> >>>> during the writing, the guest page table is not accessed, in this time, zapping
> >>>> this shadow page is not bad.
> >>>
> >>> Think of the following scenario:
> >>>
> >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> >>> is not increased.
> >>> 3) repeat
> >>>
> >>
> >> I think the result is just we hoped, we do not want to zap the shadow page
> >> because the spte is currently used by the guest, it also will be used in the
> >> next repetition. So do not increase 'write_flooding_count' is a good choice.
> > 
> > Its not used. Step 2) is write to write protected shadow page at
> > gfnA.
> > 
> >> Let's consider what will happen if we increase 'write_flooding_count':
> >> 1: after three repetitions, zap the shadow page
> >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> >> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
> >>    shadow page can be zapped again, repeat 1 to 3.
> > 
> > The shadow page will not be zapped because the spte created from
> > gfnA+indexA has the accessed bit set:
> > 
> >        if (spte && !(*spte & shadow_accessed_mask))
> >                sp->write_flooding_count++;
> >        else
> >                sp->write_flooding_count = 0;
> > 
> 
> Ah, i see, i thought it was "repeat"ed on the same spte, it was my wrong.
> 
> Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
> is not used as gpte just depends on writing, for example, the guest can
> change the mapping address or the status bit, and so on...The sp can be
> zapped if the guest write it again(on the same address), i think it is
> acceptable, anymore, it is just the speculative way to zap the unused
> shadow page...your opinion?

It could increase the flood count independently of the accessed bit of
the spte being updated, zapping after 3 attempts as it is now.

But additionally reset the flood count if the gpte appears to be valid
(points to an existent gfn when the present bit is set, or if it is zeroed).

> >> The result is the shadow page for gfnA is alloced and zapped again and again,
> >> yes?
> > 
> > The point is you cannot rely on the accessed bit of sptes that have been
> > instantiated with the accessed bit set to decide whether or not to zap.
> > Because the accessed bit will only be cleared on host memory pressure.
> > 
> 
> Yes, accessed bit is the cursory way to track gpte accessed, however,
> at least, the accessed bit can indicate whether the gfn is accessed
> for a period of time in the most case, for example, from it is
> speculated to it is written, or from it is zapped to it is written,
> i thinks it is not too bad.
> 
> Do you have ideas to improve this?




* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-24 20:05               ` Marcelo Tosatti
@ 2011-08-25  2:04                 ` Marcelo Tosatti
  2011-08-25  4:42                   ` Avi Kivity
  2011-08-25  7:40                   ` Xiao Guangrong
  0 siblings, 2 replies; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-25  2:04 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Wed, Aug 24, 2011 at 05:05:40PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 04:16:52AM +0800, Xiao Guangrong wrote:
> > On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> > >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> > >>
> > >>>> And, i think there are not problems since: if the spte without accssed bit is
> > >>>> written frequently, it means the guest page table is accessed infrequently or
> > >>>> during the writing, the guest page table is not accessed, in this time, zapping
> > >>>> this shadow page is not bad.
> > >>>
> > >>> Think of the following scenario:
> > >>>
> > >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> > >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> > >>> is not increased.
> > >>> 3) repeat
> > >>>
> > >>
> > >> I think the result is just we hoped, we do not want to zap the shadow page
> > >> because the spte is currently used by the guest, it also will be used in the
> > >> next repetition. So do not increase 'write_flooding_count' is a good choice.
> > > 
> > > Its not used. Step 2) is write to write protected shadow page at
> > > gfnA.
> > > 
> > >> Let's consider what will happen if we increase 'write_flooding_count':
> > >> 1: after three repetitions, zap the shadow page
> > >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> > >> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
> > >>    shadow page can be zapped again, repeat 1 to 3.
> > > 
> > > The shadow page will not be zapped because the spte created from
> > > gfnA+indexA has the accessed bit set:
> > > 
> > >        if (spte && !(*spte & shadow_accessed_mask))
> > >                sp->write_flooding_count++;
> > >        else
> > >                sp->write_flooding_count = 0;
> > > 
> > 
> > Ah, i see, i thought it was "repeat"ed on the same spte, it was my wrong.
> > 
> > Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
> > is not used as gpte just depends on writing, for example, the guest can
> > change the mapping address or the status bit, and so on...The sp can be
> > zapped if the guest write it again(on the same address), i think it is
> > acceptable, anymore, it is just the speculative way to zap the unused
> > shadow page...your opinion?
> 
> It could increase the flood count independently of the accessed bit of
> the spte being updated, zapping after 3 attempts as it is now.
> 
> But additionally reset the flood count if the gpte appears to be valid
> (points to an existant gfn if the present bit is set, or if its zeroed).

Well, not zero, as that's a common pattern for non-ptes.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25  2:04                 ` Marcelo Tosatti
@ 2011-08-25  4:42                   ` Avi Kivity
  2011-08-25 13:21                     ` Marcelo Tosatti
  2011-08-25  7:40                   ` Xiao Guangrong
  1 sibling, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-08-25  4:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, LKML, KVM

On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >
> >  It could increase the flood count independently of the accessed bit of
> >  the spte being updated, zapping after 3 attempts as it is now.
> >
> >  But additionally reset the flood count if the gpte appears to be valid
> >  (points to an existant gfn if the present bit is set, or if its zeroed).
>
> Well not zero, as thats a common pattern for non ptes.
>

On 32-bit with 4GB RAM, practically anything is a valid gpte.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25  2:04                 ` Marcelo Tosatti
  2011-08-25  4:42                   ` Avi Kivity
@ 2011-08-25  7:40                   ` Xiao Guangrong
  1 sibling, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-25  7:40 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 08/25/2011 10:04 AM, Marcelo Tosatti wrote:

>>> Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
>>> is not used as gpte just depends on writing, for example, the guest can
>>> change the mapping address or the status bit, and so on...The sp can be
>>> zapped if the guest write it again(on the same address), i think it is
>>> acceptable, anymore, it is just the speculative way to zap the unused
>>> shadow page...your opinion?
>>
>> It could increase the flood count independently of the accessed bit of
>> the spte being updated, zapping after 3 attempts as it is now.
>>
>> But additionally reset the flood count if the gpte appears to be valid
>> (points to an existant gfn if the present bit is set, or if its zeroed).
> 
> Well not zero, as thats a common pattern for non ptes.
> 

Hi Marcelo,

Maybe that is not good, I think, for several reasons:
- checking that the gfn pointed to by the gpte is valid has high overhead:
  it needs to call gfn_to_hva to walk the memslots, and kvm_mmu_pte_write
  is called very frequently on shadow mmu.

- an MMIO gfn is not an existent gfn, but it can validly be pointed to by a gpte

- we can check the reserved bits in the gpte to see whether it is a valid
  gpte, but for some paging modes all bits are valid (for example, non-PAE mode)

- it can not work if the gfn has multiple shadow pages, for example:
  if the gfn was used as a PDE and is later used as a PTE, then we have two shadow
  pages: sp1.level = 2, sp2.level = 1, and sp1 can not be zapped even though it
  is not used anymore.

- sometimes, we need to zap the shadow page even though the gpte is written with
  a valid value: if the gpte is written frequently but accessed infrequently, we
  do better to zap the shadow page so that the gfn becomes writable (written
  directly without #PF) and map it again when it is accessed; one example, from
  Avi, is that the guest OS may update many gptes at one time after a single
  page fault.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-23 19:09           ` Marcelo Tosatti
  2011-08-23 20:16             ` Xiao Guangrong
@ 2011-08-25  7:57             ` Xiao Guangrong
  2011-08-25 13:47               ` Marcelo Tosatti
  1 sibling, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-25  7:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
>> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
>>
> >>>> And, i think there are not problems since: if the spte without accessed bit is
>>>> written frequently, it means the guest page table is accessed infrequently or
>>>> during the writing, the guest page table is not accessed, in this time, zapping
>>>> this shadow page is not bad.
>>>
>>> Think of the following scenario:
>>>
>>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
>>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
>>> is not increased.
>>> 3) repeat
>>>
>>
>> I think the result is just we hoped, we do not want to zap the shadow page
>> because the spte is currently used by the guest, it also will be used in the
>> next repetition. So do not increase 'write_flooding_count' is a good choice.
> 
> Its not used. Step 2) is write to write protected shadow page at
> gfnA.
> 
>> Let's consider what will happen if we increase 'write_flooding_count':
>> 1: after three repetitions, zap the shadow page
>> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
>> 3: in step 2, the flooding count is increased, so after 3 repetitions, the
>>    shadow page can be zapped again, repeat 1 to 3.
> 
> The shadow page will not be zapped because the spte created from
> gfnA+indexA has the accessed bit set:
> 
>        if (spte && !(*spte & shadow_accessed_mask))
>                sp->write_flooding_count++;
>        else
>                sp->write_flooding_count = 0;
> 

Marcelo, I am still confused by your example: in step 3), what is repeated?
Does it repeat step 2), or steps 1) and 2)?

Only step 2) is repeated, I guess, right? If so, it works well:
when the guest writes the gpte, the spte of the corresponding shadow page is
either zapped (level > 1) or speculatively fetched (level == 1), and the
accessed bit is cleared in both cases.

The later write can then detect that the accessed bit is not set, and
write_flooding_count is increased; finally, the shadow page is zapped and the
gpte is written directly.

>> The result is the shadow page for gfnA is alloced and zapped again and again,
>> yes?
> 
> The point is you cannot rely on the accessed bit of sptes that have been
> instantiated with the accessed bit set to decide whether or not to zap.
> Because the accessed bit will only be cleared on host memory pressure.
> 

But the accessed bit is also cleared after the spte is written.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25  4:42                   ` Avi Kivity
@ 2011-08-25 13:21                     ` Marcelo Tosatti
  2011-08-25 14:06                       ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-25 13:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, LKML, KVM

On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
> On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >>
> >>  It could increase the flood count independently of the accessed bit of
> >>  the spte being updated, zapping after 3 attempts as it is now.
> >>
> >>  But additionally reset the flood count if the gpte appears to be valid
> >>  (points to an existant gfn if the present bit is set, or if its zeroed).
> >
> >Well not zero, as thats a common pattern for non ptes.
> >
> 
> On 32-bit with 4GB RAM, practically anything is a valid gpte.

The following could be required to consider a gpte valid, for write-flood
detection purposes:

- Must be present.
- PageCacheDisable must be unset.
- PageWriteThrough must be unset.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25  7:57             ` Xiao Guangrong
@ 2011-08-25 13:47               ` Marcelo Tosatti
  2011-08-26  3:18                 ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-25 13:47 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Thu, Aug 25, 2011 at 03:57:22PM +0800, Xiao Guangrong wrote:
> On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> >>
> >>>> And, i think there are not problems since: if the spte without accessed bit is
> >>>> written frequently, it means the guest page table is accessed infrequently or
> >>>> during the writing, the guest page table is not accessed, in this time, zapping
> >>>> this shadow page is not bad.
> >>>
> >>> Think of the following scenario:
> >>>
> >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> >>> is not increased.
> >>> 3) repeat
> >>>
> >>
> >> I think the result is just we hoped, we do not want to zap the shadow page
> >> because the spte is currently used by the guest, it also will be used in the
> >> next repetition. So do not increase 'write_flooding_count' is a good choice.
> > 
> > Its not used. Step 2) is write to write protected shadow page at
> > gfnA.
> > 
> >> Let's consider what will happen if we increase 'write_flooding_count':
> >> 1: after three repetitions, zap the shadow page
> >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> >> 3: in step 2, the flooding count is increased, so after 3 repetitions, the
> >>    shadow page can be zapped again, repeat 1 to 3.
> > 
> > The shadow page will not be zapped because the spte created from
> > gfnA+indexA has the accessed bit set:
> > 
> >        if (spte && !(*spte & shadow_accessed_mask))
> >                sp->write_flooding_count++;
> >        else
> >                sp->write_flooding_count = 0;
> > 
> 
> Marcelo, i am still confused with your example, in step 3), what is repeated?
> it repeats step 2) or it repeats step 1) and 2)?
> 
> Only step 2) is repeated i guess, right? if it is yes, it works well:
> when the guest writes gpte, the spte of corresponding shadow page is zapped
> (level > 1) or it is speculatively fetched(level == 1), the accessed bit is
> cleared in both case.

Right.

> the later write can detect that the accessed bit is not set, and write_flooding_count
> is increased. finally, the shadow page is zapped, the gpte is written directly.
> 
> >> The result is the shadow page for gfnA is alloced and zapped again and again,
> >> yes?
> > 
> > The point is you cannot rely on the accessed bit of sptes that have been
> > instantiated with the accessed bit set to decide whether or not to zap.
> > Because the accessed bit will only be cleared on host memory pressure.
> > 
> 
> But the accessed bit is also cleared after spte is written.

Right, but only for one of the 512 sptes. Worst case, a shadow page that
has one spte with the accessed bit set at every third spte entry would
never be zapped during a linear write of the entire guest pagetable. The
current heuristic does not suffer from this issue.

I guess it is OK to be more trigger happy with zapping by ignoring
the accessed bit, clearing the flood counter on page fault.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25 13:21                     ` Marcelo Tosatti
@ 2011-08-25 14:06                       ` Avi Kivity
  2011-08-25 14:07                         ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2011-08-25 14:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, LKML, KVM

On 08/25/2011 04:21 PM, Marcelo Tosatti wrote:
> On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
> >  On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >  >>
> >  >>   It could increase the flood count independently of the accessed bit of
> >  >>   the spte being updated, zapping after 3 attempts as it is now.
> >  >>
> >  >>   But additionally reset the flood count if the gpte appears to be valid
> >  >>   (points to an existant gfn if the present bit is set, or if its zeroed).
> >  >
> >  >Well not zero, as thats a common pattern for non ptes.
> >  >
> >
> >  On 32-bit with 4GB RAM, practically anything is a valid gpte.
>
> The following could be required to consider a valid gpte, for write
> flood detection purposes:
>
> - Must be present.
> - PageCacheDisable must be unset.
> - PageWriteThrough must be unset.
>

Unless the guest is using PAT.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25 14:06                       ` Avi Kivity
@ 2011-08-25 14:07                         ` Avi Kivity
  0 siblings, 0 replies; 53+ messages in thread
From: Avi Kivity @ 2011-08-25 14:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, LKML, KVM

On 08/25/2011 05:06 PM, Avi Kivity wrote:
> On 08/25/2011 04:21 PM, Marcelo Tosatti wrote:
>> On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
>> >  On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
>> > >>
>> > >>   It could increase the flood count independently of the 
>> accessed bit of
>> > >>   the spte being updated, zapping after 3 attempts as it is now.
>> > >>
>> > >>   But additionally reset the flood count if the gpte appears to 
>> be valid
>> > >>   (points to an existant gfn if the present bit is set, or if 
>> its zeroed).
>> > >
>> > >Well not zero, as thats a common pattern for non ptes.
>> > >
>> >
>> >  On 32-bit with 4GB RAM, practically anything is a valid gpte.
>>
>> The following could be required to consider a valid gpte, for write
>> flood detection purposes:
>>
>> - Must be present.
>> - PageCacheDisable must be unset.
>> - PageWriteThrough must be unset.
>>
>
> Unless the guest is using PAT.
>

And not swapping.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-25 13:47               ` Marcelo Tosatti
@ 2011-08-26  3:18                 ` Xiao Guangrong
  2011-08-26 10:53                   ` Marcelo Tosatti
  0 siblings, 1 reply; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-26  3:18 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 08/25/2011 09:47 PM, Marcelo Tosatti wrote:

> I guess it is OK to be more trigger happy with zapping by ignoring
> the accessed bit, clearing the flood counter on page fault.
> 

Yeah, I like this approach; is this patch good for you?

================================
Subject: [PATCH 11/11] KVM: MMU: improve write flooding detected

Detecting write-flooding does not work well: when we handle a page write, if
the last speculative spte is not accessed, we treat the page as write-flooded.
However, we can create speculative sptes on many paths, such as pte prefetch
and page sync, which means the last speculative spte may not point to the
written page, and the written page may still be accessed via other sptes; so
depending on the Accessed bit of the last speculative spte is not enough.

Instead of detecting whether the page is accessed, we detect whether the spte
is accessed after it is written: if the spte is not accessed but is written
frequently, we treat the page as no longer a page table, or as not used for a
long time.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    6 +---
 arch/x86/kvm/mmu.c              |   57 +++++++++++++--------------------------
 arch/x86/kvm/paging_tmpl.h      |   12 +++-----
 3 files changed, 26 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 927ba73..9d17238 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@ struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
+	int write_flooding_count;
+
 	struct rcu_head rcu;
 };
 
@@ -353,10 +355,6 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-	gfn_t last_pt_write_gfn;
-	int   last_pt_write_count;
-	u64  *last_pte_updated;
-
 	struct fpu guest_fpu;
 	u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index adaa160..fd5b389 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1652,6 +1652,18 @@ static void init_shadow_page_table(struct kvm_mmu_page *sp)
 		sp->spt[i] = 0ull;
 }
 
+static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
+{
+	sp->write_flooding_count = 0;
+}
+
+static void clear_sp_write_flooding_count(u64 *spte)
+{
+	struct kvm_mmu_page *sp =  page_header(__pa(spte));
+
+	__clear_sp_write_flooding_count(sp);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -1695,6 +1707,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		} else if (sp->unsync)
 			kvm_mmu_mark_parents_unsync(sp);
 
+		__clear_sp_write_flooding_count(sp);
 		trace_kvm_mmu_get_page(sp, false);
 		return sp;
 	}
@@ -1847,15 +1860,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 	mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-	int i;
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *parent_pte;
@@ -1915,7 +1919,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	}
 
 	sp->role.invalid = 1;
-	kvm_mmu_reset_last_pte_updated(kvm);
 	return ret;
 }
 
@@ -2360,8 +2363,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 	kvm_release_pfn_clean(pfn);
-	if (speculative)
-		vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3522,13 +3523,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
 		kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
 				    const u8 *new, int *bytes)
 {
@@ -3569,22 +3563,9 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-	bool flooded = false;
-
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = true;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
-
-	return flooded;
+	return ++sp->write_flooding_count >= 3;
 }
 
 /*
@@ -3656,7 +3637,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	int npte;
-	bool remote_flush, local_flush, zap_page, flooded, misaligned;
+	bool remote_flush, local_flush, zap_page;
 
 	/*
 	 * If we don't have indirect shadow pages, it means no page is
@@ -3675,12 +3656,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
-	flooded = detect_write_flooding(vcpu, gfn);
 	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-		misaligned = detect_write_misaligned(sp, gpa, bytes);
+		spte = get_written_sptes(sp, gpa, &npte);
 
-		if (misaligned || flooded) {
+		if (detect_write_misaligned(sp, gpa, bytes) ||
+		      detect_write_flooding(sp, spte)) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3395ab2..379a795 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -497,6 +497,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	     shadow_walk_next(&it)) {
 		gfn_t table_gfn;
 
+		clear_sp_write_flooding_count(it.sptep);
 		drop_large_spte(vcpu, it.sptep);
 
 		sp = NULL;
@@ -522,6 +523,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	     shadow_walk_next(&it)) {
 		gfn_t direct_gfn;
 
+		clear_sp_write_flooding_count(it.sptep);
 		validate_direct_spte(vcpu, it.sptep, direct_access);
 
 		drop_large_spte(vcpu, it.sptep);
@@ -536,6 +538,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		link_shadow_page(it.sptep, sp);
 	}
 
+	clear_sp_write_flooding_count(it.sptep);
 	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
 		     user_fault, write_fault, emulate, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
@@ -599,11 +602,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	 */
 	if (!r) {
 		pgprintk("%s: guest page fault\n", __func__);
-		if (!prefault) {
+		if (!prefault)
 			inject_page_fault(vcpu, &walker.fault);
-			/* reset fork detector */
-			vcpu->arch.last_pt_write_count = 0;
-		}
+
 		return 0;
 	}
 
@@ -641,9 +642,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
 		 sptep, *sptep, emulate);
 
-	if (!emulate)
-		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
-
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.7.5.4




* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-26  3:18                 ` Xiao Guangrong
@ 2011-08-26 10:53                   ` Marcelo Tosatti
  2011-08-26 14:24                     ` Xiao Guangrong
  0 siblings, 1 reply; 53+ messages in thread
From: Marcelo Tosatti @ 2011-08-26 10:53 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Fri, Aug 26, 2011 at 11:18:01AM +0800, Xiao Guangrong wrote:
> On 08/25/2011 09:47 PM, Marcelo Tosatti wrote:
> 
> > I guess it is OK to be more trigger happy with zapping by ignoring
> > the accessed bit, clearing the flood counter on page fault.
> > 
> 
> Yeah, i like this way, is this patch good for you?

Looks fine, can you rerun kernbench?

> ================================
> Subject: [PATCH 11/11] KVM: MMU: improve write flooding detected
> 
> Detecting write-flooding does not work well, when we handle page written, if
> the last speculative spte is not accessed, we treat the page is
> write-flooding, however, we can speculative spte on many path, such as pte
> prefetch, page synced, that means the last speculative spte may be not point
> to the written page and the written page can be accessed via other sptes, so
> depends on the Accessed bit of the last speculative spte is not enough
> 
> Instead of detected page accessed, we can detect whether the spte is accessed
> after it is written, if the spte is not accessed but it is written frequently,
> we treat is not a page table or it not used for a long time
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>


* Re: [PATCH 11/11] KVM: MMU: improve write flooding detected
  2011-08-26 10:53                   ` Marcelo Tosatti
@ 2011-08-26 14:24                     ` Xiao Guangrong
  0 siblings, 0 replies; 53+ messages in thread
From: Xiao Guangrong @ 2011-08-26 14:24 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 08/26/2011 06:53 PM, Marcelo Tosatti wrote:
> On Fri, Aug 26, 2011 at 11:18:01AM +0800, Xiao Guangrong wrote:
>> On 08/25/2011 09:47 PM, Marcelo Tosatti wrote:
>>
>>> I guess it is OK to be more trigger happy with zapping by ignoring
>>> the accessed bit, clearing the flood counter on page fault.
>>>
>>
>> Yeah, i like this way, is this patch good for you?
> 
> Looks fine, can you rerun kernbench?
> 

Sure, I tested the performance of this approach; here are the results:

The original way:
2m56.561
2m50.651
2m51.220
2m52.199
2m48.066

The way of using accessed bit:
2m51.194
2m55.980
2m50.755
2m47.396
2m46.807

The way of ignoring accessed bit:
2m45.547
2m44.551
2m55.840
2m56.333
2m45.534

I think the result is not bad.
 


end of thread, other threads:[~2011-08-26 14:22 UTC | newest]

Thread overview: 53+ messages
2011-07-26 11:24 [PATCH 0/11] KVM: x86: optimize for guest page written Xiao Guangrong
2011-07-26 11:25 ` [PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write Xiao Guangrong
2011-07-27  9:00   ` Avi Kivity
2011-07-27  9:37     ` Xiao Guangrong
2011-07-26 11:25 ` [PATCH 02/11] KVM: x86: cleanup pio/pout emulated Xiao Guangrong
2011-07-26 11:26 ` [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions Xiao Guangrong
2011-07-26 12:27   ` Gleb Natapov
2011-07-26 13:53     ` Avi Kivity
2011-07-27  1:47     ` Xiao Guangrong
2011-07-27  4:26       ` Gleb Natapov
2011-07-27  6:32         ` Xiao Guangrong
2011-07-27  7:51           ` Gleb Natapov
2011-07-27  9:36             ` Xiao Guangrong
2011-07-27  9:04   ` Avi Kivity
2011-07-27  9:37     ` Xiao Guangrong
2011-07-26 11:28 ` [PATCH 04/11] KVM: MMU: do not mark access bit on pte write path Xiao Guangrong
2011-07-27  9:08   ` Avi Kivity
2011-07-27 10:04     ` Xiao Guangrong
2011-07-26 11:28 ` [PATCH 05/11] KVM: MMU: cleanup FNAME(invlpg) Xiao Guangrong
2011-07-26 11:29 ` [PATCH 06/11] KVM: MMU: fast prefetch spte on invlpg path Xiao Guangrong
2011-07-26 11:29 ` [PATCH 07/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages Xiao Guangrong
2011-07-26 11:30 ` [PATCH 08/11] KVM: MMU: split kvm_mmu_pte_write function Xiao Guangrong
2011-07-26 11:31 ` [PATCH 09/11] KVM: MMU: remove the mismatch shadow page Xiao Guangrong
2011-07-27  9:11   ` Avi Kivity
2011-07-27  9:13     ` Avi Kivity
2011-07-27 10:05       ` Xiao Guangrong
2011-07-26 11:31 ` [PATCH 10/11] KVM: MMU: fix detecting misaligned accessed Xiao Guangrong
2011-07-27  9:15   ` Avi Kivity
2011-07-27 10:10     ` Xiao Guangrong
2011-07-26 11:32 ` [PATCH 11/11] KVM: MMU: improve write flooding detected Xiao Guangrong
2011-07-27  9:23   ` Avi Kivity
2011-07-27 10:20     ` Xiao Guangrong
2011-07-27 11:08       ` Avi Kivity
2011-07-28  2:43         ` Xiao Guangrong
  -- strict thread matches above, loose matches on Subject: below --
2011-08-16  6:40 [PATCH 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write Xiao Guangrong
2011-08-16  6:46 ` [PATCH 11/11] KVM: MMU: improve write flooding detected Xiao Guangrong
2011-08-23  8:00   ` Marcelo Tosatti
2011-08-23 10:55     ` Xiao Guangrong
2011-08-23 12:38       ` Marcelo Tosatti
2011-08-23 16:32         ` Xiao Guangrong
2011-08-23 19:09           ` Marcelo Tosatti
2011-08-23 20:16             ` Xiao Guangrong
2011-08-24 20:05               ` Marcelo Tosatti
2011-08-25  2:04                 ` Marcelo Tosatti
2011-08-25  4:42                   ` Avi Kivity
2011-08-25 13:21                     ` Marcelo Tosatti
2011-08-25 14:06                       ` Avi Kivity
2011-08-25 14:07                         ` Avi Kivity
2011-08-25  7:40                   ` Xiao Guangrong
2011-08-25  7:57             ` Xiao Guangrong
2011-08-25 13:47               ` Marcelo Tosatti
2011-08-26  3:18                 ` Xiao Guangrong
2011-08-26 10:53                   ` Marcelo Tosatti
2011-08-26 14:24                     ` Xiao Guangrong
