* [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
@ 2012-12-10 9:11 Xiao Guangrong
2012-12-10 9:12 ` [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault) Xiao Guangrong
` (5 more replies)
0 siblings, 6 replies; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-10 9:11 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
Changelog:
There are some changes based on Marcelo's and Gleb's review; thank you all!
- access indirect_shadow_pages under the protection of mmu-lock
- fix the issue where an unhandleable instruction accesses a large page
- add a new test case for large pages
The current reexecute_instruction can not reliably detect failed instruction
emulation. It allows the guest to retry all instructions except those that
access an error pfn.
For example, these cases can not be detected:
- when tdp is used
currently, it refuses to retry any instruction. If nested NPT is used, the
emulation may be caused by a shadow page, which can be fixed by unshadowing
that shadow page.
- for the shadow mmu
some cases are nested write-protection, for example, when the page we want to
write is used as a PDE that chains to itself. In this case, we should
stop the emulation and report the case to userspace.
There are two test cases based on kvm-unit-test that can trigger an infinite
loop in the current code (ept = 0); after this patchset, the error is reported
to Qemu.
Subject: [PATCH] access test: test unhandleable instruction
Test an instruction that cannot be handled by kvm
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
x86/access.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 53 insertions(+), 1 deletions(-)
diff --git a/x86/access.c b/x86/access.c
index 23a5995..9141c70 100644
--- a/x86/access.c
+++ b/x86/access.c
@@ -2,6 +2,7 @@
#include "libcflat.h"
#include "desc.h"
#include "processor.h"
+#include "vm.h"
#define smp_id() 0
@@ -739,6 +740,54 @@ err:
return 0;
}
+static int check_retry_unhandleable_ins(ac_pool_t *pool)
+{
+ unsigned long mem = 30 * 1024 * 1024;
+ unsigned long esp;
+ ac_test_t at;
+
+ ac_test_init(&at, (void *)(0x123406003000));
+ at.flags[AC_PDE_PRESENT] = at.flags[AC_PDE_WRITABLE] = 1;
+ at.flags[AC_PTE_PRESENT] = at.flags[AC_PTE_WRITABLE] = 1;
+ at.flags[AC_CPU_CR0_WP] = 1;
+
+ at.phys = mem;
+ ac_setup_specific_pages(&at, pool, mem, 0);
+
+ asm volatile("mov %%rsp, %%rax \n\t" : "=a"(esp));
+ asm volatile("mov %%rax, %%rsp \n\t" : : "a"(0x123406003000 + 0xf0));
+ asm volatile ("int $0x3 \n\t");
+ asm volatile("mov %%rax, %%rsp \n\t" : : "a"(esp));
+
+ return 1;
+}
+
+static int check_large_mapping_write_page_table(ac_pool_t *pool)
+{
+ unsigned long mem = 0x1000000;
+ unsigned long esp;
+ ac_test_t at;
+ ulong cr3;
+
+ ac_test_init(&at, (void *)(0x123400000000));
+ at.flags[AC_PDE_PRESENT] = at.flags[AC_PDE_WRITABLE] = 1;
+ at.flags[AC_PDE_PSE] = 1;
+ at.flags[AC_CPU_CR0_WP] = 1;
+
+ at.phys = mem;
+ ac_setup_specific_pages(&at, pool, mem, 0);
+
+ cr3 = read_cr3();
+ write_cr3(cr3);
+
+ asm volatile("mov %%rsp, %%rax \n\t" : "=a"(esp));
+ asm volatile("mov %%rax, %%rsp \n\t" : : "a"(0x123400000000 + 0x6f0));
+ asm volatile ("int $0x3 \n\t");
+ asm volatile("mov %%rax, %%rsp \n\t" : : "a"(esp));
+
+ return 1;
+}
+
int ac_test_exec(ac_test_t *at, ac_pool_t *pool)
{
int r;
@@ -756,7 +805,9 @@ const ac_test_fn ac_test_cases[] =
{
corrupt_hugepage_triger,
check_pfec_on_prefetch_pte,
- check_smep_andnot_wp
+ check_smep_andnot_wp,
+ check_retry_unhandleable_ins,
+ check_large_mapping_write_page_table
};
int ac_test_run(void)
@@ -770,6 +821,7 @@ int ac_test_run(void)
tests = successes = 0;
ac_env_int(&pool);
ac_test_init(&at, (void *)(0x123400000000 + 16 * smp_id()));
+
do {
if (at.flags[AC_CPU_CR4_SMEP] && (ptl2[2] & 0x4))
ptl2[2] -= 0x4;
--
1.7.7.6
* [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault)
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
@ 2012-12-10 9:12 ` Xiao Guangrong
2012-12-11 23:47 ` Marcelo Tosatti
2012-12-10 9:13 ` [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table Xiao Guangrong
` (4 subsequent siblings)
5 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-10 9:12 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Marcelo Tosatti, Gleb Natapov, LKML, KVM
With this change, no mmu-specific code remains in the common function, and
two parameters can be dropped from set_spte.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
arch/x86/kvm/mmu.c | 47 ++++++++++++-------------------------------
arch/x86/kvm/paging_tmpl.h | 25 ++++++++++++++++++----
2 files changed, 33 insertions(+), 39 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 01d7c2a..2a3c890 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2342,8 +2342,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
}
static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned pte_access, int user_fault,
- int write_fault, int level,
+ unsigned pte_access, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
bool can_unsync, bool host_writable)
{
@@ -2378,9 +2377,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte |= (u64)pfn << PAGE_SHIFT;
- if ((pte_access & ACC_WRITE_MASK)
- || (!vcpu->arch.mmu.direct_map && write_fault
- && !is_write_protection(vcpu) && !user_fault)) {
+ if (pte_access & ACC_WRITE_MASK) {
/*
* There are two cases:
@@ -2399,19 +2396,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
- if (!vcpu->arch.mmu.direct_map
- && !(pte_access & ACC_WRITE_MASK)) {
- spte &= ~PT_USER_MASK;
- /*
- * If we converted a user page to a kernel page,
- * so that the kernel can write to it when cr0.wp=0,
- * then we should prevent the kernel from executing it
- * if SMEP is enabled.
- */
- if (kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
- spte |= PT64_NX_MASK;
- }
-
/*
* Optimization: for pte sync, if spte was writable the hash
* lookup is unnecessary (and expensive). Write protection
@@ -2442,18 +2426,15 @@ done:
static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pt_access, unsigned pte_access,
- int user_fault, int write_fault,
- int *emulate, int level, gfn_t gfn,
- pfn_t pfn, bool speculative,
- bool host_writable)
+ int write_fault, int *emulate, int level, gfn_t gfn,
+ pfn_t pfn, bool speculative, bool host_writable)
{
int was_rmapped = 0;
int rmap_count;
- pgprintk("%s: spte %llx access %x write_fault %d"
- " user_fault %d gfn %llx\n",
+ pgprintk("%s: spte %llx access %x write_fault %d gfn %llx\n",
__func__, *sptep, pt_access,
- write_fault, user_fault, gfn);
+ write_fault, gfn);
if (is_rmap_spte(*sptep)) {
/*
@@ -2477,9 +2458,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
was_rmapped = 1;
}
- if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- level, gfn, pfn, speculative, true,
- host_writable)) {
+ if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
+ true, host_writable)) {
if (write_fault)
*emulate = 1;
kvm_mmu_flush_tlb(vcpu);
@@ -2571,10 +2551,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
return -1;
for (i = 0; i < ret; i++, gfn++, start++)
- mmu_set_spte(vcpu, start, ACC_ALL,
- access, 0, 0, NULL,
- sp->role.level, gfn,
- page_to_pfn(pages[i]), true, true);
+ mmu_set_spte(vcpu, start, ACC_ALL, access, 0, NULL,
+ sp->role.level, gfn, page_to_pfn(pages[i]),
+ true, true);
return 0;
}
@@ -2636,8 +2615,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
unsigned pte_access = ACC_ALL;
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
- 0, write, &emulate,
- level, gfn, pfn, prefault, map_writable);
+ write, &emulate, level, gfn, pfn,
+ prefault, map_writable);
direct_pte_prefetch(vcpu, iterator.sptep);
++vcpu->stat.pf_fixed;
break;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 891eb6d..ec481e9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -330,7 +330,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* we call mmu_set_spte() with host_writable = true because
* pte_prefetch_gfn_to_pfn always gets a writable pfn.
*/
- mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
+ mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0,
NULL, PT_PAGE_TABLE_LEVEL, gfn, pfn, true, true);
return true;
@@ -405,7 +405,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
*/
static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
struct guest_walker *gw,
- int user_fault, int write_fault, int hlevel,
+ int write_fault, int hlevel,
pfn_t pfn, bool map_writable, bool prefault)
{
struct kvm_mmu_page *sp = NULL;
@@ -478,7 +478,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
clear_sp_write_flooding_count(it.sptep);
mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
- user_fault, write_fault, &emulate, it.level,
+ write_fault, &emulate, it.level,
gw->gfn, pfn, prefault, map_writable);
FNAME(pte_prefetch)(vcpu, gw, it.sptep);
@@ -544,6 +544,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
return 0;
}
+ if (write_fault && !(walker.pte_access & ACC_WRITE_MASK) &&
+ !is_write_protection(vcpu) && !user_fault) {
+ walker.pte_access |= ACC_WRITE_MASK;
+ walker.pte_access &= ~ACC_USER_MASK;
+
+ /*
+ * If we converted a user page to a kernel page,
+ * so that the kernel can write to it when cr0.wp=0,
+ * then we should prevent the kernel from executing it
+ * if SMEP is enabled.
+ */
+ if (kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
+ walker.pte_access &= ~ACC_EXEC_MASK;
+ }
+
if (walker.level >= PT_DIRECTORY_LEVEL)
force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn);
else
@@ -572,7 +587,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
kvm_mmu_free_some_pages(vcpu);
if (!force_pt_level)
transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
- r = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
+ r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
level, pfn, map_writable, prefault);
++vcpu->stat.pf_fixed;
kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
@@ -747,7 +762,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
- set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
+ set_spte(vcpu, &sp->spt[i], pte_access,
PT_PAGE_TABLE_LEVEL, gfn,
spte_to_pfn(sp->spt[i]), true, false,
host_writable);
--
1.7.7.6
* [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
2012-12-10 9:12 ` [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault) Xiao Guangrong
@ 2012-12-10 9:13 ` Xiao Guangrong
2012-12-12 0:57 ` Marcelo Tosatti
2012-12-10 9:13 ` [PATCH v2 3/5] KVM: x86: clean up reexecute_instruction Xiao Guangrong
` (3 subsequent siblings)
5 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-10 9:13 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Marcelo Tosatti, Gleb Natapov, LKML, KVM
We have two issues in the current code:
- if the target gfn is used as its own page table, the guest will refault and
kvm will then use a small page size to map it. We need two #PFs to fix its
shadow page table
- sometimes, say an exception is triggered during a vm-exit caused by #PF
(see handle_exception() in vmx.c), we remove all the shadow pages shadowed
by the target gfn before going into the page fault path, which causes an
infinite loop:
delete shadow pages shadowed by the gfn -> try to use large page size to map
the gfn -> retry the access -> ...
To fix these, we can adjust the page size early if the target gfn is used as
a page table
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
arch/x86/kvm/mmu.c | 13 ++++---------
arch/x86/kvm/paging_tmpl.h | 33 ++++++++++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a3c890..54fc61e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2380,15 +2380,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (pte_access & ACC_WRITE_MASK) {
/*
- * There are two cases:
- * - the one is other vcpu creates new sp in the window
- * between mapping_level() and acquiring mmu-lock.
- * - the another case is the new sp is created by itself
- * (page-fault path) when guest uses the target gfn as
- * its page table.
- * Both of these cases can be fixed by allowing guest to
- * retry the access, it will refault, then we can establish
- * the mapping by using small page.
+ * Other vcpu creates new sp in the window between
+ * mapping_level() and acquiring mmu-lock. We can
+ * allow guest to retry the access, the mapping can
+ * be fixed if guest refault.
*/
if (level > PT_PAGE_TABLE_LEVEL &&
has_wrprotected_page(vcpu->kvm, gfn, level))
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ec481e9..32d77ff 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -491,6 +491,36 @@ out_gpte_changed:
return 0;
}
+ /*
+ * To see whether the mapped gfn can write its page table in the current
+ * mapping.
+ *
+ * It is the helper function of FNAME(page_fault). When guest uses large page
+ * size to map the writable gfn which is used as current page table, we should
+ * force kvm to use small page size to map it because new shadow page will be
+ * created when kvm establishes shadow page table that stop kvm using large
+ * page size. Do it early can avoid unnecessary #PF and emulation.
+ *
+ * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok
+ * since the PDPT is always shadowed, that means, we can not use large page
+ * size to map the gfn which is used as PDPT.
+ */
+static bool
+FNAME(mapped_gfn_can_write_current_pagetable)(struct guest_walker *walker)
+{
+ int level;
+ gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker->level) - 1);
+
+ if (!(walker->pte_access & ACC_WRITE_MASK))
+ return false;
+
+ for (level = walker->level; level <= walker->max_level; level++)
+ if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))
+ return true;
+
+ return false;
+}
+
/*
* Page fault handler. There are several causes for a page fault:
* - there is no shadow pte for the guest pte
@@ -560,7 +590,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
}
if (walker.level >= PT_DIRECTORY_LEVEL)
- force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn);
+ force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn)
+ || FNAME(mapped_gfn_can_write_current_pagetable)(&walker);
else
force_pt_level = 1;
if (!force_pt_level) {
--
1.7.7.6
* [PATCH v2 3/5] KVM: x86: clean up reexecute_instruction
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
2012-12-10 9:12 ` [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault) Xiao Guangrong
2012-12-10 9:13 ` [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table Xiao Guangrong
@ 2012-12-10 9:13 ` Xiao Guangrong
2012-12-10 9:14 ` [PATCH v2 4/5] KVM: x86: let reexecute_instruction work for tdp Xiao Guangrong
` (2 subsequent siblings)
5 siblings, 0 replies; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-10 9:13 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Marcelo Tosatti, Gleb Natapov, LKML, KVM
A small cleanup of reexecute_instruction; also use gpa_to_gfn() in
retry_instruction().
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
arch/x86/kvm/x86.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e87be93b..1c67873 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4761,19 +4761,18 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva)
if (tdp_enabled)
return false;
+ gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
+ if (gpa == UNMAPPED_GVA)
+ return true; /* let cpu generate fault */
+
/*
* if emulation was due to access to shadowed page table
* and it failed try to unshadow page and re-enter the
* guest to let CPU execute the instruction.
*/
- if (kvm_mmu_unprotect_page_virt(vcpu, gva))
+ if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
return true;
- gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
-
- if (gpa == UNMAPPED_GVA)
- return true; /* let cpu generate fault */
-
/*
* Do not retry the unhandleable instruction if it faults on the
* readonly host memory, otherwise it will goto a infinite loop:
@@ -4828,7 +4827,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
if (!vcpu->arch.mmu.direct_map)
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
- kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+ kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
return true;
}
--
1.7.7.6
* [PATCH v2 4/5] KVM: x86: let reexecute_instruction work for tdp
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
` (2 preceding siblings ...)
2012-12-10 9:13 ` [PATCH v2 3/5] KVM: x86: clean up reexecute_instruction Xiao Guangrong
@ 2012-12-10 9:14 ` Xiao Guangrong
2012-12-10 9:14 ` [PATCH v2 5/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
2012-12-11 23:36 ` [PATCH v2 0/5] " Marcelo Tosatti
5 siblings, 0 replies; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-10 9:14 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Marcelo Tosatti, Gleb Natapov, LKML, KVM
Currently, reexecute_instruction refuses to retry any instruction when tdp is
enabled. If nested NPT is used, the emulation may be caused by a shadow page,
which can be fixed by dropping that shadow page.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
arch/x86/kvm/x86.c | 19 +++++++++++++------
1 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1c67873..3796f8c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4753,17 +4753,24 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
return r;
}
-static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva)
+static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
{
- gpa_t gpa;
+ gpa_t gpa = cr2;
pfn_t pfn;
+ unsigned int indirect_shadow_pages;
+
+ spin_lock(&vcpu->kvm->mmu_lock);
+ indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
+ spin_unlock(&vcpu->kvm->mmu_lock);
- if (tdp_enabled)
+ if (!indirect_shadow_pages)
return false;
- gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
- if (gpa == UNMAPPED_GVA)
- return true; /* let cpu generate fault */
+ if (!vcpu->arch.mmu.direct_map) {
+ gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
+ if (gpa == UNMAPPED_GVA)
+ return true; /* let cpu generate fault */
+ }
/*
* if emulation was due to access to shadowed page table
--
1.7.7.6
* [PATCH v2 5/5] KVM: x86: improve reexecute_instruction
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
` (3 preceding siblings ...)
2012-12-10 9:14 ` [PATCH v2 4/5] KVM: x86: let reexecute_instruction work for tdp Xiao Guangrong
@ 2012-12-10 9:14 ` Xiao Guangrong
2012-12-12 1:09 ` Marcelo Tosatti
2012-12-11 23:36 ` [PATCH v2 0/5] " Marcelo Tosatti
5 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-10 9:14 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Marcelo Tosatti, Gleb Natapov, LKML, KVM
The current reexecute_instruction can not reliably detect failed instruction
emulation. It allows the guest to retry all instructions except those that
access an error pfn.
For example, some cases are nested write-protection: the page we want to
write is used as a PDE that chains to itself. In this case, we should
stop the emulation and report the case to userspace.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/paging_tmpl.h | 2 +
arch/x86/kvm/x86.c | 82 +++++++++++++++++++++++++++++----------
3 files changed, 65 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc87b65..8d01c02 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,6 +575,8 @@ struct kvm_arch {
u64 hv_guest_os_id;
u64 hv_hypercall;
+ /* synchronizing reexecute_instruction and page fault path. */
+ u64 page_fault_count;
#ifdef CONFIG_KVM_MMU_AUDIT
int audit_point;
#endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 32d77ff..85b8e0e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -614,6 +614,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
goto out_unlock;
+ vcpu->kvm->arch.page_fault_count++;
+
kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
kvm_mmu_free_some_pages(vcpu);
if (!force_pt_level)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3796f8c..5677869 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4756,29 +4756,27 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
{
gpa_t gpa = cr2;
+ gfn_t gfn;
pfn_t pfn;
- unsigned int indirect_shadow_pages;
-
- spin_lock(&vcpu->kvm->mmu_lock);
- indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- if (!indirect_shadow_pages)
- return false;
+ u64 page_fault_count;
+ int emulate;
if (!vcpu->arch.mmu.direct_map) {
- gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
+ /*
+ * Write permission should be allowed since only
+ * write access need to be emulated.
+ */
+ gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
+
+ /*
+ * If the mapping is invalid in guest, let cpu retry
+ * it to generate fault.
+ */
if (gpa == UNMAPPED_GVA)
- return true; /* let cpu generate fault */
+ return true;
}
- /*
- * if emulation was due to access to shadowed page table
- * and it failed try to unshadow page and re-enter the
- * guest to let CPU execute the instruction.
- */
- if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
- return true;
+ gfn = gpa_to_gfn(gpa);
/*
* Do not retry the unhandleable instruction if it faults on the
@@ -4786,13 +4784,55 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
* retry instruction -> write #PF -> emulation fail -> retry
* instruction -> ...
*/
- pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
- if (!is_error_noslot_pfn(pfn)) {
- kvm_release_pfn_clean(pfn);
+ pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+ /*
+ * If the instruction failed on the error pfn, it can not be fixed,
+ * report the error to userspace.
+ */
+ if (is_error_noslot_pfn(pfn))
+ return false;
+
+ kvm_release_pfn_clean(pfn);
+
+ /* The instructions are well-emulated on direct mmu. */
+ if (vcpu->arch.mmu.direct_map) {
+ unsigned int indirect_shadow_pages;
+
+ spin_lock(&vcpu->kvm->mmu_lock);
+ indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
+ spin_unlock(&vcpu->kvm->mmu_lock);
+
+ if (indirect_shadow_pages)
+ kvm_mmu_unprotect_page(vcpu->kvm, gfn);
+
return true;
}
- return false;
+again:
+ page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
+
+ /*
+ * The instruction emulation is caused by fault access on cr2.
+ * After unprotect the target page, we call
+ * vcpu->arch.mmu.page_fault to fix the mapping of cr2. If it
+ * return 1, mmu can not fix the mapping, we should report the
+ * error, otherwise it is good to return to guest and let it
+ * re-execute the instruction again.
+ *
+ * page_fault_count is used to avoid the race on other vcpus,
+ * since after we unprotect the target page, other cpu can enter
+ * page fault path and let the page be write-protected again.
+ */
+ kvm_mmu_unprotect_page(vcpu->kvm, gfn);
+ emulate = vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, false);
+
+ /* The page fault path called above can increase the count. */
+ if (page_fault_count + 1 !=
+ ACCESS_ONCE(vcpu->kvm->arch.page_fault_count))
+ goto again;
+
+ return !emulate;
}
static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
--
1.7.7.6
* Re: [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
` (4 preceding siblings ...)
2012-12-10 9:14 ` [PATCH v2 5/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
@ 2012-12-11 23:36 ` Marcelo Tosatti
2012-12-12 20:05 ` Xiao Guangrong
5 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-11 23:36 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Mon, Dec 10, 2012 at 05:11:35PM +0800, Xiao Guangrong wrote:
> Changelog:
> There are some changes based on Marcelo's and Gleb's review; thank you all!
> - access indirect_shadow_pages under the protection of mmu-lock
> - fix the issue where an unhandleable instruction accesses a large page
> - add a new test case for large pages
>
> The current reexecute_instruction can not reliably detect failed instruction
> emulation. It allows the guest to retry all instructions except those that
> access an error pfn.
>
> For example, these cases can not be detected:
> - when tdp is used
> currently, it refuses to retry any instruction. If nested NPT is used, the
> emulation may be caused by a shadow page, which can be fixed by unshadowing
> that shadow page.
>
> - for the shadow mmu
> some cases are nested write-protection, for example, when the page we want to
> write is used as a PDE that chains to itself. In this case, we should
> stop the emulation and report the case to userspace.
>
> There are two test cases based on kvm-unit-test that can trigger an infinite
> loop in the current code (ept = 0); after this patchset, the error is reported
> to Qemu.
>
> Subject: [PATCH] access test: test unhandleable instruction
>
> Test an instruction that cannot be handled by kvm
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Please submit the test for inclusion. There should be some way to make it
fail... program a timer interrupt and #GP?
* Re: [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault)
2012-12-10 9:12 ` [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault) Xiao Guangrong
@ 2012-12-11 23:47 ` Marcelo Tosatti
2012-12-12 18:53 ` Xiao Guangrong
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-11 23:47 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Mon, Dec 10, 2012 at 05:12:20PM +0800, Xiao Guangrong wrote:
> With this change, no mmu-specific code remains in the common function, and
> two parameters can be dropped from set_spte.
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
> arch/x86/kvm/mmu.c | 47 ++++++++++++-------------------------------
> arch/x86/kvm/paging_tmpl.h | 25 ++++++++++++++++++----
> 2 files changed, 33 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 01d7c2a..2a3c890 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2342,8 +2342,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> }
>
> static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> - unsigned pte_access, int user_fault,
> - int write_fault, int level,
> + unsigned pte_access, int level,
> gfn_t gfn, pfn_t pfn, bool speculative,
> bool can_unsync, bool host_writable)
> {
> @@ -2378,9 +2377,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>
> spte |= (u64)pfn << PAGE_SHIFT;
>
> - if ((pte_access & ACC_WRITE_MASK)
> - || (!vcpu->arch.mmu.direct_map && write_fault
> - && !is_write_protection(vcpu) && !user_fault)) {
> + if (pte_access & ACC_WRITE_MASK) {
>
> /*
> * There are two cases:
> @@ -2399,19 +2396,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>
> spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
>
> - if (!vcpu->arch.mmu.direct_map
> - && !(pte_access & ACC_WRITE_MASK)) {
> - spte &= ~PT_USER_MASK;
> - /*
> - * If we converted a user page to a kernel page,
> - * so that the kernel can write to it when cr0.wp=0,
> - * then we should prevent the kernel from executing it
> - * if SMEP is enabled.
> - */
> - if (kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
> - spte |= PT64_NX_MASK;
> - }
> -
> /*
> * Optimization: for pte sync, if spte was writable the hash
> * lookup is unnecessary (and expensive). Write protection
> @@ -2442,18 +2426,15 @@ done:
>
> static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> unsigned pt_access, unsigned pte_access,
> - int user_fault, int write_fault,
> - int *emulate, int level, gfn_t gfn,
> - pfn_t pfn, bool speculative,
> - bool host_writable)
> + int write_fault, int *emulate, int level, gfn_t gfn,
> + pfn_t pfn, bool speculative, bool host_writable)
> {
> int was_rmapped = 0;
> int rmap_count;
>
> - pgprintk("%s: spte %llx access %x write_fault %d"
> - " user_fault %d gfn %llx\n",
> + pgprintk("%s: spte %llx access %x write_fault %d gfn %llx\n",
> __func__, *sptep, pt_access,
> - write_fault, user_fault, gfn);
> + write_fault, gfn);
>
> if (is_rmap_spte(*sptep)) {
> /*
> @@ -2477,9 +2458,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> was_rmapped = 1;
> }
>
> - if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
> - level, gfn, pfn, speculative, true,
> - host_writable)) {
> + if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
> + true, host_writable)) {
> if (write_fault)
> *emulate = 1;
> kvm_mmu_flush_tlb(vcpu);
> @@ -2571,10 +2551,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
> return -1;
>
> for (i = 0; i < ret; i++, gfn++, start++)
> - mmu_set_spte(vcpu, start, ACC_ALL,
> - access, 0, 0, NULL,
> - sp->role.level, gfn,
> - page_to_pfn(pages[i]), true, true);
> + mmu_set_spte(vcpu, start, ACC_ALL, access, 0, NULL,
> + sp->role.level, gfn, page_to_pfn(pages[i]),
> + true, true);
>
> return 0;
> }
> @@ -2636,8 +2615,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> unsigned pte_access = ACC_ALL;
>
> mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
> - 0, write, &emulate,
> - level, gfn, pfn, prefault, map_writable);
> + write, &emulate, level, gfn, pfn,
> + prefault, map_writable);
> direct_pte_prefetch(vcpu, iterator.sptep);
> ++vcpu->stat.pf_fixed;
> break;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 891eb6d..ec481e9 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -330,7 +330,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> * we call mmu_set_spte() with host_writable = true because
> * pte_prefetch_gfn_to_pfn always gets a writable pfn.
> */
> - mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
> + mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0,
> NULL, PT_PAGE_TABLE_LEVEL, gfn, pfn, true, true);
>
> return true;
> @@ -405,7 +405,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
> */
> static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> struct guest_walker *gw,
> - int user_fault, int write_fault, int hlevel,
> + int write_fault, int hlevel,
> pfn_t pfn, bool map_writable, bool prefault)
> {
> struct kvm_mmu_page *sp = NULL;
> @@ -478,7 +478,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>
> clear_sp_write_flooding_count(it.sptep);
> mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
> - user_fault, write_fault, &emulate, it.level,
> + write_fault, &emulate, it.level,
> gw->gfn, pfn, prefault, map_writable);
> FNAME(pte_prefetch)(vcpu, gw, it.sptep);
>
> @@ -544,6 +544,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
> return 0;
> }
>
> + if (write_fault && !(walker.pte_access & ACC_WRITE_MASK) &&
> + !is_write_protection(vcpu) && !user_fault) {
> + walker.pte_access |= ACC_WRITE_MASK;
> + walker.pte_access &= ~ACC_USER_MASK;
> +
> + /*
> + * If we converted a user page to a kernel page,
> + * so that the kernel can write to it when cr0.wp=0,
> + * then we should prevent the kernel from executing it
> + * if SMEP is enabled.
> + */
> + if (kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
> + walker.pte_access &= ~ACC_EXEC_MASK;
> + }
Don't think you should modify walker.pte_access here, since it can be
used afterwards (eg for handle_abnormal_pfn).
BTW, your patch is fixing a bug:
host_writable is ignored for CR0.WP emulation:
if (host_writable)
spte |= SPTE_HOST_WRITEABLE;
else
pte_access &= ~ACC_WRITE_MASK;
spte |= (u64)pfn << PAGE_SHIFT;
if ((pte_access & ACC_WRITE_MASK)
|| (!vcpu->arch.mmu.direct_map && write_fault
&& !is_write_protection(vcpu) && !user_fault)) {
* Re: [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table
2012-12-10 9:13 ` [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table Xiao Guangrong
@ 2012-12-12 0:57 ` Marcelo Tosatti
2012-12-12 19:23 ` Xiao Guangrong
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-12 0:57 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Mon, Dec 10, 2012 at 05:13:03PM +0800, Xiao Guangrong wrote:
> We have two issues in the current code:
> - if the target gfn is used as its own page table, the guest will refault and
> kvm will then use a small page size to map it. We need two #PFs to fix its
> shadow page table
>
> - sometimes, say an exception is triggered during a vm-exit caused by #PF
> (see handle_exception() in vmx.c), we remove all the shadow pages shadowed
> by the target gfn before going into the page fault path, which causes an
> infinite loop:
> delete shadow pages shadowed by the gfn -> try to use large page size to map
> the gfn -> retry the access -> ...
>
> To fix these, we can adjust the page size early if the target gfn is used as
> a page table
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
> arch/x86/kvm/mmu.c | 13 ++++---------
> arch/x86/kvm/paging_tmpl.h | 33 ++++++++++++++++++++++++++++++++-
> 2 files changed, 36 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2a3c890..54fc61e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2380,15 +2380,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> if (pte_access & ACC_WRITE_MASK) {
>
> /*
> - * There are two cases:
> - * - the one is other vcpu creates new sp in the window
> - * between mapping_level() and acquiring mmu-lock.
> - * - the another case is the new sp is created by itself
> - * (page-fault path) when guest uses the target gfn as
> - * its page table.
> - * Both of these cases can be fixed by allowing guest to
> - * retry the access, it will refault, then we can establish
> - * the mapping by using small page.
> + * Other vcpu creates new sp in the window between
> + * mapping_level() and acquiring mmu-lock. We can
> + * allow guest to retry the access, the mapping can
> + * be fixed if guest refault.
> */
> if (level > PT_PAGE_TABLE_LEVEL &&
> has_wrprotected_page(vcpu->kvm, gfn, level))
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index ec481e9..32d77ff 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -491,6 +491,36 @@ out_gpte_changed:
> return 0;
> }
>
> + /*
> + * To see whether the mapped gfn can write its page table in the current
> + * mapping.
> + *
> + * It is the helper function of FNAME(page_fault). When guest uses large page
> + * size to map the writable gfn which is used as current page table, we should
> + * force kvm to use small page size to map it because new shadow page will be
> + * created when kvm establishes shadow page table that stop kvm using large
> + * page size. Do it early can avoid unnecessary #PF and emulation.
> + *
> + * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok
> + * since the PDPT is always shadowed, that means, we can not use large page
> + * size to map the gfn which is used as PDPT.
> + */
> +static bool
> +FNAME(mapped_gfn_can_write_current_pagetable)(struct guest_walker *walker)
> +{
> + int level;
> + gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker->level) - 1);
> +
> + if (!(walker->pte_access & ACC_WRITE_MASK))
> + return false;
> +
> + for (level = walker->level; level <= walker->max_level; level++)
> + if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))
> + return true;
XOR won't work. Just check with sums and integer comparison, i.e.
walker->gfn + KVM_PAGES_PER_HPAGE(walker->level).
Moreover, it's confusing to have it checked at this level. What about doing
it at reexecute_instruction?
* Re: [PATCH v2 5/5] KVM: x86: improve reexecute_instruction
2012-12-10 9:14 ` [PATCH v2 5/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
@ 2012-12-12 1:09 ` Marcelo Tosatti
2012-12-12 19:29 ` Xiao Guangrong
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-12 1:09 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Mon, Dec 10, 2012 at 05:14:47PM +0800, Xiao Guangrong wrote:
> The current reexecute_instruction can not reliably detect failed instruction
> emulation. It allows the guest to retry all instructions except those that
> access an error pfn.
>
> For example, some cases are nested write-protection: the page we want to
> write is used as a PDE that chains to itself. In this case, we should
> stop the emulation and report the case to userspace.
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/paging_tmpl.h | 2 +
> arch/x86/kvm/x86.c | 82 +++++++++++++++++++++++++++++----------
> 3 files changed, 65 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dc87b65..8d01c02 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -575,6 +575,8 @@ struct kvm_arch {
> u64 hv_guest_os_id;
> u64 hv_hypercall;
>
> + /* synchronizing reexecute_instruction and page fault path. */
> + u64 page_fault_count;
> #ifdef CONFIG_KVM_MMU_AUDIT
> int audit_point;
> #endif
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 32d77ff..85b8e0e 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -614,6 +614,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
> if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> goto out_unlock;
>
> + vcpu->kvm->arch.page_fault_count++;
> +
> kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
> kvm_mmu_free_some_pages(vcpu);
> if (!force_pt_level)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3796f8c..5677869 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4756,29 +4756,27 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
> static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
> {
> gpa_t gpa = cr2;
> + gfn_t gfn;
> pfn_t pfn;
> - unsigned int indirect_shadow_pages;
> -
> - spin_lock(&vcpu->kvm->mmu_lock);
> - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> - spin_unlock(&vcpu->kvm->mmu_lock);
> -
> - if (!indirect_shadow_pages)
> - return false;
> + u64 page_fault_count;
> + int emulate;
>
> if (!vcpu->arch.mmu.direct_map) {
> - gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
> + /*
> + * Write permission should be allowed since only
> + * write access need to be emulated.
> + */
> + gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
> +
> + /*
> + * If the mapping is invalid in guest, let cpu retry
> + * it to generate fault.
> + */
> if (gpa == UNMAPPED_GVA)
> - return true; /* let cpu generate fault */
> + return true;
> }
>
> - /*
> - * if emulation was due to access to shadowed page table
> - * and it failed try to unshadow page and re-enter the
> - * guest to let CPU execute the instruction.
> - */
> - if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
> - return true;
> + gfn = gpa_to_gfn(gpa);
>
> /*
> * Do not retry the unhandleable instruction if it faults on the
> @@ -4786,13 +4784,55 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
> * retry instruction -> write #PF -> emulation fail -> retry
> * instruction -> ...
> */
> - pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> - if (!is_error_noslot_pfn(pfn)) {
> - kvm_release_pfn_clean(pfn);
> + pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +
> + /*
> + * If the instruction failed on the error pfn, it can not be fixed,
> + * report the error to userspace.
> + */
> + if (is_error_noslot_pfn(pfn))
> + return false;
> +
> + kvm_release_pfn_clean(pfn);
> +
> + /* The instructions are well-emulated on direct mmu. */
> + if (vcpu->arch.mmu.direct_map) {
> + unsigned int indirect_shadow_pages;
> +
> + spin_lock(&vcpu->kvm->mmu_lock);
> + indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> + spin_unlock(&vcpu->kvm->mmu_lock);
> +
> + if (indirect_shadow_pages)
> + kvm_mmu_unprotect_page(vcpu->kvm, gfn);
> +
> return true;
> }
>
> - return false;
> +again:
> + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
> +
> + /*
> + * The instruction emulation is caused by fault access on cr2.
> + * After unprotect the target page, we call
> + * vcpu->arch.mmu.page_fault to fix the mapping of cr2. If it
> + * return 1, mmu can not fix the mapping, we should report the
> + * error, otherwise it is good to return to guest and let it
> + * re-execute the instruction again.
> + *
> + * page_fault_count is used to avoid the race on other vcpus,
> + * since after we unprotect the target page, other cpu can enter
> + * page fault path and let the page be write-protected again.
> + */
> + kvm_mmu_unprotect_page(vcpu->kvm, gfn);
> + emulate = vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, false);
> +
> + /* The page fault path called above can increase the count. */
> + if (page_fault_count + 1 !=
> + ACCESS_ONCE(vcpu->kvm->arch.page_fault_count))
> + goto again;
> +
> + return !emulate;
> }
>
> static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
Same comment as before: the only case where it should not attempt to
emulate is when there is a condition which makes it impossible to fix
(the information is available to detect that condition).
The earlier suggestion
"How about recording the gfn number for shadow pages that have been
shadowed in the current pagefault run?"
Was about that.
* Re: [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault)
2012-12-11 23:47 ` Marcelo Tosatti
@ 2012-12-12 18:53 ` Xiao Guangrong
0 siblings, 0 replies; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-12 18:53 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
On 12/12/2012 07:47 AM, Marcelo Tosatti wrote:
>> + if (write_fault && !(walker.pte_access & ACC_WRITE_MASK) &&
>> + !is_write_protection(vcpu) && !user_fault) {
>> + walker.pte_access |= ACC_WRITE_MASK;
>> + walker.pte_access &= ~ACC_USER_MASK;
>> +
>> + /*
>> + * If we converted a user page to a kernel page,
>> + * so that the kernel can write to it when cr0.wp=0,
>> + * then we should prevent the kernel from executing it
>> + * if SMEP is enabled.
>> + */
>> + if (kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
>> + walker.pte_access &= ~ACC_EXEC_MASK;
>> + }
>
> Don't think you should modify walker.pte_access here, since it can be
> used afterwards (eg for handle_abnormal_pfn).
Yes, you're right. It will cache !U+W instead of U+!W into the mmio spte,
which causes mmio accesses from userspace to always fail. I will recheck it
carefully.
Hmm, the current code seems buggy if CR0.WP = 0. Say two mappings map to the
same gfn, both of them using large pages in the guest while kvm uses small
page size. If the guest write-faults on the first mapping, kvm will create a
writable spte (!U + W) in the readonly sp (sp.role.access = readonly). Then,
on a read fault on the second mapping, it will establish the shadow page
table using the readonly sp created by the first mapping, so the second
mapping gets a writable spte even though the Dirty bit in the second mapping
is not set.
>
> BTW, your patch is fixing a bug:
>
> host_writable is ignored for CR0.WP emulation:
>
> if (host_writable)
> spte |= SPTE_HOST_WRITEABLE;
> else
> pte_access &= ~ACC_WRITE_MASK;
>
> spte |= (u64)pfn << PAGE_SHIFT;
>
> if ((pte_access & ACC_WRITE_MASK)
> || (!vcpu->arch.mmu.direct_map && write_fault
> && !is_write_protection(vcpu) && !user_fault)) {
I noticed it too, but it is not a bug, because the access is adjusted only
if it is a write fault. For a write #PF, the pfn is always writable on the
host.
* Re: [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table
2012-12-12 0:57 ` Marcelo Tosatti
@ 2012-12-12 19:23 ` Xiao Guangrong
2012-12-13 22:37 ` Marcelo Tosatti
0 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-12 19:23 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
On 12/12/2012 08:57 AM, Marcelo Tosatti wrote:
> On Mon, Dec 10, 2012 at 05:13:03PM +0800, Xiao Guangrong wrote:
>> We have two issues in the current code:
>> - if the target gfn is used as its own page table, the guest will refault and
>> kvm will then use a small page size to map it. We need two #PFs to fix its
>> shadow page table
>>
>> - sometimes, say an exception is triggered during a vm-exit caused by #PF
>> (see handle_exception() in vmx.c), we remove all the shadow pages shadowed
>> by the target gfn before going into the page fault path, which causes an
>> infinite loop:
>> delete shadow pages shadowed by the gfn -> try to use large page size to map
>> the gfn -> retry the access -> ...
>>
>> To fix these, we can adjust the page size early if the target gfn is used as
>> a page table
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>> arch/x86/kvm/mmu.c | 13 ++++---------
>> arch/x86/kvm/paging_tmpl.h | 33 ++++++++++++++++++++++++++++++++-
>> 2 files changed, 36 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 2a3c890..54fc61e 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2380,15 +2380,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>> if (pte_access & ACC_WRITE_MASK) {
>>
>> /*
>> - * There are two cases:
>> - * - the one is other vcpu creates new sp in the window
>> - * between mapping_level() and acquiring mmu-lock.
>> - * - the another case is the new sp is created by itself
>> - * (page-fault path) when guest uses the target gfn as
>> - * its page table.
>> - * Both of these cases can be fixed by allowing guest to
>> - * retry the access, it will refault, then we can establish
>> - * the mapping by using small page.
>> + * Other vcpu creates new sp in the window between
>> + * mapping_level() and acquiring mmu-lock. We can
>> + * allow guest to retry the access, the mapping can
>> + * be fixed if guest refault.
>> */
>> if (level > PT_PAGE_TABLE_LEVEL &&
>> has_wrprotected_page(vcpu->kvm, gfn, level))
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index ec481e9..32d77ff 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -491,6 +491,36 @@ out_gpte_changed:
>> return 0;
>> }
>>
>> + /*
>> + * To see whether the mapped gfn can write its page table in the current
>> + * mapping.
>> + *
>> + * It is the helper function of FNAME(page_fault). When guest uses large page
>> + * size to map the writable gfn which is used as current page table, we should
>> + * force kvm to use small page size to map it because new shadow page will be
>> + * created when kvm establishes shadow page table that stop kvm using large
>> + * page size. Do it early can avoid unnecessary #PF and emulation.
>> + *
>> + * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok
>> + * since the PDPT is always shadowed, that means, we can not use large page
>> + * size to map the gfn which is used as PDPT.
>> + */
>> +static bool
>> +FNAME(mapped_gfn_can_write_current_pagetable)(struct guest_walker *walker)
>> +{
>> + int level;
>> + gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker->level) - 1);
>> +
>> + if (!(walker->pte_access & ACC_WRITE_MASK))
>> + return false;
>> +
>> + for (level = walker->level; level <= walker->max_level; level++)
>> + if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))
>> + return true;
>
> XOR won't work. Just check with sums and integer comparison, i.e.
> walker->gfn + KVM_PAGES_PER_HPAGE(walker->level).
That can not work since walker->gfn is not aligned to the large page size.
For example, if the guest uses a large page to map 0x123000000 to physical
address 0-2M and then faults on 0x123001000, walker->gfn corresponds to guest
physical address 0x1000, in the middle of the 2M area.
The code "if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))" is the
same as "if ((walker->gfn & mask) == (walker->table_gfn[level - 1] & mask))":
if any page in the large page area is used as a page table, we should use 4K
page size to fix it.
In the above example, if table_gfn is in the area [0, 2M), kvm is forced to
use 4K page size.
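To make the equivalence concrete, here is a minimal stand-alone sketch (plain
user-space C, not kernel code; the 512 pages-per-2M constant and the sample
frame numbers are assumptions picked for illustration) that checks the
XOR-and-mask test against the masked-gfn comparison:

#include <assert.h>
#include <stdio.h>

typedef unsigned long long gfn_t;

int main(void)
{
	gfn_t pages_per_hpage = 512;	/* pages covered by a 2M large page */
	gfn_t mask = ~(pages_per_hpage - 1);

	gfn_t gfn = 0x1;		/* faulting gfn, inside frames [0, 512) */
	gfn_t table_gfn = 0x1ff;	/* guest page-table frame in the same 2M area */

	int xor_check = !((gfn ^ table_gfn) & mask);
	int cmp_check = (gfn & mask) == (table_gfn & mask);

	/* Both forms only ask "same large-page area?", so they must agree. */
	assert(xor_check == cmp_check);

	/* Prints 1: the page table lives inside the mapped area, force 4K pages. */
	printf("same large-page area: %d\n", xor_check);
	return 0;
}

Both tests return true exactly when the faulting gfn and the guest page-table
gfn fall in the same large-page-sized area, which is the condition that should
force the 4K mapping.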
>
> Moreover, it's confusing to have it checked at this level. What about doing
> it at reexecute_instruction?
Hmm, this patch is trying to fix a bug described in the changelog:
======
- sometimes, say an exception is triggered during a vm-exit caused by #PF
(see handle_exception() in vmx.c), we remove all the shadow pages shadowed
by the target gfn before going into the page fault path, which causes an
infinite loop:
delete shadow pages shadowed by the gfn -> try to use large page size to map
the gfn -> retry the access -> ...
======
Which is caused by this code:
if (is_page_fault(intr_info)) {
/* EPT won't cause page fault directly */
BUG_ON(enable_ept);
cr2 = vmcs_readl(EXIT_QUALIFICATION);
trace_kvm_page_fault(cr2, error_code);
if (kvm_event_needs_reinjection(vcpu))
kvm_mmu_unprotect_page_virt(vcpu, cr2);
return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
}
This bug was introduced by commit c219346325.
Another way to fix it is to make this change:
@@ -2395,7 +2395,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
*/
if (level > PT_PAGE_TABLE_LEVEL &&
has_wrprotected_page(vcpu->kvm, gfn, level))
- goto done;
+ return 1;
The disadvantage of this approach is that it causes unnecessary emulation.
For example, if 0-2M is mapped in the guest and only page 0 is used as a page
table, any write to [4K, 2M) will need to be emulated.
What do you think?
* Re: [PATCH v2 5/5] KVM: x86: improve reexecute_instruction
2012-12-12 1:09 ` Marcelo Tosatti
@ 2012-12-12 19:29 ` Xiao Guangrong
2012-12-13 23:02 ` Marcelo Tosatti
0 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-12 19:29 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
On 12/12/2012 09:09 AM, Marcelo Tosatti wrote:
> On Mon, Dec 10, 2012 at 05:14:47PM +0800, Xiao Guangrong wrote:
>> The current reexecute_instruction can not reliably detect failed instruction
>> emulation. It allows the guest to retry all instructions except those that
>> access an error pfn.
>>
>> For example, some cases are nested write-protection: the page we want to
>> write is used as a PDE that chains to itself. In this case, we should
>> stop the emulation and report the case to userspace.
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>> arch/x86/include/asm/kvm_host.h | 2 +
>> arch/x86/kvm/paging_tmpl.h | 2 +
>> arch/x86/kvm/x86.c | 82 +++++++++++++++++++++++++++++----------
>> 3 files changed, 65 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index dc87b65..8d01c02 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -575,6 +575,8 @@ struct kvm_arch {
>> u64 hv_guest_os_id;
>> u64 hv_hypercall;
>>
>> + /* synchronizing reexecute_instruction and page fault path. */
>> + u64 page_fault_count;
>> #ifdef CONFIG_KVM_MMU_AUDIT
>> int audit_point;
>> #endif
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 32d77ff..85b8e0e 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -614,6 +614,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
>> if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
>> goto out_unlock;
>>
>> + vcpu->kvm->arch.page_fault_count++;
>> +
>> kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
>> kvm_mmu_free_some_pages(vcpu);
>> if (!force_pt_level)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 3796f8c..5677869 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4756,29 +4756,27 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
>> static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
>> {
>> gpa_t gpa = cr2;
>> + gfn_t gfn;
>> pfn_t pfn;
>> - unsigned int indirect_shadow_pages;
>> -
>> - spin_lock(&vcpu->kvm->mmu_lock);
>> - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
>> - spin_unlock(&vcpu->kvm->mmu_lock);
>> -
>> - if (!indirect_shadow_pages)
>> - return false;
>> + u64 page_fault_count;
>> + int emulate;
>>
>> if (!vcpu->arch.mmu.direct_map) {
>> - gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
>> + /*
>> + * Write permission should be allowed since only
>> + * write access need to be emulated.
>> + */
>> + gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
>> +
>> + /*
>> + * If the mapping is invalid in guest, let cpu retry
>> + * it to generate fault.
>> + */
>> if (gpa == UNMAPPED_GVA)
>> - return true; /* let cpu generate fault */
>> + return true;
>> }
>>
>> - /*
>> - * if emulation was due to access to shadowed page table
>> - * and it failed try to unshadow page and re-enter the
>> - * guest to let CPU execute the instruction.
>> - */
>> - if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
>> - return true;
>> + gfn = gpa_to_gfn(gpa);
>>
>> /*
>> * Do not retry the unhandleable instruction if it faults on the
>> @@ -4786,13 +4784,55 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
>> * retry instruction -> write #PF -> emulation fail -> retry
>> * instruction -> ...
>> */
>> - pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
>> - if (!is_error_noslot_pfn(pfn)) {
>> - kvm_release_pfn_clean(pfn);
>> + pfn = gfn_to_pfn(vcpu->kvm, gfn);
>> +
>> + /*
>> + * If the instruction failed on the error pfn, it can not be fixed,
>> + * report the error to userspace.
>> + */
>> + if (is_error_noslot_pfn(pfn))
>> + return false;
>> +
>> + kvm_release_pfn_clean(pfn);
>> +
>> + /* The instructions are well-emulated on direct mmu. */
>> + if (vcpu->arch.mmu.direct_map) {
>> + unsigned int indirect_shadow_pages;
>> +
>> + spin_lock(&vcpu->kvm->mmu_lock);
>> + indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
>> + spin_unlock(&vcpu->kvm->mmu_lock);
>> +
>> + if (indirect_shadow_pages)
>> + kvm_mmu_unprotect_page(vcpu->kvm, gfn);
>> +
>> return true;
>> }
>>
>> - return false;
>> +again:
>> + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
>> +
>> + /*
>> + * The instruction emulation is caused by fault access on cr2.
>> + * After unprotect the target page, we call
>> + * vcpu->arch.mmu.page_fault to fix the mapping of cr2. If it
>> + * return 1, mmu can not fix the mapping, we should report the
>> + * error, otherwise it is good to return to guest and let it
>> + * re-execute the instruction again.
>> + *
>> + * page_fault_count is used to avoid the race on other vcpus,
>> + * since after we unprotect the target page, other cpu can enter
>> + * page fault path and let the page be write-protected again.
>> + */
>> + kvm_mmu_unprotect_page(vcpu->kvm, gfn);
>> + emulate = vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, false);
>> +
>> + /* The page fault path called above can increase the count. */
>> + if (page_fault_count + 1 !=
>> + ACCESS_ONCE(vcpu->kvm->arch.page_fault_count))
>> + goto again;
>> +
>> + return !emulate;
>> }
>>
>> static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
>
> Same comment as before: the only case where it should not attempt to
> emulate is when there is a condition which makes it impossible to fix
> (the information is available to detect that condition).
>
> The earlier suggestion
>
> "How about recording the gfn number for shadow pages that have been
> shadowed in the current pagefault run?"
>
> Was about that.
I think we can give it a try. Is this change good for you, Marcelo?
[eric@localhost kvm]$ git diff
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 01d7c2a..e3d0001 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4359,24 +4359,34 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
return nr_mmu_pages;
}
-int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
+void kvm_mmu_get_sp_hierarchy(struct kvm_vcpu *vcpu, u64 addr,
+ struct kvm_mmu_sp_hierarchy *hierarchy)
{
struct kvm_shadow_walk_iterator iterator;
u64 spte;
- int nr_sptes = 0;
+
+ hierarchy->max_level = hierarchy->nr_levels = 0;
walk_shadow_page_lockless_begin(vcpu);
for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
- sptes[iterator.level-1] = spte;
- nr_sptes++;
+ struct kvm_mmu_page *sp = page_header(__pa(iterator.sptep));
+
+ if (hierarchy->indirect_only && sp->role.direct)
+ break;
+
+ if (!hierarchy->max_level)
+ hierarchy->max_level = iterator.level;
+
+ hierarchy->shadow_gfns[iterator.level-1] = sp->gfn;
+ hierarchy->sptes[iterator.level-1] = spte;
+ hierarchy->nr_levels++;
+
if (!is_shadow_present_pte(spte))
break;
}
walk_shadow_page_lockless_end(vcpu);
-
- return nr_sptes;
}
-EXPORT_SYMBOL_GPL(kvm_mmu_get_spte_hierarchy);
+EXPORT_SYMBOL_GPL(kvm_mmu_get_sp_hierarchy);
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6987108..d7ba397 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -50,7 +50,17 @@
#define PFERR_RSVD_MASK (1U << 3)
#define PFERR_FETCH_MASK (1U << 4)
-int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
+struct kvm_mmu_sp_hierarchy {
+ int max_level;
+ int nr_levels;
+ bool indirect_only;
+
+ u64 sptes[PT64_ROOT_LEVEL];
+ gfn_t shadow_gfns[PT64_ROOT_LEVEL];
+};
+
+void kvm_mmu_get_sp_hierarchy(struct kvm_vcpu *vcpu, u64 addr,
+ struct kvm_mmu_sp_hierarchy *hierarchy);
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d14bb12..9c60f8c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4984,8 +4984,8 @@ static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
- u64 sptes[4];
- int nr_sptes, i, ret;
+ struct kvm_mmu_sp_hierarchy hierarchy;
+ int i, ret;
gpa_t gpa;
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
@@ -5001,10 +5001,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
printk(KERN_ERR "EPT: Misconfiguration.\n");
printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
- nr_sptes = kvm_mmu_get_spte_hierarchy(vcpu, gpa, sptes);
-
- for (i = PT64_ROOT_LEVEL; i > PT64_ROOT_LEVEL - nr_sptes; --i)
- ept_misconfig_inspect_spte(vcpu, sptes[i-1], i);
+ kvm_mmu_get_sp_hierarchy(vcpu, gpa, &hierarchy);
+
+ for (i = hierarchy.max_level; hierarchy.nr_levels; hierarchy.nr_levels--, i--)
+ ept_misconfig_inspect_spte(vcpu, hierarchy.sptes[i-1], i);
vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1056106..cee6242 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4755,11 +4755,11 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
{
+ struct kvm_mmu_sp_hierarchy hierarchy;
gpa_t gpa = cr2;
gfn_t gfn;
pfn_t pfn;
- u64 page_fault_count;
- int emulate;
+ int level;
if (!vcpu->arch.mmu.direct_map) {
/*
@@ -4809,30 +4809,14 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
return true;
}
-again:
- page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
+ hierarchy.indirect_only = true;
+ kvm_mmu_get_sp_hierarchy(vcpu, cr2, &hierarchy);
- /*
- * The instruction emulation is caused by fault access on cr2.
- * After unprotect the target page, we call
- * vcpu->arch.mmu.page_fault to fix the mapping of cr2. If it
- * return 1, mmu can not fix the mapping, we should report the
- * error, otherwise it is good to return to guest and let it
- * re-execute the instruction again.
- *
- * page_fault_count is used to avoid the race on other vcpus,
- * since after we unprotect the target page, other cpu can enter
- * page fault path and let the page be write-protected again.
- */
- kvm_mmu_unprotect_page(vcpu->kvm, gfn);
- emulate = vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, false);
-
- /* The page fault path called above can increase the count. */
- if (page_fault_count + 1 !=
- ACCESS_ONCE(vcpu->kvm->arch.page_fault_count))
- goto again;
+ for (level = hierarchy.max_level; hierarchy.nr_levels > 0; level--, hierarchy.nr_levels--)
+ if (hierarchy.shadow_gfns[level - 1] == gfn)
+ return false;
- return !emulate;
+ return true;
}
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
2012-12-11 23:36 ` [PATCH v2 0/5] " Marcelo Tosatti
@ 2012-12-12 20:05 ` Xiao Guangrong
2012-12-13 22:54 ` Marcelo Tosatti
0 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-12 20:05 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
On 12/12/2012 07:36 AM, Marcelo Tosatti wrote:
> On Mon, Dec 10, 2012 at 05:11:35PM +0800, Xiao Guangrong wrote:
>> Changelog:
>> There are some changes from Marcelo and Gleb's review, thank you all!
>> - access indirect_shadow_pages in the protection of mmu-lock
>> - fix the issue when unhandleable instruction access on large page
>> - add a new test case for large page
>>
>> The current reexecute_instruction can not well detect the failed instruction
>> emulation. It allows guest to retry all the instructions except it accesses
>> on error pfn.
>>
>> For example, these cases can not be detected:
>> - for tdp used
>> currently, it refused to retry all instructions. If nested npt is used, the
>> emulation may be caused by shadow page, it can be fixed by unshadow the
>> shadow page.
>>
>> - for shadow mmu
>> some cases are nested-write-protect, for example, if the page we want to
>> write is used as PDE but it chains to itself. Under this case, we should
>> stop the emulation and report the case to userspace.
>>
>> There are two test cases based on kvm-unit-test can trigger a infinite loop on
>> current code (ept = 0), after this patchset, it can report the error to Qemu.
>>
>> Subject: [PATCH] access test: test unhandleable instruction
>>
>> Test the instruction which can not be handled by kvm
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>
> Please submit the test for inclusion. There should be some way to make
> it fail..
Yes.
But it is not easy. If the test cases run normally, kvm will report an error to Qemu,
then Qemu will exit the vcpu thread after dumping the vcpu state.
We need to do something to let the guest be aware that the error report was triggered.
I guess we can add an option to Qemu, say '-notify-guest', and allow Qemu to inject #GP
into the guest with a special ERROR_CODE if the error is reported.
> program a timer interrupt and #GP?
Could you please explain the detail?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table
2012-12-12 19:23 ` Xiao Guangrong
@ 2012-12-13 22:37 ` Marcelo Tosatti
0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-13 22:37 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Thu, Dec 13, 2012 at 03:23:26AM +0800, Xiao Guangrong wrote:
> On 12/12/2012 08:57 AM, Marcelo Tosatti wrote:
> > On Mon, Dec 10, 2012 at 05:13:03PM +0800, Xiao Guangrong wrote:
> >> We have two issues in current code:
> >> - if target gfn is used as its page table, guest will refault then kvm will use
> >> small page size to map it. We need two #PF to fix its shadow page table
> >>
> >> - sometimes, say a exception is triggered during vm-exit caused by #PF
> >> (see handle_exception() in vmx.c), we remove all the shadow pages shadowed
> >> by the target gfn before go into page fault path, it will cause infinite
> >> loop:
> >> delete shadow pages shadowed by the gfn -> try to use large page size to map
> >> the gfn -> retry the access ->...
> >>
> >> To fix these, We can adjust page size early if the target gfn is used as page
> >> table
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >> arch/x86/kvm/mmu.c | 13 ++++---------
> >> arch/x86/kvm/paging_tmpl.h | 33 ++++++++++++++++++++++++++++++++-
> >> 2 files changed, 36 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 2a3c890..54fc61e 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -2380,15 +2380,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> >> if (pte_access & ACC_WRITE_MASK) {
> >>
> >> /*
> >> - * There are two cases:
> >> - * - the one is other vcpu creates new sp in the window
> >> - * between mapping_level() and acquiring mmu-lock.
> >> - * - the another case is the new sp is created by itself
> >> - * (page-fault path) when guest uses the target gfn as
> >> - * its page table.
> >> - * Both of these cases can be fixed by allowing guest to
> >> - * retry the access, it will refault, then we can establish
> >> - * the mapping by using small page.
> >> + * Other vcpu creates new sp in the window between
> >> + * mapping_level() and acquiring mmu-lock. We can
> >> + * allow guest to retry the access, the mapping can
> >> + * be fixed if guest refault.
> >> */
> >> if (level > PT_PAGE_TABLE_LEVEL &&
> >> has_wrprotected_page(vcpu->kvm, gfn, level))
> >> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >> index ec481e9..32d77ff 100644
> >> --- a/arch/x86/kvm/paging_tmpl.h
> >> +++ b/arch/x86/kvm/paging_tmpl.h
> >> @@ -491,6 +491,36 @@ out_gpte_changed:
> >> return 0;
> >> }
> >>
> >> + /*
> >> + * To see whether the mapped gfn can write its page table in the current
> >> + * mapping.
> >> + *
> >> + * It is the helper function of FNAME(page_fault). When guest uses large page
> >> + * size to map the writable gfn which is used as current page table, we should
> >> + * force kvm to use small page size to map it because new shadow page will be
> >> + * created when kvm establishes shadow page table that stop kvm using large
> >> + * page size. Do it early can avoid unnecessary #PF and emulation.
> >> + *
> >> + * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok
> >> + * since the PDPT is always shadowed, that means, we can not use large page
> >> + * size to map the gfn which is used as PDPT.
> >> + */
> >> +static bool
> >> +FNAME(mapped_gfn_can_write_current_pagetable)(struct guest_walker *walker)
> >> +{
> >> + int level;
> >> + gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker->level) - 1);
> >> +
> >> + if (!(walker->pte_access & ACC_WRITE_MASK))
> >> + return false;
> >> +
> >> + for (level = walker->level; level <= walker->max_level; level++)
> >> + if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))
> >> + return true;
> >
> > XOR won't work. Just check with sums and integer comparison, ie.
> > walker->gfn + KVM_PAGES_PER_HPAGE(walker->level).
>
> It can not work since walker->gfn is not large-page-size aligned. For example,
> guest uses large page size to map 0x123000000 to physical address 0-2M, if
> guest faults on 0x123001000, walker->gfn = 0x1000.
>
> The code "if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))" is the
> same as "if (walker->gfn & mask == walker->table_gfn[level - 1] & mask)" - if
> any page in the large-page area is used as a page table, we should use 4K page size
> to fix it.
>
> In the above example, if table_gfn is in the area [0, 2M), kvm is forced to use
> 4k page size.
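The masked comparison described above can be exercised with a tiny standalone program (a sketch with illustrative names only; PAGES_PER_HPAGE stands in for KVM_PAGES_PER_HPAGE at the 2M level and is not the patch's code):

/*
 * gfn_in_same_large_page() returns true when 'gfn' and 'table_gfn' fall
 * inside the same large-page frame, i.e. the guest page table lives in
 * the area covered by the large mapping.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gfn_t;

#define PAGES_PER_HPAGE (1ULL << 9)	/* 4K pages per 2M large page */

static bool gfn_in_same_large_page(gfn_t gfn, gfn_t table_gfn)
{
	gfn_t mask = ~(PAGES_PER_HPAGE - 1);

	/* identical to: (gfn & mask) == (table_gfn & mask) */
	return !((gfn ^ table_gfn) & mask);
}

int main(void)
{
	/* fault on gfn 1, page table at gfn 0: same 2M frame -> prints 1 */
	printf("%d\n", gfn_in_same_large_page(0x1, 0x0));
	/* page table beyond the first 2M: different frame -> prints 0 */
	printf("%d\n", gfn_in_same_large_page(0x1, 0x200));
	return 0;
}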
Right, I misread it.
> > Moreover, its confusing to have it checked at this level. What about
> > doing at reexecute_instruction?
>
> Hmm, this patch is trying to fix a bug described in the changelog:
>
> ======
> - sometimes, say a exception is triggered during vm-exit caused by #PF
> (see handle_exception() in vmx.c), we remove all the shadow pages shadowed
> by the target gfn before go into page fault path, it will cause infinite
> loop:
> delete shadow pages shadowed by the gfn -> try to use large page size to map
> the gfn -> retry the access ->...
> ======
>
> Which is caused by this code:
>
> if (is_page_fault(intr_info)) {
> /* EPT won't cause page fault directly */
> BUG_ON(enable_ept);
> cr2 = vmcs_readl(EXIT_QUALIFICATION);
> trace_kvm_page_fault(cr2, error_code);
>
> if (kvm_event_needs_reinjection(vcpu))
> kvm_mmu_unprotect_page_virt(vcpu, cr2);
> return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
> }
>
> This bug is introduced in commit c219346325.
>
> Another way to fix it is doing this change:
> @@ -2395,7 +2395,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> */
> if (level > PT_PAGE_TABLE_LEVEL &&
> has_wrprotected_page(vcpu->kvm, gfn, level))
> - goto done;
> + return 1;
>
> The disadvantage of this approach is that it causes unnecessary emulation. For example,
> if 0-2M is mapped in the guest and only page 0 is used as a page table, any write to
> [4K, 2M) will need to be emulated.
>
> Your idea?
OK, I understand now.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
2012-12-12 20:05 ` Xiao Guangrong
@ 2012-12-13 22:54 ` Marcelo Tosatti
2012-12-14 4:50 ` Xiao Guangrong
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-13 22:54 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Thu, Dec 13, 2012 at 04:05:55AM +0800, Xiao Guangrong wrote:
> On 12/12/2012 07:36 AM, Marcelo Tosatti wrote:
> > On Mon, Dec 10, 2012 at 05:11:35PM +0800, Xiao Guangrong wrote:
> >> Changelog:
> >> There are some changes from Marcelo and Gleb's review, thank you all!
> >> - access indirect_shadow_pages in the protection of mmu-lock
> >> - fix the issue when unhandleable instruction access on large page
> >> - add a new test case for large page
> >>
> >> The current reexecute_instruction can not well detect the failed instruction
> >> emulation. It allows guest to retry all the instructions except it accesses
> >> on error pfn.
> >>
> >> For example, these cases can not be detected:
> >> - for tdp used
> >> currently, it refused to retry all instructions. If nested npt is used, the
> >> emulation may be caused by shadow page, it can be fixed by unshadow the
> >> shadow page.
> >>
> >> - for shadow mmu
> >> some cases are nested-write-protect, for example, if the page we want to
> >> write is used as PDE but it chains to itself. Under this case, we should
> >> stop the emulation and report the case to userspace.
> >>
> >> There are two test cases based on kvm-unit-test can trigger a infinite loop on
> >> current code (ept = 0), after this patchset, it can report the error to Qemu.
> >>
> >> Subject: [PATCH] access test: test unhandleable instruction
> >>
> >> Test the instruction which can not be handled by kvm
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >
> > Please submit the test for inclusion. There should be some way to make
> > it fail..
>
> Yes.
>
> But it is not easy. If the test cases run normally, kvm will report a error to Qemu
> then Qemu will exit the vcpu thread after dumping the vcpu state.
>
> We need to do something to let guest can be aware that the error report is triggered.
> I guess we can add a option in Qemu, say '-notify-guest' and allow Qemu to inject #GP
> to guest with a special ERROR_CODE if error is reported.
>
> > program a timer interrupt and #GP?
>
> Could you please explain the detail?
Before the instruction which writes continuously to the pagetable, arm,
say, the lapic timer. #GP in the interrupt handler and have the test report failure.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 5/5] KVM: x86: improve reexecute_instruction
2012-12-12 19:29 ` Xiao Guangrong
@ 2012-12-13 23:02 ` Marcelo Tosatti
2012-12-14 3:40 ` Xiao Guangrong
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-13 23:02 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Thu, Dec 13, 2012 at 03:29:21AM +0800, Xiao Guangrong wrote:
> On 12/12/2012 09:09 AM, Marcelo Tosatti wrote:
> > On Mon, Dec 10, 2012 at 05:14:47PM +0800, Xiao Guangrong wrote:
> >> The current reexecute_instruction can not well detect the failed instruction
> >> emulation. It allows guest to retry all the instructions except it accesses
> >> on error pfn
> >>
> >> For example, some cases are nested-write-protect - if the page we want to
> >> write is used as PDE but it chains to itself. Under this case, we should
> >> stop the emulation and report the case to userspace
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >> arch/x86/include/asm/kvm_host.h | 2 +
> >> arch/x86/kvm/paging_tmpl.h | 2 +
> >> arch/x86/kvm/x86.c | 82 +++++++++++++++++++++++++++++----------
> >> 3 files changed, 65 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index dc87b65..8d01c02 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -575,6 +575,8 @@ struct kvm_arch {
> >> u64 hv_guest_os_id;
> >> u64 hv_hypercall;
> >>
> >> + /* synchronizing reexecute_instruction and page fault path. */
> >> + u64 page_fault_count;
> >> #ifdef CONFIG_KVM_MMU_AUDIT
> >> int audit_point;
> >> #endif
> >> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >> index 32d77ff..85b8e0e 100644
> >> --- a/arch/x86/kvm/paging_tmpl.h
> >> +++ b/arch/x86/kvm/paging_tmpl.h
> >> @@ -614,6 +614,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
> >> if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> >> goto out_unlock;
> >>
> >> + vcpu->kvm->arch.page_fault_count++;
> >> +
> >> kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
> >> kvm_mmu_free_some_pages(vcpu);
> >> if (!force_pt_level)
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 3796f8c..5677869 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -4756,29 +4756,27 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
> >> static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
> >> {
> >> gpa_t gpa = cr2;
> >> + gfn_t gfn;
> >> pfn_t pfn;
> >> - unsigned int indirect_shadow_pages;
> >> -
> >> - spin_lock(&vcpu->kvm->mmu_lock);
> >> - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> >> - spin_unlock(&vcpu->kvm->mmu_lock);
> >> -
> >> - if (!indirect_shadow_pages)
> >> - return false;
> >> + u64 page_fault_count;
> >> + int emulate;
> >>
> >> if (!vcpu->arch.mmu.direct_map) {
> >> - gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
> >> + /*
> >> + * Write permission should be allowed since only
> >> + * write access need to be emulated.
> >> + */
> >> + gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
> >> +
> >> + /*
> >> + * If the mapping is invalid in guest, let cpu retry
> >> + * it to generate fault.
> >> + */
> >> if (gpa == UNMAPPED_GVA)
> >> - return true; /* let cpu generate fault */
> >> + return true;
> >> }
> >>
> >> - /*
> >> - * if emulation was due to access to shadowed page table
> >> - * and it failed try to unshadow page and re-enter the
> >> - * guest to let CPU execute the instruction.
> >> - */
> >> - if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
> >> - return true;
> >> + gfn = gpa_to_gfn(gpa);
> >>
> >> /*
> >> * Do not retry the unhandleable instruction if it faults on the
> >> @@ -4786,13 +4784,55 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
> >> * retry instruction -> write #PF -> emulation fail -> retry
> >> * instruction -> ...
> >> */
> >> - pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> >> - if (!is_error_noslot_pfn(pfn)) {
> >> - kvm_release_pfn_clean(pfn);
> >> + pfn = gfn_to_pfn(vcpu->kvm, gfn);
> >> +
> >> + /*
> >> + * If the instruction failed on the error pfn, it can not be fixed,
> >> + * report the error to userspace.
> >> + */
> >> + if (is_error_noslot_pfn(pfn))
> >> + return false;
> >> +
> >> + kvm_release_pfn_clean(pfn);
> >> +
> >> + /* The instructions are well-emulated on direct mmu. */
> >> + if (vcpu->arch.mmu.direct_map) {
> >> + unsigned int indirect_shadow_pages;
> >> +
> >> + spin_lock(&vcpu->kvm->mmu_lock);
> >> + indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> >> + spin_unlock(&vcpu->kvm->mmu_lock);
> >> +
> >> + if (indirect_shadow_pages)
> >> + kvm_mmu_unprotect_page(vcpu->kvm, gfn);
> >> +
> >> return true;
> >> }
> >>
> >> - return false;
> >> +again:
> >> + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
> >> +
> >> + /*
> >> + * The instruction emulation is caused by fault access on cr2.
> >> + * After unprotect the target page, we call
> >> + * vcpu->arch.mmu.page_fault to fix the mapping of cr2. If it
> >> + * return 1, mmu can not fix the mapping, we should report the
> >> + * error, otherwise it is good to return to guest and let it
> >> + * re-execute the instruction again.
> >> + *
> >> + * page_fault_count is used to avoid the race on other vcpus,
> >> + * since after we unprotect the target page, other cpu can enter
> >> + * page fault path and let the page be write-protected again.
> >> + */
> >> + kvm_mmu_unprotect_page(vcpu->kvm, gfn);
> >> + emulate = vcpu->arch.mmu.page_fault(vcpu, cr2, PFERR_WRITE_MASK, false);
> >> +
> >> + /* The page fault path called above can increase the count. */
> >> + if (page_fault_count + 1 !=
> >> + ACCESS_ONCE(vcpu->kvm->arch.page_fault_count))
> >> + goto again;
> >> +
> >> + return !emulate;
> >> }
> >>
> >> static bool retry_instruction(struct x86_emulate_ctxt *ctxt,\
> >
> > Same comment as before: the only case where it should not attempt to
> > emulate is when there is a condition which makes it impossible to fix
> > (the information is available to detect that condition).
> >
> > The earlier suggestion
> >
> > "How about recording the gfn number for shadow pages that have been
> > shadowed in the current pagefault run?"
> >
> > Was about that.
>
> I think we can have a try. Is this change good to you, Marcelo?
>
> [eric@localhost kvm]$ git diff
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 01d7c2a..e3d0001 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4359,24 +4359,34 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
> return nr_mmu_pages;
> }
>
> -int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
> +void kvm_mmu_get_sp_hierarchy(struct kvm_vcpu *vcpu, u64 addr,
> + struct kvm_mmu_sp_hierarchy *hierarchy)
> {
> struct kvm_shadow_walk_iterator iterator;
> u64 spte;
> - int nr_sptes = 0;
> +
> + hierarchy->max_level = hierarchy->nr_levels = 0;
>
> walk_shadow_page_lockless_begin(vcpu);
> for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
> - sptes[iterator.level-1] = spte;
> - nr_sptes++;
> + struct kvm_mmu_page *sp = page_header(__pa(iterator.sptep));
> +
> + if (hierarchy->indirect_only && sp->role.direct)
> + break;
> +
> + if (!hierarchy->max_level)
> + hierarchy->max_level = iterator.level;
> +
> + hierarchy->shadow_gfns[iterator.level-1] = sp->gfn;
> + hierarchy->sptes[iterator.level-1] = spte;
> + hierarchy->nr_levels++;
> +
> if (!is_shadow_present_pte(spte))
> break;
> }
> walk_shadow_page_lockless_end(vcpu);
> -
> - return nr_sptes;
> }
Record the gfns shadowed in the current pagefault run in a struct in the vcpu, along with cr2.
Then validate against them.
That way it's guaranteed it's not some other vcpu.
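For concreteness, a rough sketch of that record-and-validate scheme (hypothetical names and layout; a sketch of the suggestion, not actual KVM code):

#include <stdbool.h>

typedef unsigned long gfn_t;		/* stand-in for the kernel's gfn_t */

#define LAST_PF_MAX_GFNS 4		/* one slot per paging level */

struct last_pf_record {
	unsigned long cr2;		/* fault address of the last run */
	int nr_gfns;			/* gfns recorded in that run */
	gfn_t gfns[LAST_PF_MAX_GFNS];
};

/* page fault path: remember each page-table gfn shadowed for this cr2 */
static void last_pf_record_gfn(struct last_pf_record *rec,
			       unsigned long cr2, gfn_t table_gfn)
{
	if (rec->cr2 != cr2) {
		rec->cr2 = cr2;
		rec->nr_gfns = 0;
	}

	if (rec->nr_gfns < LAST_PF_MAX_GFNS)
		rec->gfns[rec->nr_gfns++] = table_gfn;
}

/*
 * reexecute_instruction(): the retry is hopeless only when the written
 * gfn is one of the page tables shadowed while handling this very fault
 * on this vcpu.
 */
static bool last_pf_hits_own_table(struct last_pf_record *rec,
				   unsigned long cr2, gfn_t gfn)
{
	int i;

	if (rec->cr2 != cr2)
		return false;

	for (i = 0; i < rec->nr_gfns; i++)
		if (rec->gfns[i] == gfn)
			return true;

	return false;
}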
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 5/5] KVM: x86: improve reexecute_instruction
2012-12-13 23:02 ` Marcelo Tosatti
@ 2012-12-14 3:40 ` Xiao Guangrong
0 siblings, 0 replies; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-14 3:40 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
On 12/14/2012 07:02 AM, Marcelo Tosatti wrote:
>>> Same comment as before: the only case where it should not attempt to
>>> emulate is when there is a condition which makes it impossible to fix
>>> (the information is available to detect that condition).
>>>
>>> The earlier suggestion
>>>
>>> "How about recording the gfn number for shadow pages that have been
>>> shadowed in the current pagefault run?"
>>>
>>> Was about that.
>>
>> I think we can have a try. Is this change good to you, Marcelo?
>>
>> [eric@localhost kvm]$ git diff
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 01d7c2a..e3d0001 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4359,24 +4359,34 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
>> return nr_mmu_pages;
>> }
>>
>> -int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
>> +void kvm_mmu_get_sp_hierarchy(struct kvm_vcpu *vcpu, u64 addr,
>> + struct kvm_mmu_sp_hierarchy *hierarchy)
>> {
>> struct kvm_shadow_walk_iterator iterator;
>> u64 spte;
>> - int nr_sptes = 0;
>> +
>> + hierarchy->max_level = hierarchy->nr_levels = 0;
>>
>> walk_shadow_page_lockless_begin(vcpu);
>> for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
>> - sptes[iterator.level-1] = spte;
>> - nr_sptes++;
>> + struct kvm_mmu_page *sp = page_header(__pa(iterator.sptep));
>> +
>> + if (hierarchy->indirect_only && sp->role.direct)
>> + break;
>> +
>> + if (!hierarchy->max_level)
>> + hierarchy->max_level = iterator.level;
>> +
>> + hierarchy->shadow_gfns[iterator.level-1] = sp->gfn;
>> + hierarchy->sptes[iterator.level-1] = spte;
>> + hierarchy->nr_levels++;
>> +
>> if (!is_shadow_present_pte(spte))
>> break;
>> }
>> walk_shadow_page_lockless_end(vcpu);
>> -
>> - return nr_sptes;
>> }
>
> Record gfns while shadowing in the vcpu struct, in a struct, along with cr2.
> Then validate
> That way its guaranteed its not some other vcpu.
Okay, I will try this way. :)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
2012-12-13 22:54 ` Marcelo Tosatti
@ 2012-12-14 4:50 ` Xiao Guangrong
2012-12-15 1:05 ` Marcelo Tosatti
0 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2012-12-14 4:50 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Gleb Natapov, LKML, KVM
On 12/14/2012 06:54 AM, Marcelo Tosatti wrote:
> On Thu, Dec 13, 2012 at 04:05:55AM +0800, Xiao Guangrong wrote:
>> On 12/12/2012 07:36 AM, Marcelo Tosatti wrote:
>>> On Mon, Dec 10, 2012 at 05:11:35PM +0800, Xiao Guangrong wrote:
>>>> Changelog:
>>>> There are some changes from Marcelo and Gleb's review, thank you all!
>>>> - access indirect_shadow_pages in the protection of mmu-lock
>>>> - fix the issue when unhandleable instruction access on large page
>>>> - add a new test case for large page
>>>>
>>>> The current reexecute_instruction can not well detect the failed instruction
>>>> emulation. It allows guest to retry all the instructions except it accesses
>>>> on error pfn.
>>>>
>>>> For example, these cases can not be detected:
>>>> - for tdp used
>>>> currently, it refused to retry all instructions. If nested npt is used, the
>>>> emulation may be caused by shadow page, it can be fixed by unshadow the
>>>> shadow page.
>>>>
>>>> - for shadow mmu
>>>> some cases are nested-write-protect, for example, if the page we want to
>>>> write is used as PDE but it chains to itself. Under this case, we should
>>>> stop the emulation and report the case to userspace.
>>>>
>>>> There are two test cases based on kvm-unit-test can trigger a infinite loop on
>>>> current code (ept = 0), after this patchset, it can report the error to Qemu.
>>>>
>>>> Subject: [PATCH] access test: test unhandleable instruction
>>>>
>>>> Test the instruction which can not be handled by kvm
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>
>>> Please submit the test for inclusion. There should be some way to make
>>> it fail..
>>
>> Yes.
>>
>> But it is not easy. If the test cases run normally, kvm will report a error to Qemu
>> then Qemu will exit the vcpu thread after dumping the vcpu state.
>>
>> We need to do something to let guest can be aware that the error report is triggered.
>> I guess we can add a option in Qemu, say '-notify-guest' and allow Qemu to inject #GP
>> to guest with a special ERROR_CODE if error is reported.
>>
>>> program a timer interrupt and #GP?
>>
>> Could you please explain the detail?
>
> Before the instruction which writes continuously to the pagetable, arm
> say lapic timer. #GP on the interrupt handler and test with failure.
Sorry, I am confused about this. After Qemu exits due to KVM_EXIT_INTERNAL_ERROR,
the vm is stopped, so the interrupt cannot be injected into the guest. Or did I miss something?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
2012-12-14 4:50 ` Xiao Guangrong
@ 2012-12-15 1:05 ` Marcelo Tosatti
2012-12-23 11:46 ` Gleb Natapov
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2012-12-15 1:05 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, LKML, KVM
On Fri, Dec 14, 2012 at 12:50:09PM +0800, Xiao Guangrong wrote:
> >>> program a timer interrupt and #GP?
> >>
> >> Could you please explain the detail?
> >
> > Before the instruction which writes continuously to the pagetable, arm
> > say lapic timer. #GP on the interrupt handler and test with failure.
>
> Sorry, I am confused about this. After Qemu exits due to KVM_EXIT_INTERNAL_ERROR,
> the vm is stopped then interrupt can not be injected to guest. Or i missed something?
Yes, but without the fixed kernel the kvm-unit-test executable loops continuously.
Perhaps it's more appropriate to fix this generically, nevermind.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] KVM: x86: improve reexecute_instruction
2012-12-15 1:05 ` Marcelo Tosatti
@ 2012-12-23 11:46 ` Gleb Natapov
0 siblings, 0 replies; 21+ messages in thread
From: Gleb Natapov @ 2012-12-23 11:46 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Xiao Guangrong, LKML, KVM
On Fri, Dec 14, 2012 at 11:05:46PM -0200, Marcelo Tosatti wrote:
> On Fri, Dec 14, 2012 at 12:50:09PM +0800, Xiao Guangrong wrote:
> > >>> program a timer interrupt and #GP?
> > >>
> > >> Could you please explain the detail?
> > >
> > > Before the instruction which writes continuously to the pagetable, arm
> > > say lapic timer. #GP on the interrupt handler and test with failure.
> >
> > Sorry, I am confused about this. After Qemu exits due to KVM_EXIT_INTERNAL_ERROR,
> > the vm is stopped then interrupt can not be injected to guest. Or i missed something?
>
> Yes, but without fixed kernel kvm-unit test executable loops continuously.
> Perhaps its more appropriate to fix generically, nevermind.
This will not be the first test that makes kvm-unit-test hang on non-fixed
kernels.
--
Gleb.
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2012-12-23 11:46 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-10 9:11 [PATCH v2 0/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
2012-12-10 9:12 ` [PATCH v2 1/5] KVM: MMU: move adjusting pte access for softmmu to FNAME(page_fault) Xiao Guangrong
2012-12-11 23:47 ` Marcelo Tosatti
2012-12-12 18:53 ` Xiao Guangrong
2012-12-10 9:13 ` [PATCH v2 2/5] KVM: MMU: adjust page size early if gfn used as page table Xiao Guangrong
2012-12-12 0:57 ` Marcelo Tosatti
2012-12-12 19:23 ` Xiao Guangrong
2012-12-13 22:37 ` Marcelo Tosatti
2012-12-10 9:13 ` [PATCH v2 3/5] KVM: x86: clean up reexecute_instruction Xiao Guangrong
2012-12-10 9:14 ` [PATCH v2 4/5] KVM: x86: let reexecute_instruction work for tdp Xiao Guangrong
2012-12-10 9:14 ` [PATCH v2 5/5] KVM: x86: improve reexecute_instruction Xiao Guangrong
2012-12-12 1:09 ` Marcelo Tosatti
2012-12-12 19:29 ` Xiao Guangrong
2012-12-13 23:02 ` Marcelo Tosatti
2012-12-14 3:40 ` Xiao Guangrong
2012-12-11 23:36 ` [PATCH v2 0/5] " Marcelo Tosatti
2012-12-12 20:05 ` Xiao Guangrong
2012-12-13 22:54 ` Marcelo Tosatti
2012-12-14 4:50 ` Xiao Guangrong
2012-12-15 1:05 ` Marcelo Tosatti
2012-12-23 11:46 ` Gleb Natapov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).