* large page support for kvm
@ 2008-01-29 17:20 Avi Kivity
From: Avi Kivity @ 2008-01-29 17:20 UTC (permalink / raw)
To: kvm-devel
The npt patches started me thinking about large page support (2MB/4MB
pages), and I think we can implement them even when npt/ept are not
available.
Here's a rough sketch of my proposal:
- For every memory slot, allocate an array containing one int for every
potential large page included within that memory slot. Each entry in
the array contains the number of write-protected 4KB pages within the
large page frame corresponding to that entry.
For example, if we have a memory slot for gpas 1MB-1GB, we'd have an
array of size 511, corresponding to the 511 2MB pages from 2MB upwards.
If we shadow a pagetable at address 4MB+8KB, we'd increment the entry
corresponding to the large page at 4MB. When we unshadow that page,
decrement the entry.
- If we attempt to shadow a large page (either a guest pse pte, or a
real-mode pseudo pte), we check if the host page is a large page. If
so, we also check the write-protect count array. If the result is zero,
we create a shadow pse pte.
- Whenever we write-protect a page, also zap any large-page mappings for
that page. This means rmap will need some extension to handle pde rmaps
in addition to pte rmaps.
- qemu is extended to have a command-line option to use large pages to
back guest memory.
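The write-protect count array and the fault-time decision described above can be sketched in a few lines of C. This is an illustrative simplification, not the eventual kernel code: the `struct memslot`, `wp_slot` and `can_use_large_spte` names are hypothetical, and it assumes the slot's base gfn is 2MB aligned (the real implementation has to cope with unaligned slots).

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define HPAGE_SHIFT 21                                  /* 2MB large pages */
#define HPAGE_RATIO (1UL << (HPAGE_SHIFT - PAGE_SHIFT)) /* 512 4KB pages */

/* Simplified memslot: one write-protect counter per potential large page. */
struct memslot {
	unsigned long base_gfn;   /* first 4KB frame covered by the slot */
	int wp_count[512];        /* write-protected 4KB pages per 2MB frame */
};

static int *wp_slot(struct memslot *s, unsigned long gfn)
{
	return &s->wp_count[(gfn - s->base_gfn) / HPAGE_RATIO];
}

/* Called when a guest page table at gfn is shadowed (write-protected). */
static void account_shadowed(struct memslot *s, unsigned long gfn)
{
	++*wp_slot(s, gfn);
}

/* Called when that shadow page is zapped again. */
static void unaccount_shadowed(struct memslot *s, unsigned long gfn)
{
	--*wp_slot(s, gfn);
}

/*
 * A shadow pse pte may be installed only if the backing host page is a
 * large page and no 4KB page inside the frame is write-protected.
 */
static int can_use_large_spte(struct memslot *s, unsigned long gfn,
			      int host_page_is_large)
{
	return host_page_is_large && *wp_slot(s, gfn) == 0;
}
```

Shadowing a pagetable anywhere inside a 2MB frame bumps that frame's counter, which then blocks large sptes for the frame until the shadow page goes away again.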
Large pages should improve performance significantly, both with
traditional shadow and npt/ept. Hopefully we will have transparent
Linux support for them one day.
--
error compiling committee.c: too many arguments to function
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
* Re: large page support for kvm
From: Joerg Roedel @ 2008-01-30 18:40 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel

On Tue, Jan 29, 2008 at 07:20:12PM +0200, Avi Kivity wrote:
> Here's a rough sketch of my proposal:
>
> - For every memory slot, allocate an array containing one int for every
> potential large page included within that memory slot.  Each entry in
> the array contains the number of write-protected 4KB pages within the
> large page frame corresponding to that entry.
>
> For example, if we have a memory slot for gpas 1MB-1GB, we'd have an
> array of size 511, corresponding to the 511 2MB pages from 2MB upwards.
> If we shadow a pagetable at address 4MB+8KB, we'd increment the entry
> corresponding to the large page at 4MB.  When we unshadow that page,
> decrement the entry.

You need to take care that the 2MB gpa is aligned to 2MB host physical
to be able to map it correctly with a large pte. So maybe we need two
memslots for 1MB-1GB: one for 1MB-2MB using normal 4KB pages, and one
from 2MB-1GB which can be allocated using HugeTLBfs.

> - If we attempt to shadow a large page (either a guest pse pte, or a
> real-mode pseudo pte), we check if the host page is a large page.  If
> so, we also check the write-protect count array.  If the result is zero,
> we create a shadow pse pte.
>
> - Whenever we write-protect a page, also zap any large-page mappings for
> that page.  This means rmap will need some extension to handle pde rmaps
> in addition to pte rmaps.

This sounds straightforward to me. All you need is a short value for
every potential large page, initialized with -1 if the host page is a
large page and with 0 otherwise. Every time this value reaches -1 we
can map the page with a large pte (and the guest maps with a large pte).
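The alignment concern raised here reduces to a simple predicate: a guest 2MB frame can only be mapped by one large pte if the host backing is also 2MB aligned. The sketch below is illustrative only; `large_mapping_possible` is not a function from the patch.

```c
#include <assert.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)   /* 2MB */

/*
 * A guest physical 2MB frame can share a single large pte with its host
 * backing only if both sit on a 2MB boundary; otherwise the frame spans
 * two host large pages and must fall back to 4KB mappings.
 */
static int large_mapping_possible(unsigned long gpa, unsigned long hva)
{
	return ((gpa | hva) & (HPAGE_SIZE - 1)) == 0;
}
```

A memslot that starts at 1MB, for instance, offsets every guest 2MB frame by 1MB inside its host backing, so no frame in it qualifies.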
> - qemu is extended to have a command-line option to use large pages to
> back guest memory.
>
> Large pages should improve performance significantly, both with
> traditional shadow and npt/ept.

Yes, I think that too. But with shadow paging it really depends on the
guest whether the performance increase is long-term. In a Linux guest,
for example, the direct-mapped memory will become fragmented over time
(together with the location of the page tables). So the number of
potential large page mappings will likely decrease over time.

But the situation is different when the Linux guest uses HugeTLBfs in
its userspace. We will always be able to map these pages using large
ptes if the guest physical memory is correctly aligned.

With Nested Paging (and EPT too) we will always have the benefit,
because we don't need to write-protect anything.

I really look forward to large page support in KVM. Maybe we reach the
95% VCPU performance mark, compared to native performance, with it :-)

> Hopefully we will have transparent Linux support for them one day.

Unlikely. As far as I know Linus doesn't like the idea...

Joerg

-- 
           | AMD Saxony Limited Liability Company & Co. KG
 Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    | Register Court Dresden: HRA 4896
 Research  | General Partner authorized to represent:
 Center    | AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
* Re: large page support for kvm
From: Avi Kivity @ 2008-01-31  5:44 UTC (permalink / raw)
To: Joerg Roedel; +Cc: kvm-devel

Joerg Roedel wrote:
> On Tue, Jan 29, 2008 at 07:20:12PM +0200, Avi Kivity wrote:
>
>> Here's a rough sketch of my proposal:
>>
>> - For every memory slot, allocate an array containing one int for every
>> potential large page included within that memory slot.  Each entry in
>> the array contains the number of write-protected 4KB pages within the
>> large page frame corresponding to that entry.
>>
>> For example, if we have a memory slot for gpas 1MB-1GB, we'd have an
>> array of size 511, corresponding to the 511 2MB pages from 2MB upwards.
>> If we shadow a pagetable at address 4MB+8KB, we'd increment the entry
>> corresponding to the large page at 4MB.  When we unshadow that page,
>> decrement the entry.
>
> You need to take care that the 2MB gpa is aligned to 2MB host physical
> to be able to map it correctly with a large pte. So maybe we need two
> memslots for 1MB-1GB. One for 1MB-2MB using normal 4KB pages and one
> from 2MB-1GB which can be allocated using HugeTLBfs.

Another option is to allocate all memory starting from address zero
using hugetlbfs, and pass 0-640K as one memslot and 1MB+ as another.
In any case the kernel needs to support both methods (e.g. it must
handle a memslot that starts in the middle of a large page).

>> - If we attempt to shadow a large page (either a guest pse pte, or a
>> real-mode pseudo pte), we check if the host page is a large page.  If
>> so, we also check the write-protect count array.  If the result is zero,
>> we create a shadow pse pte.
>>
>> - Whenever we write-protect a page, also zap any large-page mappings for
>> that page.  This means rmap will need some extension to handle pde rmaps
>> in addition to pte rmaps.
>
> This sounds straightforward to me. All you need is a short value for
> every potential large page and initialize it with -1 if the host page is
> a large page and with 0 otherwise. Every time this value reaches -1 we
> can map the page with a large pte (and the guest maps with large pte).

You don't know whether the host page is a large page in advance.  It
needs to be checked at page-fault time.

>> - qemu is extended to have a command-line option to use large pages to
>> back guest memory.
>>
>> Large pages should improve performance significantly, both with
>> traditional shadow and npt/ept.
>
> Yes, I think that too. But with shadow paging it really depends on the
> guest whether the performance increase is long-term. In a Linux guest,
> for example, the direct-mapped memory will become fragmented over
> time (together with the location of the page tables). So the
> number of potential large page mappings will likely decrease over
> time.

Yes, that's why it is important to be able to fail fast when checking
whether we can use a large spte.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.
* Re: large page support for kvm
From: Marcelo Tosatti @ 2008-02-11 15:49 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel

[-- Attachment #1: Type: text/plain, Size: 21703 bytes --]

On Thu, Jan 31, 2008 at 07:44:52AM +0200, Avi Kivity wrote:
> Joerg Roedel wrote:
> > On Tue, Jan 29, 2008 at 07:20:12PM +0200, Avi Kivity wrote:
> >
> >> Here's a rough sketch of my proposal:
> >>
> >> - For every memory slot, allocate an array containing one int for every
> >> potential large page included within that memory slot.  Each entry in
> >> the array contains the number of write-protected 4KB pages within the
> >> large page frame corresponding to that entry.
> >>
> >> For example, if we have a memory slot for gpas 1MB-1GB, we'd have an
> >> array of size 511, corresponding to the 511 2MB pages from 2MB upwards.
> >> If we shadow a pagetable at address 4MB+8KB, we'd increment the entry
> >> corresponding to the large page at 4MB.  When we unshadow that page,
> >> decrement the entry.
> >
> > You need to take care that the 2MB gpa is aligned to 2MB host physical
> > to be able to map it correctly with a large pte. So maybe we need two
> > memslots for 1MB-1GB. One for 1MB-2MB using normal 4KB pages and one
> > from 2MB-1GB which can be allocated using HugeTLBfs.
>
> Another option is to allocate all memory starting from address zero
> using hugetlbfs, and pass 0-640K as one memslot and 1MB+ as another.
> In any case the kernel needs to support both methods (e.g. it must
> handle a memslot that starts in the middle of a large page).
>
> >> - If we attempt to shadow a large page (either a guest pse pte, or a
> >> real-mode pseudo pte), we check if the host page is a large page.  If
> >> so, we also check the write-protect count array.  If the result is zero,
> >> we create a shadow pse pte.
> >>
> >> - Whenever we write-protect a page, also zap any large-page mappings for
> >> that page.  This means rmap will need some extension to handle pde rmaps
> >> in addition to pte rmaps.
> >
> > This sounds straightforward to me. All you need is a short value for
> > every potential large page and initialize it with -1 if the host page is
> > a large page and with 0 otherwise. Every time this value reaches -1 we
> > can map the page with a large pte (and the guest maps with large pte).
>
> You don't know whether the host page is a large page in advance.  It
> needs to be checked at page-fault time.
>
> >> - qemu is extended to have a command-line option to use large pages to
> >> back guest memory.
> >>
> >> Large pages should improve performance significantly, both with
> >> traditional shadow and npt/ept.
> >
> > Yes, I think that too. But with shadow paging it really depends on the
> > guest whether the performance increase is long-term. In a Linux guest,
> > for example, the direct-mapped memory will become fragmented over
> > time (together with the location of the page tables). So the
> > number of potential large page mappings will likely decrease over
> > time.
>
> Yes, that's why it is important to be able to fail fast when checking
> whether we can use a large spte.

Ok, how does the following look? Still need to plug in large page
creation in the nonpaging case, but this should be enough for comments.

One drawback is that the hugepage follow_page() path uses a single
mm->page_table_lock spinlock, whereas for 4KB pages the lock is split,
one per PTE page. This is noticeable (SMP guests are slower due to it),
but it's a separate problem.

Gives a 2% improvement in kernel compilations and large memory copies.

Attached is the qemu part, but this is obviously just a hack in case
someone is interested in testing...
Index: linux-2.6-x86-kvm/arch/x86/kvm/mmu.c
===================================================================
--- linux-2.6-x86-kvm.orig/arch/x86/kvm/mmu.c
+++ linux-2.6-x86-kvm/arch/x86/kvm/mmu.c
@@ -27,6 +27,7 @@
 #include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/swap.h>
+#include <linux/hugetlb.h>

 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -205,6 +206,11 @@ static int is_shadow_present_pte(u64 pte
 		&& pte != shadow_notrap_nonpresent_pte;
 }

+static int is_large_pte(u64 pte)
+{
+	return pte & PT_PAGE_SIZE_MASK;
+}
+
 static int is_writeble_pte(unsigned long pte)
 {
 	return pte & PT_WRITABLE_MASK;
@@ -349,6 +355,107 @@ static void mmu_free_rmap_desc(struct kv
 	kfree(rd);
 }

+#define HPAGE_ALIGN_OFFSET(x) ((((x)+HPAGE_SIZE-1)&HPAGE_MASK) - (x))
+/*
+ * Return the offset inside the memslot largepage integer map for a given
+ * gfn, handling slots that are not large page aligned.
+ */
+int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
+{
+	unsigned long long idx;
+	unsigned long memslot_align;
+
+	memslot_align = HPAGE_ALIGN_OFFSET(slot->base_gfn << PAGE_SHIFT);
+	idx = ((gfn - slot->base_gfn) << PAGE_SHIFT) + memslot_align;
+	idx /= HPAGE_SIZE;
+	return &slot->largepage[idx];
+}
+
+static void account_shadowed(struct kvm *kvm, gfn_t gfn)
+{
+	int *largepage_idx;
+
+	largepage_idx = slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn));
+	*largepage_idx += 1;
+	WARN_ON(*largepage_idx > (HPAGE_SIZE/PAGE_SIZE));
+}
+
+static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
+{
+	int *largepage_idx;
+
+	largepage_idx = slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn));
+	*largepage_idx -= 1;
+	WARN_ON(*largepage_idx < 0);
+}
+
+static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (slot) {
+		int *largepage_idx;
+		int end_gfn;
+
+		largepage_idx = slot_largepage_idx(gfn, slot);
+		/* check if the largepage crosses a memslot */
+		end_gfn = slot->base_gfn + slot->npages;
+		if (gfn + (HPAGE_SIZE/PAGE_SIZE) >= end_gfn)
+			return 1;
+		else
+			return *largepage_idx;
+	}
+
+	return 1;
+}
+
+extern unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
+static int host_largepage_backed(struct kvm *kvm, gfn_t gfn)
+{
+	struct vm_area_struct *vma;
+	unsigned long addr;
+
+	addr = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(addr))
+		return 0;
+
+	vma = find_vma(current->mm, addr);
+	if (vma && is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
+static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
+{
+	if (has_wrprotected_page(vcpu->kvm, large_gfn))
+		return 0;
+
+	if (!host_largepage_backed(vcpu->kvm, large_gfn))
+		return 0;
+
+	if ((large_gfn << PAGE_SHIFT) & (HPAGE_SIZE-1))
+		return 0;
+
+	/* guest has 4M pages */
+	if (!is_pae(vcpu))
+		return 0;
+
+	return 1;
+}
+
+static int is_physical_memory(struct kvm *kvm, gfn_t gfn)
+{
+	unsigned long addr;
+
+	addr = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(addr))
+		return 0;
+
+	return 1;
+}
+
 /*
  * Take gfn and return the reverse mapping to it.
  * Note: gfn must be unaliased before this function get called
@@ -362,6 +469,20 @@ static unsigned long *gfn_to_rmap(struct
 	return &slot->rmap[gfn - slot->base_gfn];
 }

+static unsigned long *gfn_to_rmap_pde(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot;
+	unsigned long memslot_align;
+	unsigned long long idx;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	memslot_align = HPAGE_ALIGN_OFFSET(slot->base_gfn << PAGE_SHIFT);
+
+	idx = ((gfn - slot->base_gfn) << PAGE_SHIFT) + memslot_align;
+	idx /= HPAGE_SIZE;
+	return &slot->rmap_pde[idx];
+}
+
 /*
  * Reverse mapping data structures:
  *
@@ -371,7 +492,7 @@ static unsigned long *gfn_to_rmap(struct
  * If rmapp bit zero is one, (then rmap & ~1) points to a struct kvm_rmap_desc
  * containing more mappings.
  */
-static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
+static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int hpage)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_desc *desc;
@@ -383,7 +504,10 @@ static void rmap_add(struct kvm_vcpu *vc
 	gfn = unalias_gfn(vcpu->kvm, gfn);
 	sp = page_header(__pa(spte));
 	sp->gfns[spte - sp->spt] = gfn;
-	rmapp = gfn_to_rmap(vcpu->kvm, gfn);
+	if (!hpage)
+		rmapp = gfn_to_rmap(vcpu->kvm, gfn);
+	else
+		rmapp = gfn_to_rmap_pde(vcpu->kvm, gfn);
 	if (!*rmapp) {
 		rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
 		*rmapp = (unsigned long)spte;
@@ -449,7 +573,10 @@ static void rmap_remove(struct kvm *kvm,
 		kvm_release_page_dirty(page);
 	else
 		kvm_release_page_clean(page);
-	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
+	if (is_large_pte(*spte))
+		rmapp = gfn_to_rmap_pde(kvm, sp->gfns[spte - sp->spt]);
+	else
+		rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
 	if (!*rmapp) {
 		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
 		BUG();
@@ -528,8 +655,27 @@ static void rmap_write_protect(struct kv
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
+	/* check for huge page mappings */
+	rmapp = gfn_to_rmap_pde(kvm, gfn);
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!spte);
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+		pgprintk("rmap_write_protect(large): spte %p %llx %d\n", spte, *spte, gfn);
+		if (is_writeble_pte(*spte)) {
+			rmap_remove(kvm, spte);
+			--kvm->stat.lpages;
+			set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+			write_protected = 1;
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+
 	if (write_protected)
 		kvm_flush_remote_tlbs(kvm);
+
+	account_shadowed(kvm, gfn);
 }

 #ifdef MMU_DEBUG
@@ -749,11 +895,19 @@ static void kvm_mmu_page_unlink_children
 	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
 		ent = pt[i];

+		if (is_shadow_present_pte(pt[i]) && is_large_pte(pt[i])) {
+			if (is_writeble_pte(pt[i]))
+				--kvm->stat.lpages;
+			rmap_remove(kvm, &pt[i]);
+		}
+
 		pt[i] = shadow_trap_nonpresent_pte;
 		if (!is_shadow_present_pte(ent))
 			continue;
-		ent &= PT64_BASE_ADDR_MASK;
-		mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
+		if (!is_large_pte(ent)) {
+			ent &= PT64_BASE_ADDR_MASK;
+			mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
+		}
 	}
 	kvm_flush_remote_tlbs(kvm);
 }
@@ -793,6 +947,8 @@ static void kvm_mmu_zap_page(struct kvm
 	}
 	kvm_mmu_page_unlink_children(kvm, sp);
 	if (!sp->root_count) {
+		if (!sp->role.metaphysical)
+			unaccount_shadowed(kvm, sp->gfn);
 		hlist_del(&sp->hash_link);
 		kvm_mmu_free_page(kvm, sp);
 	} else
@@ -886,12 +1042,21 @@ struct page *gva_to_page(struct kvm_vcpu
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
 			 unsigned pt_access, unsigned pte_access,
 			 int user_fault, int write_fault, int dirty,
-			 int *ptwrite, gfn_t gfn, struct page *page)
+			 int *ptwrite, int largepage, gfn_t gfn,
+			 struct page *page)
 {
 	u64 spte;
 	int was_rmapped = is_rmap_pte(*shadow_pte);
 	int was_writeble = is_writeble_pte(*shadow_pte);

+	/*
+	 * If its a largepage mapping, there could be a previous
+	 * pointer to a PTE page hanging there, which will falsely
+	 * set was_rmapped.
+	 */
+	if (largepage)
+		was_rmapped = is_large_pte(*shadow_pte);
+
 	pgprintk("%s: spte %llx access %x write_fault %d"
 		 " user_fault %d gfn %lx\n",
 		 __FUNCTION__, *shadow_pte, pt_access,
@@ -911,6 +1076,8 @@ static void mmu_set_spte(struct kvm_vcpu
 		spte |= PT_PRESENT_MASK;
 	if (pte_access & ACC_USER_MASK)
 		spte |= PT_USER_MASK;
+	if (largepage)
+		spte |= PT_PAGE_SIZE_MASK;

 	if (is_error_page(page)) {
 		set_shadow_pte(shadow_pte,
@@ -932,7 +1099,8 @@ static void mmu_set_spte(struct kvm_vcpu
 		}

 		shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
-		if (shadow) {
+		if (shadow ||
+		    (largepage && has_wrprotected_page(vcpu->kvm, gfn))) {
 			pgprintk("%s: found shadow page for %lx, marking ro\n",
 				 __FUNCTION__, gfn);
 			pte_access &= ~ACC_WRITE_MASK;
@@ -951,10 +1119,17 @@ unshadowed:
 		mark_page_dirty(vcpu->kvm, gfn);

 	pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
+	pgprintk("instantiating %s PTE (%s) at %d (%llx)\n",
+		 (spte&PT_PAGE_SIZE_MASK)? "2MB" : "4kB",
+		 (spte&PT_WRITABLE_MASK)?"RW":"R", gfn, spte);
 	set_shadow_pte(shadow_pte, spte);
+	if (!was_rmapped && (spte & (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
+	    == (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
+		++vcpu->kvm->stat.lpages;
+
 	page_header_update_slot(vcpu->kvm, shadow_pte, gfn);
 	if (!was_rmapped) {
-		rmap_add(vcpu, shadow_pte, gfn);
+		rmap_add(vcpu, shadow_pte, gfn, largepage);
 		if (!is_rmap_pte(*shadow_pte))
 			kvm_release_page_clean(page);
 	} else {
@@ -987,7 +1162,7 @@ static int __nonpaging_map(struct kvm_vc

 		if (level == 1) {
 			mmu_set_spte(vcpu, &table[index], ACC_ALL, ACC_ALL,
-				     0, write, 1, &pt_write, gfn, page);
+				     0, write, 1, &pt_write, 0, gfn, page);
 			return pt_write || is_io_pte(table[index]);
 		}

@@ -1300,7 +1475,8 @@ static void mmu_pte_write_zap_pte(struct

 	pte = *spte;
 	if (is_shadow_present_pte(pte)) {
-		if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+		if (sp->role.level == PT_PAGE_TABLE_LEVEL ||
+		    is_large_pte(pte))
 			rmap_remove(vcpu->kvm, spte);
 		else {
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
@@ -1308,14 +1484,18 @@ static void mmu_pte_write_zap_pte(struct
 		}
 	}
 	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	if (is_large_pte(pte) && is_writeble_pte(pte))
+		--vcpu->kvm->stat.lpages;
 }

 static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu_page *sp,
 				  u64 *spte,
-				  const void *new)
+				  const void *new,
+				  u64 old)
 {
-	if (sp->role.level != PT_PAGE_TABLE_LEVEL) {
+	if ((sp->role.level != PT_PAGE_TABLE_LEVEL)
+	    && !vcpu->arch.update_pte.largepage) {
 		++vcpu->kvm->stat.mmu_pde_zapped;
 		return;
 	}
@@ -1390,6 +1570,10 @@ static void mmu_guess_page_from_pte_writ
 	gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 	vcpu->arch.update_pte.gfn = gfn;
 	vcpu->arch.update_pte.page = gfn_to_page(vcpu->kvm, gfn);
+	if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn))
+		vcpu->arch.update_pte.largepage = 1;
+	else
+		vcpu->arch.update_pte.largepage = 0;
 }

 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1487,7 +1671,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *
 		entry = *spte;
 		mmu_pte_write_zap_pte(vcpu, sp, spte);
 		if (new)
-			mmu_pte_write_new_pte(vcpu, sp, spte, new);
+			mmu_pte_write_new_pte(vcpu, sp, spte, new, entry);
 		mmu_pte_write_flush_tlb(vcpu, entry, *spte);
 		++spte;
 	}
Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
===================================================================
--- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
+++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
@@ -71,6 +71,7 @@ struct guest_walker {
 	unsigned pte_access;
 	gfn_t gfn;
 	u32 error_code;
+	int largepage_backed;
 };

 static gfn_t gpte_to_gfn(pt_element_t gpte)
@@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
  */
 static int FNAME(walk_addr)(struct guest_walker *walker,
 			    struct kvm_vcpu *vcpu, gva_t addr,
-			    int write_fault, int user_fault, int fetch_fault)
+			    int write_fault, int user_fault, int fetch_fault,
+			    int faulting)
 {
 	pt_element_t pte;
 	gfn_t table_gfn;
@@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
 walk:
 	walker->level = vcpu->arch.mmu.root_level;
+	walker->largepage_backed = 0;
 	pte = vcpu->arch.cr3;
 #if PTTYPE == 64
 	if (!is_long_mode(vcpu)) {
@@ -192,10 +195,19 @@ walk:
 		if (walker->level == PT_DIRECTORY_LEVEL
 		    && (pte & PT_PAGE_SIZE_MASK)
 		    && (PTTYPE == 64 || is_pse(vcpu))) {
-			walker->gfn = gpte_to_gfn_pde(pte);
+			gfn_t gfn = gpte_to_gfn_pde(pte);
+			walker->gfn = gfn;
+			walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
 			if (PTTYPE == 32 && is_cpuid_PSE36())
 				walker->gfn += pse36_gfn_delta(pte);
+
+			if (faulting
+			    && is_largepage_backed(vcpu, gfn)
+			    && is_physical_memory(vcpu->kvm, walker->gfn)) {
+				walker->largepage_backed = 1;
+				walker->gfn = gfn;
+			}
 			break;
 		}

@@ -245,6 +257,7 @@ static void FNAME(update_pte)(struct kvm
 	pt_element_t gpte;
 	unsigned pte_access;
 	struct page *npage;
+	int largepage = vcpu->arch.update_pte.largepage;

 	gpte = *(const pt_element_t *)pte;
 	if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) {
@@ -261,7 +274,8 @@ static void FNAME(update_pte)(struct kvm
 		return;
 	get_page(npage);
 	mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
-		     gpte & PT_DIRTY_MASK, NULL, gpte_to_gfn(gpte), npage);
+		     gpte & PT_DIRTY_MASK, NULL, largepage, gpte_to_gfn(gpte),
+		     npage);
 }

 /*
@@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 		shadow_ent = ((u64 *)__va(shadow_addr)) + index;
 		if (level == PT_PAGE_TABLE_LEVEL)
 			break;
+		if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
+			break;
+
 		if (is_shadow_present_pte(*shadow_ent)) {
 			shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK;
 			continue;
@@ -337,7 +354,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 	mmu_set_spte(vcpu, shadow_ent, access, walker->pte_access & access,
 		     user_fault, write_fault,
 		     walker->ptes[walker->level-1] & PT_DIRTY_MASK,
-		     ptwrite, walker->gfn, page);
+		     ptwrite, walker->largepage_backed, walker->gfn, page);

 	return shadow_ent;
 }
@@ -380,7 +397,7 @@ static int FNAME(page_fault)(struct kvm_
 	 * Look up the shadow pte for the faulting address.
	 */
 	r = FNAME(walk_addr)(&walker, vcpu, addr, write_fault, user_fault,
-			     fetch_fault);
+			     fetch_fault, 1);

 	/*
	 * The page is not mapped by the guest.  Let the guest handle it.
@@ -395,6 +412,13 @@ static int FNAME(page_fault)(struct kvm_

 	page = gfn_to_page(vcpu->kvm, walker.gfn);

+	/* shortcut non-RAM accesses to avoid walking over a 2MB PMD entry */
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		up_read(&current->mm->mmap_sem);
+		return 1;
+	}
+
 	spin_lock(&vcpu->kvm->mmu_lock);
 	kvm_mmu_free_some_pages(vcpu);
 	shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
@@ -428,7 +452,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kv
 	gpa_t gpa = UNMAPPED_GVA;
 	int r;

-	r = FNAME(walk_addr)(&walker, vcpu, vaddr, 0, 0, 0);
+	r = FNAME(walk_addr)(&walker, vcpu, vaddr, 0, 0, 0, 0);

 	if (r) {
 		gpa = gfn_to_gpa(walker.gfn);
Index: linux-2.6-x86-kvm/include/asm-x86/kvm_host.h
===================================================================
--- linux-2.6-x86-kvm.orig/include/asm-x86/kvm_host.h
+++ linux-2.6-x86-kvm/include/asm-x86/kvm_host.h
@@ -228,6 +228,7 @@ struct kvm_vcpu_arch {
 	struct {
 		gfn_t gfn;	/* presumed gfn during guest pte update */
 		struct page *page;	/* page corresponding to that gfn */
+		int largepage;
 	} update_pte;

 	struct i387_fxsave_struct host_fx_image;
@@ -298,6 +299,7 @@ struct kvm_vm_stat {
 	u32 mmu_recycled;
 	u32 mmu_cache_miss;
 	u32 remote_tlb_flush;
+	u32 lpages;
 };

 struct kvm_vcpu_stat {
Index: linux-2.6-x86-kvm/include/linux/kvm_host.h
===================================================================
--- linux-2.6-x86-kvm.orig/include/linux/kvm_host.h
+++ linux-2.6-x86-kvm/include/linux/kvm_host.h
@@ -99,7 +99,9 @@ struct kvm_memory_slot {
 	unsigned long npages;
 	unsigned long flags;
 	unsigned long *rmap;
+	unsigned long *rmap_pde;
 	unsigned long *dirty_bitmap;
+	int *largepage;
 	unsigned long userspace_addr;
 	int user_alloc;
 };
Index: linux-2.6-x86-kvm/virt/kvm/kvm_main.c
===================================================================
--- linux-2.6-x86-kvm.orig/virt/kvm/kvm_main.c
+++ linux-2.6-x86-kvm/virt/kvm/kvm_main.c
@@ -188,9 +188,17 @@ static void kvm_free_physmem_slot(struct
 	if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
 		vfree(free->dirty_bitmap);

+	if (!dont || free->rmap_pde != dont->rmap_pde)
+		vfree(free->rmap_pde);
+
+	if (!dont || free->largepage != dont->largepage)
+		kfree(free->largepage);
+
 	free->npages = 0;
 	free->dirty_bitmap = NULL;
 	free->rmap = NULL;
+	free->rmap_pde = NULL;
+	free->largepage = NULL;
 }

 void kvm_free_physmem(struct kvm *kvm)
@@ -300,6 +308,28 @@ int __kvm_set_memory_region(struct kvm *
 		new.user_alloc = user_alloc;
 		new.userspace_addr = mem->userspace_addr;
 	}
+	if (npages && !new.rmap_pde) {
+		int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
+		if (npages % (HPAGE_SIZE/PAGE_SIZE))
+			largepages++;
+		new.rmap_pde = vmalloc(largepages * sizeof(struct page *));
+
+		if (!new.rmap_pde)
+			goto out_free;
+
+		memset(new.rmap_pde, 0, largepages * sizeof(struct page *));
+	}
+	if (npages && !new.largepage) {
+		int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
+		if (npages % (HPAGE_SIZE/PAGE_SIZE))
+			largepages++;
+		new.largepage = kmalloc(largepages * sizeof(int), GFP_KERNEL);
+
+		if (!new.largepage)
+			goto out;
+
+		memset(new.largepage, 0, largepages * sizeof(int));
+	}

 	/* Allocate page dirty bitmap if needed */
 	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
@@ -443,7 +473,7 @@ int kvm_is_visible_gfn(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);

-static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
+unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 {
 	struct kvm_memory_slot *slot;

@@ -453,6 +483,7 @@ static unsigned long gfn_to_hva(struct k
 		return bad_hva();
 	return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE);
 }
+EXPORT_SYMBOL(gfn_to_hva);

 /*
  * Requires current->mm->mmap_sem to be held
Index: linux-2.6-x86-kvm/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6-x86-kvm.orig/arch/x86/kvm/x86.c
+++ linux-2.6-x86-kvm/arch/x86/kvm/x86.c
@@ -75,6 +75,7 @@ struct kvm_stats_debugfs_item debugfs_en
 	{ "mmu_recycled", VM_STAT(mmu_recycled) },
 	{ "mmu_cache_miss", VM_STAT(mmu_cache_miss) },
 	{ "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
+	{ "lpages", VM_STAT(lpages) },
 	{ NULL }
 };

[-- Attachment #2: qemu-largepage-hack.patch --]
[-- Type: text/plain, Size: 1302 bytes --]

Index: kvm-userspace/qemu/vl.c
===================================================================
--- kvm-userspace.orig/qemu/vl.c
+++ kvm-userspace/qemu/vl.c
@@ -8501,6 +8501,31 @@ void qemu_get_launch_info(int *argc, cha
     *opt_incoming = incoming;
 }

+#define HPAGE_SIZE 2*1024*1024
+
+void *alloc_huge_area(unsigned long memory)
+{
+    void *area;
+    int fd;
+    char path[] = "/mnt/kvm.XXXXXX";
+
+    mkstemp(path);
+    fd = open(path, O_RDWR);
+    if (fd < 0) {
+        perror("open");
+        exit(0);
+    }
+    memory = (memory+HPAGE_SIZE-1) & ~(HPAGE_SIZE-1);
+
+    area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
+    if (area == MAP_FAILED) {
+        perror("mmap");
+        exit(0);
+    }
+
+    return area;
+}
+
 int main(int argc, char **argv)
 {
 #ifdef CONFIG_GDBSTUB
@@ -9330,9 +9355,9 @@ int main(int argc, char **argv)

     ret = kvm_qemu_check_extension(KVM_CAP_USER_MEMORY);
     if (ret) {
-        printf("allocating %d MB\n", phys_ram_size/1024/1024);
-        phys_ram_base = qemu_vmalloc(phys_ram_size);
-        if (!phys_ram_base) {
+        //phys_ram_base = qemu_vmalloc(phys_ram_size);
+        phys_ram_base = alloc_huge_area(phys_ram_size);
+        if (!phys_ram_base) {
             fprintf(stderr, "Could not allocate physical memory\n");
             exit(1);
         }

[-- Attachment #4: Type: text/plain, Size: 158 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
* Re: large page support for kvm
From: Avi Kivity @ 2008-02-12 11:55 UTC
To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> Ok, how does the following look. Still need to plug in large page
> creation in the nonpaging case, but this should be enough for comments.

Most of the comments are cosmetic, but a couple have some meat.

> +#define HPAGE_ALIGN_OFFSET(x) ((((x)+HPAGE_SIZE-1)&HPAGE_MASK) - (x))

A function please.

> +/*
> + * Return the offset inside the memslot largepage integer map for a given
> + * gfn, handling slots that are not large page aligned.
> + */
> +int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
> +{
> +	unsigned long long idx;
> +	unsigned long memslot_align;
> +
> +	memslot_align = HPAGE_ALIGN_OFFSET(slot->base_gfn << PAGE_SHIFT);
> +	idx = ((gfn - slot->base_gfn) << PAGE_SHIFT) + memslot_align;
> +	idx /= HPAGE_SIZE;

Can probably be reshuffled to avoid the long long. The compiler may want
to call helpers on i386... We also need our own HPAGE_SIZE since kvm
uses pae on non-pae hosts.

> +
> +static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
> +{
> +	struct kvm_memory_slot *slot;
> +
> +	slot = gfn_to_memslot(kvm, gfn);
> +	if (slot) {
> +		int *largepage_idx;
> +		int end_gfn;
> +
> +		largepage_idx = slot_largepage_idx(gfn, slot);
> +		/* check if the largepage crosses a memslot */
> +		end_gfn = slot->base_gfn + slot->npages;
> +		if (gfn + (HPAGE_SIZE/PAGE_SIZE) >= end_gfn)
> +			return 1;
> +		else
> +			return *largepage_idx;

We might initialize the boundary slots to 1 instead of zero and so avoid
this check.
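The indexing question above is easy to get wrong for unaligned slots. The following user-space sketch works purely in frame numbers, the direction the patch later takes; kvm structs are replaced by plain integers, and PAGES_PER_HPAGE = 512 assumes 2MB huge pages over 4KB base pages:

```c
/* Illustrative sketch only, not the kernel code: index into a per-slot
 * array with one entry per potential 2MB frame, for slots whose base
 * gfn is not 2MB aligned. */
#define PAGES_PER_HPAGE 512UL

/* Distance (in 4K pages) from base_gfn up to the next large-page
 * boundary; 0 when the slot is already aligned. */
static unsigned long hpage_align_diff(unsigned long base_gfn)
{
    return ((base_gfn + PAGES_PER_HPAGE - 1) & ~(PAGES_PER_HPAGE - 1))
           - base_gfn;
}

/* Array index of the large-page frame containing gfn. */
static unsigned long largepage_idx(unsigned long gfn, unsigned long base_gfn)
{
    return (gfn - base_gfn + hpage_align_diff(base_gfn)) / PAGES_PER_HPAGE;
}
```

For a slot starting at gfn 256 (gpa 1MB), the first 2MB boundary is gfn 512: gfns 256-511 share index 0, and each aligned 2MB frame after that gets its own entry, so no 2MB frame ever spans two indices.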
> +static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
> +{
> +	if (has_wrprotected_page(vcpu->kvm, large_gfn))
> +		return 0;
> +
> +	if (!host_largepage_backed(vcpu->kvm, large_gfn))
> +		return 0;
> +
> +	if ((large_gfn << PAGE_SHIFT) & (HPAGE_SIZE-1))
> +		return 0;

I'd drop this check and simply ignore the least significant 9 bits of
large_gfn.

> +
> +	/* guest has 4M pages */
> +	if (!is_pae(vcpu))
> +		return 0;

We can drop this check too. If we treat a 4MB page as two contiguous 2MB
pages, and don't zero out the large_gfn least significant bits, most of
the code needn't know about it at all.

> @@ -362,6 +469,20 @@ static unsigned long *gfn_to_rmap(struct
> 	return &slot->rmap[gfn - slot->base_gfn];
> }
>
> /*
>  * Reverse mapping data structures:
>  *
> @@ -371,7 +492,7 @@ static unsigned long *gfn_to_rmap(struct
>  * If rmapp bit zero is one, (then rmap & ~1) points to a struct kvm_rmap_desc
>  * containing more mappings.
>  */
> -static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
> +static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int hpage)
> {
> 	struct kvm_mmu_page *sp;
> 	struct kvm_rmap_desc *desc;
> @@ -383,7 +504,10 @@ static void rmap_add(struct kvm_vcpu *vc
> 	gfn = unalias_gfn(vcpu->kvm, gfn);
> 	sp = page_header(__pa(spte));
> 	sp->gfns[spte - sp->spt] = gfn;
> -	rmapp = gfn_to_rmap(vcpu->kvm, gfn);
> +	if (!hpage)
> +		rmapp = gfn_to_rmap(vcpu->kvm, gfn);
> +	else
> +		rmapp = gfn_to_rmap_pde(vcpu->kvm, gfn);

This bit can go into a function

> 	if (!*rmapp) {
> 		rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
> 		*rmapp = (unsigned long)spte;
> @@ -449,7 +573,10 @@ static void rmap_remove(struct kvm *kvm,
> 		kvm_release_page_dirty(page);
> 	else
> 		kvm_release_page_clean(page);
> -	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
> +	if (is_large_pte(*spte))
> +		rmapp = gfn_to_rmap_pde(kvm, sp->gfns[spte - sp->spt]);
> +	else
> +		rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);

As it is reused here.
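The "two contiguous 2MB pages" suggestion reduces to mask arithmetic. This hypothetical user-space sketch (names and constants are illustrative, not from the patch) shows how the faulting gfn alone selects the correct host 2MB frame inside a non-PAE guest 4MB page, so the rest of the code never needs to know a 4MB guest mapping was involved:

```c
/* Sketch only: a guest 4MB page spans 1024 gfns (4KB base pages); a
 * host 2MB frame spans 512. Masking the faulting gfn down to a 512-gfn
 * boundary picks the right half of the 4MB page automatically. */
#define HOST_PAGES_PER_2M 512UL

/* Base gfn of the host 2MB frame backing a given faulting gfn. */
static unsigned long host_frame_base(unsigned long faulting_gfn)
{
    return faulting_gfn & ~(HOST_PAGES_PER_2M - 1);
}
```

A fault anywhere in the first 512 gfns of the guest page resolves to the first host frame, anywhere in the second 512 to the second; bit 9 of the gfn is the half-selector.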
>
> #ifdef MMU_DEBUG
> @@ -749,11 +895,19 @@ static void kvm_mmu_page_unlink_children
> 	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
> 		ent = pt[i];
>
> +		if (is_shadow_present_pte(pt[i]) && is_large_pte(pt[i])) {
> +			if (is_writeble_pte(pt[i]))
> +				--kvm->stat.lpages;
> +			rmap_remove(kvm, &pt[i]);
> +		}
> +

pt[i] == ent here. You can also move this code to be the 'else' part of
the if () you introduce below.

> 		pt[i] = shadow_trap_nonpresent_pte;
> 		if (!is_shadow_present_pte(ent))
> 			continue;
> -		ent &= PT64_BASE_ADDR_MASK;
> -		mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
> +		if (!is_large_pte(ent)) {
> +			ent &= PT64_BASE_ADDR_MASK;
> +			mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
> +		}
> 	}
> 	kvm_flush_remote_tlbs(kvm);
> }
>
> @@ -886,12 +1042,21 @@ struct page *gva_to_page(struct kvm_vcpu
> static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
> 			 unsigned pt_access, unsigned pte_access,
> 			 int user_fault, int write_fault, int dirty,
> -			 int *ptwrite, gfn_t gfn, struct page *page)
> +			 int *ptwrite, int largepage, gfn_t gfn,
> +			 struct page *page)
> {
> 	u64 spte;
> 	int was_rmapped = is_rmap_pte(*shadow_pte);
> 	int was_writeble = is_writeble_pte(*shadow_pte);
>
> +	/*
> +	 * If it's a largepage mapping, there could be a previous
> +	 * pointer to a PTE page hanging there, which will falsely
> +	 * set was_rmapped.
> +	 */
> +	if (largepage)
> +		was_rmapped = is_large_pte(*shadow_pte);
> +

But that pte page will have its parent_pte chain pointing to shadow_pte,
no? Either this can't happen, or we need to unlink that pte page first.

> @@ -951,10 +1119,17 @@ unshadowed:
> 		mark_page_dirty(vcpu->kvm, gfn);
>
> 	pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
> +	pgprintk("instantiating %s PTE (%s) at %d (%llx)\n",
> +		 (spte&PT_PAGE_SIZE_MASK)? "2MB" : "4kB",
> +		 (spte&PT_WRITABLE_MASK)?"RW":"R", gfn, spte);
> 	set_shadow_pte(shadow_pte, spte);
> +	if (!was_rmapped && (spte & (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
> +	    == (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
> +		++vcpu->kvm->stat.lpages;
> +

Why do you only account for writable large pages?

> Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
> ===================================================================
> --- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
> +++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
> @@ -71,6 +71,7 @@ struct guest_walker {
> 	unsigned pte_access;
> 	gfn_t gfn;
> 	u32 error_code;
> +	int largepage_backed;
> };
>
> static gfn_t gpte_to_gfn(pt_element_t gpte)
> @@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
>  */
> static int FNAME(walk_addr)(struct guest_walker *walker,
> 			    struct kvm_vcpu *vcpu, gva_t addr,
> -			    int write_fault, int user_fault, int fetch_fault)
> +			    int write_fault, int user_fault, int fetch_fault,
> +			    int faulting)
> {
> 	pt_element_t pte;
> 	gfn_t table_gfn;
> @@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
> 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
> walk:
> 	walker->level = vcpu->arch.mmu.root_level;
> +	walker->largepage_backed = 0;
> 	pte = vcpu->arch.cr3;
> #if PTTYPE == 64
> 	if (!is_long_mode(vcpu)) {
> @@ -192,10 +195,19 @@ walk:
> 		if (walker->level == PT_DIRECTORY_LEVEL
> 		    && (pte & PT_PAGE_SIZE_MASK)
> 		    && (PTTYPE == 64 || is_pse(vcpu))) {
> -			walker->gfn = gpte_to_gfn_pde(pte);
> +			gfn_t gfn = gpte_to_gfn_pde(pte);
> +			walker->gfn = gfn;
> +
> 			walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
> 			if (PTTYPE == 32 && is_cpuid_PSE36())
> 				walker->gfn += pse36_gfn_delta(pte);
> +
> +			if (faulting
> +			    && is_largepage_backed(vcpu, gfn)
> +			    && is_physical_memory(vcpu->kvm, walker->gfn)) {
> +				walker->largepage_backed = 1;
> +				walker->gfn = gfn;
> +			}

I don't like this bit. So far the walker has been independent of the
host state, and only depended on guest data.
We can set largepage_backed unconditionally on a large guest pte, leave
gfn containing the low-order bits, and leave the host checks for the
actual mapping logic.

>
> /*
> @@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> 		shadow_ent = ((u64 *)__va(shadow_addr)) + index;
> 		if (level == PT_PAGE_TABLE_LEVEL)
> 			break;
> +		if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
> +			break;
> +

(here)

> @@ -395,6 +412,13 @@ static int FNAME(page_fault)(struct kvm_
>
> 	page = gfn_to_page(vcpu->kvm, walker.gfn);
>
> +	/* shortcut non-RAM accesses to avoid walking over a 2MB PMD entry */
> +	if (is_error_page(page)) {
> +		kvm_release_page_clean(page);
> +		up_read(&current->mm->mmap_sem);
> +		return 1;
> +	}
> +

This bit is already in kvm.git?

>
> struct kvm_vcpu_stat {
> Index: linux-2.6-x86-kvm/include/linux/kvm_host.h
> ===================================================================
> --- linux-2.6-x86-kvm.orig/include/linux/kvm_host.h
> +++ linux-2.6-x86-kvm/include/linux/kvm_host.h
> @@ -99,7 +99,9 @@ struct kvm_memory_slot {
> 	unsigned long npages;
> 	unsigned long flags;
> 	unsigned long *rmap;
> +	unsigned long *rmap_pde;
> 	unsigned long *dirty_bitmap;
> +	int *largepage;

We can combine rmap_pde and largepage into an array of
struct { int write_count; unsigned long rmap_pde; } and save some
cacheline accesses.
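The combined structure suggested above could look like the following user-space sketch. Field and function names are illustrative, not the final kvm_host.h layout; the point is that the write-protect count and the large-page rmap head for one 2MB frame land in the same array entry, so a fault path touches one cache line instead of two:

```c
#include <stdlib.h>

/* One entry per potential 2MB frame in a memory slot (sketch). */
struct lpage_info {
    int write_count;          /* write-protected 4K pages in this frame */
    unsigned long rmap_pde;   /* rmap chain head for large-page sptes */
};

/* Allocate one entry per 2MB frame, rounding up for a partial frame at
 * the end of the slot; zero-initialized like the kernel allocation. */
static struct lpage_info *alloc_lpage_info(unsigned long npages,
                                           unsigned long pages_per_hpage)
{
    unsigned long n = npages / pages_per_hpage;

    if (npages % pages_per_hpage)
        n++;
    return calloc(n, sizeof(struct lpage_info));
}
```

A slot of 1000 pages with 512 pages per huge page gets two entries: one full 2MB frame plus one partial frame whose write_count the boundary-initialization trick from earlier in the thread could pre-set to 1.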
>
> void kvm_free_physmem(struct kvm *kvm)
> @@ -300,6 +308,28 @@ int __kvm_set_memory_region(struct kvm *
> 		new.user_alloc = user_alloc;
> 		new.userspace_addr = mem->userspace_addr;
> 	}
> +	if (npages && !new.rmap_pde) {
> +		int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
> +		if (npages % (HPAGE_SIZE/PAGE_SIZE))
> +			largepages++;
> +		new.rmap_pde = vmalloc(largepages * sizeof(struct page *));
> +
> +		if (!new.rmap_pde)
> +			goto out_free;
> +
> +		memset(new.rmap_pde, 0, largepages * sizeof(struct page *));
> +	}
> +	if (npages && !new.largepage) {
> +		int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
> +		if (npages % (HPAGE_SIZE/PAGE_SIZE))
> +			largepages++;
> +		new.largepage = kmalloc(largepages * sizeof(int), GFP_KERNEL);
> +

vmalloc() here too, for very large guests.

Please test the changes with user/test/x86/access.flat, with both normal
and large pages backing.

--
error compiling committee.c: too many arguments to function
* Re: large page support for kvm
From: Marcelo Tosatti @ 2008-02-13 0:15 UTC
To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel

On Tue, Feb 12, 2008 at 01:55:54PM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> > Ok, how does the following look. Still need to plug in large page
> > creation in the nonpaging case, but this should be enough for comments.
>
> Most of the comments are cosmetic, but a couple have some meat.
>
> > +#define HPAGE_ALIGN_OFFSET(x) ((((x)+HPAGE_SIZE-1)&HPAGE_MASK) - (x))
>
> A function please.

Done.

> Can probably be reshuffled to avoid the long long. The compiler may
> want to call helpers on i386...

Reworked to operate on frame numbers instead of addresses; that gets rid
of the long long (and it's also neater).

> This bit can go into a function
>
> > 	if (!*rmapp) {
> > 		rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
> > 		*rmapp = (unsigned long)spte;
> > @@ -449,7 +573,10 @@ static void rmap_remove(struct kvm *kvm,
> > 		kvm_release_page_dirty(page);
> > 	else
> > 		kvm_release_page_clean(page);
> > -	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
> > +	if (is_large_pte(*spte))
> > +		rmapp = gfn_to_rmap_pde(kvm, sp->gfns[spte - sp->spt]);
> > +	else
> > +		rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
>
> As it is reused here.

Ok, gfn_to_rmap_pde() has been merged into gfn_to_rmap() and there is a
"lpage" parameter to know which array (and array offset) to use.

> > +	/*
> > +	 * If it's a largepage mapping, there could be a previous
> > +	 * pointer to a PTE page hanging there, which will falsely
> > +	 * set was_rmapped.
> > +	 */
> > +	if (largepage)
> > +		was_rmapped = is_large_pte(*shadow_pte);
> > +
>
> But that pte page will have its parent_pte chain pointing to shadow_pte,
> no? Either this can't happen, or we need to unlink that pte page first.
This can happen, say if you have a large page frame with shadowed pages
in it, and therefore 4k mapped. When all shadowed pages are removed, it
will be mapped with a 2M page, overwriting the pointer to the
(metaphysical) PTE page.

When you say "need to unlink that pte page first", you mean simply
remove the parent pointer which now points to the 2M PMD entry, right?
This avoids a zap_mmu_page() on that metaphysical page nuking the
(unrelated) 2M PMD entry.

Nuking the metaphysical page (zap_page) seems overkill and
unnecessary... it will eventually be recycled, or reused if the large
mapping is nuked.

It doesn't appear to be necessary to flush the TLB whenever a 2MB PMD
overwrites a PMD-to-PTE-page pointer. In the worst case other CPUs will
use the cached 4k translations for a while. If there are permission
changes on these translations the OS is responsible for flushing the TLB
anyway.

> > @@ -951,10 +1119,17 @@ unshadowed:
> > 	mark_page_dirty(vcpu->kvm, gfn);
> >
> > 	pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
> > +	pgprintk("instantiating %s PTE (%s) at %d (%llx)\n",
> > +		 (spte&PT_PAGE_SIZE_MASK)? "2MB" : "4kB",
> > +		 (spte&PT_WRITABLE_MASK)?"RW":"R", gfn, spte);
> > 	set_shadow_pte(shadow_pte, spte);
> > +	if (!was_rmapped && (spte & (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
> > +	    == (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
> > +		++vcpu->kvm->stat.lpages;
> > +
>
> Why do you only account for writable large pages?

No particular reason. Will fix.
> > Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
> > ===================================================================
> > --- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
> > +++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
> > @@ -71,6 +71,7 @@ struct guest_walker {
> > 	unsigned pte_access;
> > 	gfn_t gfn;
> > 	u32 error_code;
> > +	int largepage_backed;
> > };
> >
> > static gfn_t gpte_to_gfn(pt_element_t gpte)
> > @@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
> >  */
> > static int FNAME(walk_addr)(struct guest_walker *walker,
> > 			    struct kvm_vcpu *vcpu, gva_t addr,
> > -			    int write_fault, int user_fault, int fetch_fault)
> > +			    int write_fault, int user_fault, int fetch_fault,
> > +			    int faulting)
> > {
> > 	pt_element_t pte;
> > 	gfn_t table_gfn;
> > @@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
> > 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
> > walk:
> > 	walker->level = vcpu->arch.mmu.root_level;
> > +	walker->largepage_backed = 0;
> > 	pte = vcpu->arch.cr3;
> > #if PTTYPE == 64
> > 	if (!is_long_mode(vcpu)) {
> > @@ -192,10 +195,19 @@ walk:
> > 		if (walker->level == PT_DIRECTORY_LEVEL
> > 		    && (pte & PT_PAGE_SIZE_MASK)
> > 		    && (PTTYPE == 64 || is_pse(vcpu))) {
> > -			walker->gfn = gpte_to_gfn_pde(pte);
> > +			gfn_t gfn = gpte_to_gfn_pde(pte);
> > +			walker->gfn = gfn;
> > +
> > 			walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
> > 			if (PTTYPE == 32 && is_cpuid_PSE36())
> > 				walker->gfn += pse36_gfn_delta(pte);
> > +
> > +			if (faulting
> > +			    && is_largepage_backed(vcpu, gfn)
> > +			    && is_physical_memory(vcpu->kvm, walker->gfn)) {
> > +				walker->largepage_backed = 1;
> > +				walker->gfn = gfn;
> > +			}
>
> I don't like this bit. So far the walker has been independent of the
> host state, and only depended on guest data. We can set
> largepage_backed unconditionally on a large guest pte, leave gfn
> containing the low-order bits, and leave the host checks for the actual
> mapping logic.
> > /*
> > @@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> > 		shadow_ent = ((u64 *)__va(shadow_addr)) + index;
> > 		if (level == PT_PAGE_TABLE_LEVEL)
> > 			break;
> > +		if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
> > +			break;
> > +
>
> (here)

gfn_to_page() needs to grab the struct page corresponding to the large
page, not the offset struct page for the faulting 4k address within the
large frame. Since gfn_to_page() can sleep, there is no way to do that
in the mapping logic, which happens under mmu_lock protection. We don't
want to grab the large page frame's "struct page" unless the
is_largepage_backed() checks are successful.

The checks could be done in page_fault() if walker->level == 2, before
gfn_to_page()... But I don't see much difference between that and doing
it inside walk_addr(). What do you say?

> > @@ -395,6 +412,13 @@ static int FNAME(page_fault)(struct kvm_
> >
> > 	page = gfn_to_page(vcpu->kvm, walker.gfn);
> >
> > +	/* shortcut non-RAM accesses to avoid walking over a 2MB PMD entry */
> > +	if (is_error_page(page)) {
> > +		kvm_release_page_clean(page);
> > +		up_read(&current->mm->mmap_sem);
> > +		return 1;
> > +	}
> > +
>
> This bit is already in kvm.git?

Yeah, great.

> > struct kvm_vcpu_stat {
> > Index: linux-2.6-x86-kvm/include/linux/kvm_host.h
> > ===================================================================
> > --- linux-2.6-x86-kvm.orig/include/linux/kvm_host.h
> > +++ linux-2.6-x86-kvm/include/linux/kvm_host.h
> > @@ -99,7 +99,9 @@ struct kvm_memory_slot {
> > 	unsigned long npages;
> > 	unsigned long flags;
> > 	unsigned long *rmap;
> > +	unsigned long *rmap_pde;
> > 	unsigned long *dirty_bitmap;
> > +	int *largepage;
>
> We can combine rmap_pde and largepage into an array of
> struct { int write_count; unsigned long rmap_pde; } and save some
> cacheline accesses.

Done. Fixed the other comments too...
Thanks

> > void kvm_free_physmem(struct kvm *kvm)
> > @@ -300,6 +308,28 @@ int __kvm_set_memory_region(struct kvm *
> > 		new.user_alloc = user_alloc;
> > 		new.userspace_addr = mem->userspace_addr;
> > 	}
> > +	if (npages && !new.rmap_pde) {
> > +		int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
> > +		if (npages % (HPAGE_SIZE/PAGE_SIZE))
> > +			largepages++;
> > +		new.rmap_pde = vmalloc(largepages * sizeof(struct page *));
> > +
> > +		if (!new.rmap_pde)
> > +			goto out_free;
> > +
> > +		memset(new.rmap_pde, 0, largepages * sizeof(struct page *));
> > +	}
> > +	if (npages && !new.largepage) {
> > +		int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
> > +		if (npages % (HPAGE_SIZE/PAGE_SIZE))
> > +			largepages++;
> > +		new.largepage = kmalloc(largepages * sizeof(int), GFP_KERNEL);
> > +
>
> vmalloc() here too, for very large guests.
>
> Please test the changes with user/test/x86/access.flat, with both normal
> and large pages backing.
>
> --
> error compiling committee.c: too many arguments to function
* Re: large page support for kvm
From: Avi Kivity @ 2008-02-13 6:45 UTC
To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
>>> +	/*
>>> +	 * If it's a largepage mapping, there could be a previous
>>> +	 * pointer to a PTE page hanging there, which will falsely
>>> +	 * set was_rmapped.
>>> +	 */
>>> +	if (largepage)
>>> +		was_rmapped = is_large_pte(*shadow_pte);
>>> +
>>
>> But that pte page will have its parent_pte chain pointing to shadow_pte,
>> no? Either this can't happen, or we need to unlink that pte page first.
>
> This can happen, say if you have a large page frame with shadowed pages
> in it, and therefore 4k mapped. When all shadowed pages are removed, it
> will be mapped with a 2M page, overwriting the pointer to the
> (metaphysical) PTE page.
>
> When you say "need to unlink that pte page first", you mean simply
> remove the parent pointer which now points to the 2M PMD entry, right?
> This avoids a zap_mmu_page() on that metaphysical page nuking the
> (unrelated) 2M PMD entry.

Yes. We need to keep that tangle of pointers consistent.

> Nuking the metaphysical page (zap_page) seems overkill and
> unnecessary... it will eventually be recycled, or reused if the large
> mapping is nuked.

Yes.

> It doesn't appear to be necessary to flush the TLB whenever a 2MB PMD
> overwrites a PMD-to-PTE-page pointer.

Since they're pointing to the same underlying memory, no, though we may
need to transfer the dirty bits from the ptes to the struct page.

> In the worst case other CPUs will use the cached 4k translations for a
> while. If there are permission changes on these translations the OS is
> responsible for flushing the TLB anyway.
>>> Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
>>> ===================================================================
>>> --- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
>>> +++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
>>> @@ -71,6 +71,7 @@ struct guest_walker {
>>> 	unsigned pte_access;
>>> 	gfn_t gfn;
>>> 	u32 error_code;
>>> +	int largepage_backed;
>>> };
>>>
>>> static gfn_t gpte_to_gfn(pt_element_t gpte)
>>> @@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
>>>  */
>>> static int FNAME(walk_addr)(struct guest_walker *walker,
>>> 			    struct kvm_vcpu *vcpu, gva_t addr,
>>> -			    int write_fault, int user_fault, int fetch_fault)
>>> +			    int write_fault, int user_fault, int fetch_fault,
>>> +			    int faulting)
>>> {
>>> 	pt_element_t pte;
>>> 	gfn_t table_gfn;
>>> @@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
>>> 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
>>> walk:
>>> 	walker->level = vcpu->arch.mmu.root_level;
>>> +	walker->largepage_backed = 0;
>>> 	pte = vcpu->arch.cr3;
>>> #if PTTYPE == 64
>>> 	if (!is_long_mode(vcpu)) {
>>> @@ -192,10 +195,19 @@ walk:
>>> 		if (walker->level == PT_DIRECTORY_LEVEL
>>> 		    && (pte & PT_PAGE_SIZE_MASK)
>>> 		    && (PTTYPE == 64 || is_pse(vcpu))) {
>>> -			walker->gfn = gpte_to_gfn_pde(pte);
>>> +			gfn_t gfn = gpte_to_gfn_pde(pte);
>>> +			walker->gfn = gfn;
>>> +
>>> 			walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
>>> 			if (PTTYPE == 32 && is_cpuid_PSE36())
>>> 				walker->gfn += pse36_gfn_delta(pte);
>>> +
>>> +			if (faulting
>>> +			    && is_largepage_backed(vcpu, gfn)
>>> +			    && is_physical_memory(vcpu->kvm, walker->gfn)) {
>>> +				walker->largepage_backed = 1;
>>> +				walker->gfn = gfn;
>>> +			}
>>
>> I don't like this bit. So far the walker has been independent of the
>> host state, and only depended on guest data. We can set
>> largepage_backed unconditionally on a large guest pte, leave gfn
>> containing the low-order bits, and leave the host checks for the actual
>> mapping logic.
>>
>>> /*
>>> @@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>>> 		shadow_ent = ((u64 *)__va(shadow_addr)) + index;
>>> 		if (level == PT_PAGE_TABLE_LEVEL)
>>> 			break;
>>> +		if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
>>> +			break;
>>> +
>>
>> (here)
>
> gfn_to_page() needs to grab the struct page corresponding to the large
> page, not the offset struct page for the faulting 4k address within
> the large frame. Since gfn_to_page can sleep, there is no way to do
> that in the mapping logic which happens under mmu_lock protection.
> We don't want to grab the large page frame "struct page" unless the
> is_largepage_backed() checks are successful.
>
> The checks could be done in page_fault() if walker->level == 2, before
> gfn_to_page()... But I don't see much difference between that and doing
> it inside walk_addr(). What do you say?

I'd like to keep walk_addr() independent of the rest of the mmu (i.e.
walk_addr is 100% guest oriented). Also, the issue you point out is
shared by direct_map, which doesn't call walk_addr().

An unrelated issue (pointed out by Jun Nakajima) is that this kills
dirty log tracking (needed for migration). It could be solved simply by
not using large page backing if dirty log tracking is enabled for that
slot.

--
Any sufficiently difficult bug is indistinguishable from a feature.
* Re: large page support for kvm
From: Marcelo Tosatti @ 2008-02-14 23:17 UTC
To: Avi Kivity; +Cc: kvm-devel

On Wed, Feb 13, 2008 at 08:45:51AM +0200, Avi Kivity wrote:
> > gfn_to_page() needs to grab the struct page corresponding to the large
> > page, not the offset struct page for the faulting 4k address within
> > the large frame. Since gfn_to_page can sleep, there is no way to do
> > that in the mapping logic which happens under mmu_lock protection.
> > We don't want to grab the large page frame "struct page" unless the
> > is_largepage_backed() checks are successful.
> >
> > The checks could be done in page_fault() if walker->level == 2, before
> > gfn_to_page()... But I don't see much difference between that and doing
> > it inside walk_addr(). What do you say?
>
> I'd like to keep walk_addr() independent of the rest of the mmu (i.e.
> walk_addr is 100% guest oriented). Also, the issue you point out is
> shared by direct_map, which doesn't call walk_addr().
>
> An unrelated issue (pointed out by Jun Nakajima) is that this kills
> dirty log tracking (needed for migration). It could be solved simply by
> not using large page backing if dirty log tracking is enabled for that
> slot.

Ok, fixed your comments and a bug where a root page was shadowed in the
large area being mapped. access.flat is happy.

Joerg, can you give this a try on an NPT-enabled system (you'll need the
attached qemu-largepage-hack.patch).
Thanks

Index: kvm.largepages/arch/x86/kvm/mmu.c
===================================================================
--- kvm.largepages.orig/arch/x86/kvm/mmu.c
+++ kvm.largepages/arch/x86/kvm/mmu.c
@@ -27,6 +27,7 @@
 #include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/swap.h>
+#include <linux/hugetlb.h>

 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -211,6 +212,11 @@ static int is_shadow_present_pte(u64 pte
 		&& pte != shadow_notrap_nonpresent_pte;
 }

+static int is_large_pte(u64 pte)
+{
+	return pte & PT_PAGE_SIZE_MASK;
+}
+
 static int is_writeble_pte(unsigned long pte)
 {
 	return pte & PT_WRITABLE_MASK;
@@ -350,17 +356,120 @@ static void mmu_free_rmap_desc(struct kv
 	kfree(rd);
 }

+static int hpage_align_diff(unsigned long gfn)
+{
+	return ((gfn+KVM_PAGES_PER_HPAGE-1) & ~(KVM_PAGES_PER_HPAGE-1)) - gfn;
+}
+
+/*
+ * Return the pointer to the largepage write count for a given
+ * gfn, handling slots that are not large page aligned.
+ */
+static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
+{
+	unsigned long idx;
+
+	idx = (gfn - slot->base_gfn) + hpage_align_diff(slot->base_gfn);
+	idx /= KVM_PAGES_PER_HPAGE;
+	return &slot->lpage_info[idx].write_count;
+}
+
+static void account_shadowed(struct kvm *kvm, gfn_t gfn)
+{
+	int *write_count;
+
+	write_count = slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn));
+	*write_count += 1;
+	WARN_ON(*write_count > KVM_PAGES_PER_HPAGE);
+}
+
+static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
+{
+	int *write_count;
+
+	write_count = slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn));
+	*write_count -= 1;
+	WARN_ON(*write_count < 0);
+}
+
+static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+	int *largepage_idx;
+
+	if (slot) {
+		largepage_idx = slot_largepage_idx(gfn, slot);
+		return *largepage_idx;
+	}
+
+	return 1;
+}
+
+static int host_largepage_backed(struct kvm *kvm, gfn_t gfn)
+{
+	struct vm_area_struct *vma;
+	unsigned long addr;
+
+	addr = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(addr))
+		return 0;
+
+	vma = find_vma(current->mm, addr);
+	if (vma && is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
+static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
+{
+	struct kvm_memory_slot *slot;
+
+	if (has_wrprotected_page(vcpu->kvm, large_gfn))
+		return 0;
+
+	if (!host_largepage_backed(vcpu->kvm, large_gfn))
+		return 0;
+
+	slot = gfn_to_memslot(vcpu->kvm, large_gfn);
+	if (slot && slot->dirty_bitmap)
+		return 0;
+
+	/* guest has 4M pages, host 2M */
+	if (!is_pae(vcpu) && HPAGE_SHIFT == 21)
+		return 0;
+
+	return 1;
+}
+
+static int is_physical_memory(struct kvm *kvm, gfn_t gfn)
+{
+	unsigned long addr;
+
+	addr = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(addr))
+		return 0;
+
+	return 1;
+}
+
 /*
  * Take gfn and return the reverse mapping to it.
  * Note: gfn must be unaliased before this function get called
  */
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn)
+static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int lpage)
 {
 	struct kvm_memory_slot *slot;
+	unsigned long idx;

 	slot = gfn_to_memslot(kvm, gfn);
-	return &slot->rmap[gfn - slot->base_gfn];
+	if (!lpage)
+		return &slot->rmap[gfn - slot->base_gfn];
+
+	idx = gfn - slot->base_gfn + hpage_align_diff(slot->base_gfn);
+	idx /= KVM_PAGES_PER_HPAGE;
+	return &slot->lpage_info[idx].rmap_pde;
 }

 /*
@@ -372,7 +481,7 @@ static unsigned long *gfn_to_rmap(struct
  * If rmapp bit zero is one, (then rmap & ~1) points to a struct kvm_rmap_desc
  * containing more mappings.
  */
-static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
+static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int lpage)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_desc *desc;
@@ -384,7 +493,7 @@ static void rmap_add(struct kvm_vcpu *vc
 	gfn = unalias_gfn(vcpu->kvm, gfn);
 	sp = page_header(__pa(spte));
 	sp->gfns[spte - sp->spt] = gfn;
-	rmapp = gfn_to_rmap(vcpu->kvm, gfn);
+	rmapp = gfn_to_rmap(vcpu->kvm, gfn, lpage);
 	if (!*rmapp) {
 		rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
 		*rmapp = (unsigned long)spte;
@@ -450,7 +559,7 @@ static void rmap_remove(struct kvm *kvm,
 		kvm_release_page_dirty(page);
 	else
 		kvm_release_page_clean(page);
-	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
+	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
 	if (!*rmapp) {
 		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
 		BUG();
@@ -516,7 +625,7 @@ static void rmap_write_protect(struct kv
 	int write_protected = 0;

 	gfn = unalias_gfn(kvm, gfn);
-	rmapp = gfn_to_rmap(kvm, gfn);
+	rmapp = gfn_to_rmap(kvm, gfn, 0);

 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
@@ -529,8 +638,27 @@ static void rmap_write_protect(struct kv
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
+	/* check for huge page mappings */
+	rmapp = gfn_to_rmap(kvm, gfn, 1);
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!spte);
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) !=
+		       (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+		pgprintk("rmap_write_protect(large): spte %p %llx %lld\n",
+			 spte, *spte, gfn);
+		if (is_writeble_pte(*spte)) {
+			rmap_remove(kvm, spte);
+			--kvm->stat.lpages;
+			set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+			write_protected = 1;
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+
 	if (write_protected)
 		kvm_flush_remote_tlbs(kvm);
+
+	account_shadowed(kvm, gfn);
 }

 #ifdef MMU_DEBUG
@@ -750,11 +878,17 @@ static void kvm_mmu_page_unlink_children
 	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
 		ent = pt[i];

+		if (is_shadow_present_pte(ent)) {
+			if (!is_large_pte(ent)) {
+				ent &= PT64_BASE_ADDR_MASK;
+				mmu_page_remove_parent_pte(page_header(ent),
+							   &pt[i]);
+			} else {
+				--kvm->stat.lpages;
+				rmap_remove(kvm, &pt[i]);
+			}
+		}
 		pt[i] = shadow_trap_nonpresent_pte;
-		if (!is_shadow_present_pte(ent))
-			continue;
-		ent &= PT64_BASE_ADDR_MASK;
-		mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
 	}
 	kvm_flush_remote_tlbs(kvm);
 }
@@ -794,6 +928,8 @@ static void kvm_mmu_zap_page(struct kvm
 	}
 	kvm_mmu_page_unlink_children(kvm, sp);
 	if (!sp->root_count) {
+		if (!sp->role.metaphysical)
+			unaccount_shadowed(kvm, sp->gfn);
 		hlist_del(&sp->hash_link);
 		kvm_mmu_free_page(kvm, sp);
 	} else
@@ -894,12 +1030,28 @@ struct page *gva_to_page(struct kvm_vcpu
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
 			 unsigned pt_access, unsigned pte_access,
 			 int user_fault, int write_fault, int dirty,
-			 int *ptwrite, gfn_t gfn, struct page *page)
+			 int *ptwrite, int largepage, gfn_t gfn,
+			 struct page *page)
 {
 	u64 spte;
 	int was_rmapped = is_rmap_pte(*shadow_pte);
 	int was_writeble = is_writeble_pte(*shadow_pte);

+	/*
+	 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
+	 * the parent of the now unreachable PTE.
+	 */
+	if (largepage) {
+		if (was_rmapped && !is_large_pte(*shadow_pte)) {
+			struct kvm_mmu_page *child;
+			u64 pte = *shadow_pte;
+
+			child = page_header(pte & PT64_BASE_ADDR_MASK);
+			mmu_page_remove_parent_pte(child, shadow_pte);
+		}
+		was_rmapped = is_large_pte(*shadow_pte);
+	}
+
 	pgprintk("%s: spte %llx access %x write_fault %d"
 		 " user_fault %d gfn %lx\n",
 		 __FUNCTION__, *shadow_pte, pt_access,
@@ -919,6 +1071,8 @@ static void mmu_set_spte(struct kvm_vcpu
 		spte |= PT_PRESENT_MASK;
 	if (pte_access & ACC_USER_MASK)
 		spte |= PT_USER_MASK;
+	if (largepage)
+		spte |= PT_PAGE_SIZE_MASK;

 	spte |= page_to_phys(page);

@@ -933,7 +1087,8 @@ static void mmu_set_spte(struct kvm_vcpu
 		}

 		shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
-		if (shadow) {
+		if (shadow ||
+		   (largepage && has_wrprotected_page(vcpu->kvm, gfn))) {
 			pgprintk("%s: found shadow page for %lx, marking ro\n",
 				 __FUNCTION__, gfn);
 			pte_access &= ~ACC_WRITE_MASK;
@@ -941,6 +1096,18 @@ static void mmu_set_spte(struct kvm_vcpu
 				spte &= ~PT_WRITABLE_MASK;
 				kvm_x86_ops->tlb_flush(vcpu);
 			}
+			/*
+			 * Largepage creation is susceptible to a upper-level
+			 * table to be shadowed and write-protected in the
+			 * area being mapped. If that is the case, invalidate
+			 * the entry and let the instruction fault again
+			 * and use 4K mappings.
+			 */
+			if (largepage) {
+				spte = shadow_trap_nonpresent_pte;
+				kvm_x86_ops->tlb_flush(vcpu);
+				goto unshadowed;
+			}
 			if (write_fault)
 				*ptwrite = 1;
 		}
@@ -952,10 +1119,17 @@ unshadowed:
 		mark_page_dirty(vcpu->kvm, gfn);

 	pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
+	pgprintk("instantiating %s PTE (%s) at %d (%llx)\n",
+		 (spte&PT_PAGE_SIZE_MASK)? "2MB" : "4kB",
+		 (spte&PT_WRITABLE_MASK)?"RW":"R", gfn, spte);
 	set_shadow_pte(shadow_pte, spte);
+	if (!was_rmapped && (spte & PT_PAGE_SIZE_MASK)
+	    && (spte & PT_PRESENT_MASK))
+		++vcpu->kvm->stat.lpages;
+
 	page_header_update_slot(vcpu->kvm, shadow_pte, gfn);
 	if (!was_rmapped) {
-		rmap_add(vcpu, shadow_pte, gfn);
+		rmap_add(vcpu, shadow_pte, gfn, largepage);
 		if (!is_rmap_pte(*shadow_pte))
 			kvm_release_page_clean(page);
 	} else {
@@ -973,7 +1147,8 @@ static void nonpaging_new_cr3(struct kvm
 }

 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-			gfn_t gfn, struct page *page, int level)
+			int largepage, gfn_t gfn, struct page *page,
+			int level)
 {
 	hpa_t table_addr = vcpu->arch.mmu.root_hpa;
 	int pt_write = 0;
@@ -987,7 +1162,13 @@ static int __direct_map(struct kvm_vcpu

 		if (level == 1) {
 			mmu_set_spte(vcpu, &table[index], ACC_ALL, ACC_ALL,
-				     0, write, 1, &pt_write, gfn, page);
+				     0, write, 1, &pt_write, 0, gfn, page);
+			return pt_write;
+		}
+
+		if (largepage && level == 2) {
+			mmu_set_spte(vcpu, &table[index], ACC_ALL, ACC_ALL,
+				     0, write, 1, &pt_write, 1, gfn, page);
 			return pt_write;
 		}

@@ -1017,12 +1198,19 @@ static int __direct_map(struct kvm_vcpu
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 {
 	int r;
+	int largepage = 0;
 	struct page *page;

 	down_read(&vcpu->kvm->slots_lock);

 	down_read(&current->mm->mmap_sem);
+	if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))
+	    && is_physical_memory(vcpu->kvm, gfn)) {
+		gfn &= ~(KVM_PAGES_PER_HPAGE-1);
+		largepage = 1;
+	}
+
 	page = gfn_to_page(vcpu->kvm, gfn);
 	up_read(&current->mm->mmap_sem);

@@ -1035,7 +1223,8 @@ static int nonpaging_map(struct kvm_vcpu

 	spin_lock(&vcpu->kvm->mmu_lock);
 	kvm_mmu_free_some_pages(vcpu);
-	r = __direct_map(vcpu, v, write, gfn, page, PT32E_ROOT_LEVEL);
+	r = __direct_map(vcpu, v, write, largepage, gfn, page,
+			 PT32E_ROOT_LEVEL);
 	spin_unlock(&vcpu->kvm->mmu_lock);

 	up_read(&vcpu->kvm->slots_lock);

@@ -1166,6 +1355,8 @@ static int tdp_page_fault(struct kvm_vcp
 {
struct page *page; int r; + int largepage = 0; + gfn_t gfn = gpa >> PAGE_SHIFT; ASSERT(vcpu); ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa)); @@ -1175,7 +1366,12 @@ static int tdp_page_fault(struct kvm_vcp return r; down_read(¤t->mm->mmap_sem); - page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT); + if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1)) + && is_physical_memory(vcpu->kvm, gfn)) { + gfn &= ~(KVM_PAGES_PER_HPAGE-1); + largepage = 1; + } + page = gfn_to_page(vcpu->kvm, gfn); if (is_error_page(page)) { kvm_release_page_clean(page); up_read(¤t->mm->mmap_sem); @@ -1184,7 +1380,7 @@ static int tdp_page_fault(struct kvm_vcp spin_lock(&vcpu->kvm->mmu_lock); kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK, - gpa >> PAGE_SHIFT, page, TDP_ROOT_LEVEL); + largepage, gfn, page, TDP_ROOT_LEVEL); spin_unlock(&vcpu->kvm->mmu_lock); up_read(¤t->mm->mmap_sem); @@ -1383,7 +1579,8 @@ static void mmu_pte_write_zap_pte(struct pte = *spte; if (is_shadow_present_pte(pte)) { - if (sp->role.level == PT_PAGE_TABLE_LEVEL) + if (sp->role.level == PT_PAGE_TABLE_LEVEL || + is_large_pte(pte)) rmap_remove(vcpu->kvm, spte); else { child = page_header(pte & PT64_BASE_ADDR_MASK); @@ -1391,6 +1588,8 @@ static void mmu_pte_write_zap_pte(struct } } set_shadow_pte(spte, shadow_trap_nonpresent_pte); + if (is_large_pte(pte)) + --vcpu->kvm->stat.lpages; } static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu, @@ -1398,7 +1597,8 @@ static void mmu_pte_write_new_pte(struct u64 *spte, const void *new) { - if (sp->role.level != PT_PAGE_TABLE_LEVEL) { + if ((sp->role.level != PT_PAGE_TABLE_LEVEL) + && !vcpu->arch.update_pte.largepage) { ++vcpu->kvm->stat.mmu_pde_zapped; return; } @@ -1446,6 +1646,8 @@ static void mmu_guess_page_from_pte_writ u64 gpte = 0; struct page *page; + vcpu->arch.update_pte.largepage = 0; + if (bytes != 4 && bytes != 8) return; @@ -1474,6 +1676,10 @@ static void mmu_guess_page_from_pte_writ gfn = (gpte & PT64_BASE_ADDR_MASK) >> 
PAGE_SHIFT; down_read(¤t->mm->mmap_sem); + if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) { + gfn &= ~(KVM_PAGES_PER_HPAGE-1); + vcpu->arch.update_pte.largepage = 1; + } page = gfn_to_page(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); Index: kvm.largepages/arch/x86/kvm/paging_tmpl.h =================================================================== --- kvm.largepages.orig/arch/x86/kvm/paging_tmpl.h +++ kvm.largepages/arch/x86/kvm/paging_tmpl.h @@ -248,6 +248,7 @@ static void FNAME(update_pte)(struct kvm pt_element_t gpte; unsigned pte_access; struct page *npage; + int largepage = vcpu->arch.update_pte.largepage; gpte = *(const pt_element_t *)pte; if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) { @@ -264,7 +265,8 @@ static void FNAME(update_pte)(struct kvm return; get_page(npage); mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0, - gpte & PT_DIRTY_MASK, NULL, gpte_to_gfn(gpte), npage); + gpte & PT_DIRTY_MASK, NULL, largepage, gpte_to_gfn(gpte), + npage); } /* @@ -272,8 +274,8 @@ static void FNAME(update_pte)(struct kvm */ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, struct guest_walker *walker, - int user_fault, int write_fault, int *ptwrite, - struct page *page) + int user_fault, int write_fault, int largepage, + int *ptwrite, struct page *page) { hpa_t shadow_addr; int level; @@ -302,6 +304,10 @@ static u64 *FNAME(fetch)(struct kvm_vcpu shadow_ent = ((u64 *)__va(shadow_addr)) + index; if (level == PT_PAGE_TABLE_LEVEL) break; + + if (largepage && level == PT_DIRECTORY_LEVEL) + break; + if (is_shadow_present_pte(*shadow_ent)) { shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK; continue; @@ -340,7 +346,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu mmu_set_spte(vcpu, shadow_ent, access, walker->pte_access & access, user_fault, write_fault, walker->ptes[walker->level-1] & PT_DIRTY_MASK, - ptwrite, walker->gfn, page); + ptwrite, largepage, walker->gfn, page); return shadow_ent; } @@ -370,6 +376,7 @@ static int 
FNAME(page_fault)(struct kvm_ int write_pt = 0; int r; struct page *page; + int largepage = 0; pgprintk("%s: addr %lx err %x\n", __FUNCTION__, addr, error_code); kvm_mmu_audit(vcpu, "pre page fault"); @@ -397,6 +404,15 @@ static int FNAME(page_fault)(struct kvm_ } down_read(¤t->mm->mmap_sem); + if (walker.level == PT_DIRECTORY_LEVEL) { + gfn_t large_gfn; + large_gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE-1); + if (is_largepage_backed(vcpu, large_gfn) + && is_physical_memory(vcpu->kvm, walker.gfn)) { + walker.gfn = large_gfn; + largepage = 1; + } + } page = gfn_to_page(vcpu->kvm, walker.gfn); up_read(¤t->mm->mmap_sem); @@ -411,7 +427,7 @@ static int FNAME(page_fault)(struct kvm_ spin_lock(&vcpu->kvm->mmu_lock); kvm_mmu_free_some_pages(vcpu); shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault, - &write_pt, page); + largepage, &write_pt, page); pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __FUNCTION__, shadow_pte, *shadow_pte, write_pt); Index: kvm.largepages/arch/x86/kvm/x86.c =================================================================== --- kvm.largepages.orig/arch/x86/kvm/x86.c +++ kvm.largepages/arch/x86/kvm/x86.c @@ -86,6 +86,7 @@ struct kvm_stats_debugfs_item debugfs_en { "mmu_recycled", VM_STAT(mmu_recycled) }, { "mmu_cache_miss", VM_STAT(mmu_cache_miss) }, { "remote_tlb_flush", VM_STAT(remote_tlb_flush) }, + { "lpages", VM_STAT(lpages) }, { NULL } }; Index: kvm.largepages/include/asm-x86/kvm_host.h =================================================================== --- kvm.largepages.orig/include/asm-x86/kvm_host.h +++ kvm.largepages/include/asm-x86/kvm_host.h @@ -38,6 +38,13 @@ #define INVALID_PAGE (~(hpa_t)0) #define UNMAPPED_GVA (~(gpa_t)0) +/* shadow tables are PAE even on non-PAE hosts */ +#define KVM_HPAGE_SHIFT 21 +#define KVM_HPAGE_SIZE (1UL << KVM_HPAGE_SHIFT) +#define KVM_HPAGE_MASK (~(KVM_HPAGE_SIZE - 1)) + +#define KVM_PAGES_PER_HPAGE (KVM_HPAGE_SIZE / PAGE_SIZE) + #define DE_VECTOR 0 #define UD_VECTOR 6 #define NM_VECTOR 
7 @@ -228,6 +235,7 @@ struct kvm_vcpu_arch { struct { gfn_t gfn; /* presumed gfn during guest pte update */ struct page *page; /* page corresponding to that gfn */ + int largepage; } update_pte; struct i387_fxsave_struct host_fx_image; @@ -298,6 +306,7 @@ struct kvm_vm_stat { u32 mmu_recycled; u32 mmu_cache_miss; u32 remote_tlb_flush; + u32 lpages; }; struct kvm_vcpu_stat { Index: kvm.largepages/include/linux/kvm_host.h =================================================================== --- kvm.largepages.orig/include/linux/kvm_host.h +++ kvm.largepages/include/linux/kvm_host.h @@ -102,6 +102,10 @@ struct kvm_memory_slot { unsigned long flags; unsigned long *rmap; unsigned long *dirty_bitmap; + struct { + unsigned long rmap_pde; + int write_count; + } *lpage_info; unsigned long userspace_addr; int user_alloc; }; @@ -168,6 +172,7 @@ int kvm_arch_set_memory_region(struct kv int user_alloc); gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn); struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); +unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn); void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, Index: kvm.largepages/virt/kvm/kvm_main.c =================================================================== --- kvm.largepages.orig/virt/kvm/kvm_main.c +++ kvm.largepages/virt/kvm/kvm_main.c @@ -189,9 +189,13 @@ static void kvm_free_physmem_slot(struct if (!dont || free->dirty_bitmap != dont->dirty_bitmap) vfree(free->dirty_bitmap); + if (!dont || free->lpage_info != dont->lpage_info) + vfree(free->lpage_info); + free->npages = 0; free->dirty_bitmap = NULL; free->rmap = NULL; + free->lpage_info = NULL; } void kvm_free_physmem(struct kvm *kvm) @@ -301,6 +305,22 @@ int __kvm_set_memory_region(struct kvm * new.user_alloc = user_alloc; new.userspace_addr = mem->userspace_addr; } + if (npages && !new.lpage_info) { + int largepages = npages / KVM_PAGES_PER_HPAGE; 
+ if (npages % KVM_PAGES_PER_HPAGE) + largepages++; + new.lpage_info = vmalloc(largepages * sizeof(*new.lpage_info)); + + if (!new.lpage_info) + goto out_free; + + memset(new.lpage_info, 0, largepages * sizeof(*new.lpage_info)); + /* large page crosses memslot boundary */ + if (npages % KVM_PAGES_PER_HPAGE) { + new.lpage_info[0].write_count = 1; + new.lpage_info[largepages-1].write_count = 1; + } + } /* Allocate page dirty bitmap if needed */ if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) { @@ -444,7 +464,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, } EXPORT_SYMBOL_GPL(kvm_is_visible_gfn); -static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) +unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) { struct kvm_memory_slot *slot; @@ -454,6 +474,7 @@ static unsigned long gfn_to_hva(struct k return bad_hva(); return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE); } +EXPORT_SYMBOL(gfn_to_hva); /* * Requires current->mm->mmap_sem to be held [-- Attachment #2: qemu-largepage-hack.patch --] [-- Type: text/plain, Size: 1302 bytes --] Index: kvm-userspace/qemu/vl.c =================================================================== --- kvm-userspace.orig/qemu/vl.c +++ kvm-userspace/qemu/vl.c @@ -8501,6 +8501,31 @@ void qemu_get_launch_info(int *argc, cha *opt_incoming = incoming; } +#define HPAGE_SIZE 2*1024*1024 + +void *alloc_huge_area(unsigned long memory) +{ + void *area; + int fd; + char path[] = "/mnt/kvm.XXXXXX"; + + mkstemp(path); + fd = open(path, O_RDWR); + if (fd < 0) { + perror("open"); + exit(0); + } + memory = (memory+HPAGE_SIZE-1) & ~(HPAGE_SIZE-1); + + area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); + if (area == MAP_FAILED) { + perror("mmap"); + exit(0); + } + + return area; +} + int main(int argc, char **argv) { #ifdef CONFIG_GDBSTUB @@ -9330,9 +9355,9 @@ int main(int argc, char **argv) ret = kvm_qemu_check_extension(KVM_CAP_USER_MEMORY); if (ret) { - printf("allocating %d MB\n", 
phys_ram_size/1024/1024); - phys_ram_base = qemu_vmalloc(phys_ram_size); - if (!phys_ram_base) { + //phys_ram_base = qemu_vmalloc(phys_ram_size); + phys_ram_base = alloc_huge_area(phys_ram_size); + if (!phys_ram_base) { fprintf(stderr, "Could not allocate physical memory\n"); exit(1); } [-- Attachment #3: Type: text/plain, Size: 228 bytes --] ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ [-- Attachment #4: Type: text/plain, Size: 158 bytes --] _______________________________________________ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: large page support for kvm 2008-02-14 23:17 ` Marcelo Tosatti @ 2008-02-15 7:40 ` Roedel, Joerg 2008-02-17 9:38 ` Avi Kivity 1 sibling, 0 replies; 14+ messages in thread From: Roedel, Joerg @ 2008-02-15 7:40 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: kvm-devel, Avi Kivity On Fri, Feb 15, 2008 at 12:17:39AM +0100, Marcelo Tosatti wrote: > > On Wed, Feb 13, 2008 at 08:45:51AM +0200, Avi Kivity wrote: > > >gfn_to_page() needs to grab the struct page corresponding to the large > > >page, not the offset struct page for the faulting 4k address within > > >the large frame. Since gfn_to_page can sleep, there is no way to do > > >that in the mapping logic which happens under mmu_lock protection. > > >We don't want to grab the large page frame "struct page" unless the > > >is_largepage_backed() checks are successful. > > > > > >The checks could be done in page_fault() if walker->level == 2, before > > >gfn_to_page()... But I don't see much difference of that and doing > > >it inside walk_addr(). What do you say? > > > > > > > > > > I'd like to keep walk_addr() independent of the rest of the mmu (i.e. > > walk_addr is 100% guest oriented). Also, the issue you point out is > > shared by direct_map which doesn't call walk_addr(). > > > > An unrelated issue (pointed out by Jun Nakajima) is that this kills > > dirty log tracking (needed for migration). It could be solved simply by > > not using large page backing if dirty log tracking is enabled for that slot. > > Ok, fixed your comments and a bug which a root page was shadowed in the > large area being mapped. access.flat is happy. > > Joerg, can you give this a try on a NPT-enabled system (need the > attached qemu-largepage-hack.patch). Yeah. I will give it a try today. I am very curious about the performance numbers. Joerg -- | AMD Saxony Limited Liability Company & Co. KG Operating | Wilschdorfer Landstr. 
101, 01109 Dresden, Germany System | Register Court Dresden: HRA 4896 Research | General Partner authorized to represent: Center | AMD Saxony LLC (Wilmington, Delaware, US) | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: large page support for kvm 2008-02-14 23:17 ` Marcelo Tosatti 2008-02-15 7:40 ` Roedel, Joerg @ 2008-02-17 9:38 ` Avi Kivity 2008-02-19 20:37 ` Marcelo Tosatti 1 sibling, 1 reply; 14+ messages in thread From: Avi Kivity @ 2008-02-17 9:38 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: kvm-devel Marcelo Tosatti wrote: > On Wed, Feb 13, 2008 at 08:45:51AM +0200, Avi Kivity wrote: > >>> gfn_to_page() needs to grab the struct page corresponding to the large >>> page, not the offset struct page for the faulting 4k address within >>> the large frame. Since gfn_to_page can sleep, there is no way to do >>> that in the mapping logic which happens under mmu_lock protection. >>> We don't want to grab the large page frame "struct page" unless the >>> is_largepage_backed() checks are successful. >>> >>> The checks could be done in page_fault() if walker->level == 2, before >>> gfn_to_page()... But I don't see much difference of that and doing >>> it inside walk_addr(). What do you say? >>> >>> >>> >> I'd like to keep walk_addr() independent of the rest of the mmu (i.e. >> walk_addr is 100% guest oriented). Also, the issue you point out is >> shared by direct_map which doesn't call walk_addr(). >> >> An unrelated issue (pointed out by Jun Nakajima) is that this kills >> dirty log tracking (needed for migration). It could be solved simply by >> not using large page backing if dirty log tracking is enabled for that slot. >> > > Ok, fixed your comments and a bug which a root page was shadowed in the > large area being mapped. access.flat is happy. > > Joerg, can you give this a try on a NPT-enabled system (need the > attached qemu-largepage-hack.patch). 
> > Thanks > > Index: kvm.largepages/arch/x86/kvm/mmu.c > =================================================================== > --- kvm.largepages.orig/arch/x86/kvm/mmu.c > +++ kvm.largepages/arch/x86/kvm/mmu.c > @@ -27,6 +27,7 @@ > #include <linux/highmem.h> > #include <linux/module.h> > #include <linux/swap.h> > +#include <linux/hugetlb.h> > > #include <asm/page.h> > #include <asm/cmpxchg.h> > @@ -211,6 +212,11 @@ static int is_shadow_present_pte(u64 pte > && pte != shadow_notrap_nonpresent_pte; > } > > +static int is_large_pte(u64 pte) > +{ > + return pte & PT_PAGE_SIZE_MASK; > +} > + > static int is_writeble_pte(unsigned long pte) > { > return pte & PT_WRITABLE_MASK; > @@ -350,17 +356,120 @@ static void mmu_free_rmap_desc(struct kv > kfree(rd); > } > > +static int hpage_align_diff(unsigned long gfn) > +{ > + return ((gfn+KVM_PAGES_PER_HPAGE-1) & ~(KVM_PAGES_PER_HPAGE-1)) - gfn; > +} > + > +/* > + * Return the pointer to the largepage write count for a given > + * gfn, handling slots that are not large page aligned. > + */ > +static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot) > +{ > + unsigned long idx; > + > + idx = (gfn - slot->base_gfn) + hpage_align_diff(slot->base_gfn); > + idx /= KVM_PAGES_PER_HPAGE; > + return &slot->lpage_info[idx].write_count; > +} > Can be further simplified to (gfn / KVM_PAGES_PER_HPAGE) - (slot->base_gfn / KVM_PAGES_PER_HPAGE). Sorry for not noticing earlier. > + > +static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn) > +{ > + struct kvm_memory_slot *slot; > + > + if (has_wrprotected_page(vcpu->kvm, large_gfn)) > + return 0; > + > + if (!host_largepage_backed(vcpu->kvm, large_gfn)) > + return 0; > + > + slot = gfn_to_memslot(vcpu->kvm, large_gfn); > + if (slot && slot->dirty_bitmap) > + return 0; > + > + /* guest has 4M pages, host 2M */ > + if (!is_pae(vcpu) && HPAGE_SHIFT == 21) > + return 0; > Is this check necessary? I think that if we remove it things will just work. 
A 4MB page will be have either one or two 2MB sptes (which may even belong to different slots). > @@ -894,12 +1030,28 @@ struct page *gva_to_page(struct kvm_vcpu > static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte, > unsigned pt_access, unsigned pte_access, > int user_fault, int write_fault, int dirty, > - int *ptwrite, gfn_t gfn, struct page *page) > + int *ptwrite, int largepage, gfn_t gfn, > + struct page *page) > { > u64 spte; > int was_rmapped = is_rmap_pte(*shadow_pte); > int was_writeble = is_writeble_pte(*shadow_pte); > > + /* > + * If we overwrite a PTE page pointer with a 2MB PMD, unlink > + * the parent of the now unreachable PTE. > + */ > + if (largepage) { > + if (was_rmapped && !is_large_pte(*shadow_pte)) { > + struct kvm_mmu_page *child; > + u64 pte = *shadow_pte; > + > + child = page_header(pte & PT64_BASE_ADDR_MASK); > + mmu_page_remove_parent_pte(child, shadow_pte); > + } > + was_rmapped = is_large_pte(*shadow_pte); > + } > + > pgprintk("%s: spte %llx access %x write_fault %d" > " user_fault %d gfn %lx\n", > __FUNCTION__, *shadow_pte, pt_access, > @@ -919,6 +1071,8 @@ static void mmu_set_spte(struct kvm_vcpu > spte |= PT_PRESENT_MASK; > if (pte_access & ACC_USER_MASK) > spte |= PT_USER_MASK; > + if (largepage) > + spte |= PT_PAGE_SIZE_MASK; > > spte |= page_to_phys(page); > > @@ -933,7 +1087,8 @@ static void mmu_set_spte(struct kvm_vcpu > } > > shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn); > - if (shadow) { > + if (shadow || > + (largepage && has_wrprotected_page(vcpu->kvm, gfn))) { > pgprintk("%s: found shadow page for %lx, marking ro\n", > __FUNCTION__, gfn); > pte_access &= ~ACC_WRITE_MASK; > @@ -941,6 +1096,18 @@ static void mmu_set_spte(struct kvm_vcpu > spte &= ~PT_WRITABLE_MASK; > kvm_x86_ops->tlb_flush(vcpu); > } > + /* > + * Largepage creation is susceptible to a upper-level > + * table to be shadowed and write-protected in the > + * area being mapped. 
If that is the case, invalidate > + * the entry and let the instruction fault again > + * and use 4K mappings. > + */ > + if (largepage) { > + spte = shadow_trap_nonpresent_pte; > + kvm_x86_ops->tlb_flush(vcpu); > + goto unshadowed; > + } > Would it not repeat exactly the same code path? Or is this just for the case of the pte_update path? > - page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT); > + if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1)) > + && is_physical_memory(vcpu->kvm, gfn)) { > + gfn &= ~(KVM_PAGES_PER_HPAGE-1); > + largepage = 1; > + } > Doesn't is_largepage_backed() imply is_physical_memory? > > Index: kvm.largepages/arch/x86/kvm/x86.c > =================================================================== > --- kvm.largepages.orig/arch/x86/kvm/x86.c > +++ kvm.largepages/arch/x86/kvm/x86.c > @@ -86,6 +86,7 @@ struct kvm_stats_debugfs_item debugfs_en > { "mmu_recycled", VM_STAT(mmu_recycled) }, > { "mmu_cache_miss", VM_STAT(mmu_cache_miss) }, > { "remote_tlb_flush", VM_STAT(remote_tlb_flush) }, > + { "lpages", VM_STAT(lpages) }, > { NULL } > }; > s/lpages/largepages/, this is user visible. > + new.lpage_info = vmalloc(largepages * sizeof(*new.lpage_info)); > + > + if (!new.lpage_info) > + goto out_free; > + > + memset(new.lpage_info, 0, largepages * sizeof(*new.lpage_info)); > + /* large page crosses memslot boundary */ > + if (npages % KVM_PAGES_PER_HPAGE) { > + new.lpage_info[0].write_count = 1; > This seems wrong, say a 3MB slot at 1GB, you kill the first largepage which is good. > + new.lpage_info[largepages-1].write_count = 1; > OTOH, a 3MB slot at 3MB, the last page is fine. The check needs to be against base_gfn and base_gfn + npages, not the number of pages. 
> + } > > + } > > /* Allocate page dirty bitmap if needed */ > if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) { > @@ -444,7 +464,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, > } > EXPORT_SYMBOL_GPL(kvm_is_visible_gfn); > > -static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) > +unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) > { > struct kvm_memory_slot *slot; > > @@ -454,6 +474,7 @@ static unsigned long gfn_to_hva(struct k > return bad_hva(); > return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE); > } > +EXPORT_SYMBOL(gfn_to_hva); > > /* > * Requires current->mm->mmap_sem to be held > > ------------------------------------------------------------------------ > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > ------------------------------------------------------------------------ > > _______________________________________________ > kvm-devel mailing list > kvm-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/kvm-devel -- Any sufficiently difficult bug is indistinguishable from a feature. ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: large page support for kvm 2008-02-17 9:38 ` Avi Kivity @ 2008-02-19 20:37 ` Marcelo Tosatti 2008-02-20 14:25 ` Avi Kivity 0 siblings, 1 reply; 14+ messages in thread From: Marcelo Tosatti @ 2008-02-19 20:37 UTC (permalink / raw) To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel On Sun, Feb 17, 2008 at 11:38:51AM +0200, Avi Kivity wrote: > >+ * Return the pointer to the largepage write count for a given > >+ * gfn, handling slots that are not large page aligned. > >+ */ > >+static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot) > >+{ > >+ unsigned long idx; > >+ > >+ idx = (gfn - slot->base_gfn) + hpage_align_diff(slot->base_gfn); > >+ idx /= KVM_PAGES_PER_HPAGE; > >+ return &slot->lpage_info[idx].write_count; > >+} > > > > Can be further simplified to (gfn / KVM_PAGES_PER_HPAGE) - > (slot->base_gfn / KVM_PAGES_PER_HPAGE). Sorry for not noticing earlier. Right. > >+ /* guest has 4M pages, host 2M */ > >+ if (!is_pae(vcpu) && HPAGE_SHIFT == 21) > >+ return 0; > > > > Is this check necessary? I think that if we remove it things will just > work. A 4MB page will be have either one or two 2MB sptes (which may > even belong to different slots). You mentioned that before, I agree its not necessary. > >+ /* > >+ * Largepage creation is susceptible to a upper-level > >+ * table to be shadowed and write-protected in the > >+ * area being mapped. If that is the case, invalidate > >+ * the entry and let the instruction fault again > >+ * and use 4K mappings. > >+ */ > >+ if (largepage) { > >+ spte = shadow_trap_nonpresent_pte; > >+ kvm_x86_ops->tlb_flush(vcpu); > >+ goto unshadowed; > >+ } > > > > Would it not repeat exactly the same code path? Or is this just for the > case of the pte_update path? The problem is if the instruction writing to one of the roots can't be emulated. kvm_mmu_unprotect_page() does not know about largepages, so it will zap a gfn inside the large page frame, but not the large translation itself. 
And zapping the gfn brings the shadowed page count in large area to zero, allowing has_wrprotected_page() to succeed. Endless unfixable write faults. > >- page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT); > >+ if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1)) > >+ && is_physical_memory(vcpu->kvm, gfn)) { > >+ gfn &= ~(KVM_PAGES_PER_HPAGE-1); > >+ largepage = 1; > >+ } > > > > Doesn't is_largepage_backed() imply is_physical_memory? Hum. I'll verify... it seems that now that the ends of the slots have write_count set to 1 that should be true. > > > >Index: kvm.largepages/arch/x86/kvm/x86.c > >=================================================================== > >--- kvm.largepages.orig/arch/x86/kvm/x86.c > >+++ kvm.largepages/arch/x86/kvm/x86.c > >@@ -86,6 +86,7 @@ struct kvm_stats_debugfs_item debugfs_en > > { "mmu_recycled", VM_STAT(mmu_recycled) }, > > { "mmu_cache_miss", VM_STAT(mmu_cache_miss) }, > > { "remote_tlb_flush", VM_STAT(remote_tlb_flush) }, > >+ { "lpages", VM_STAT(lpages) }, > > { NULL } > > }; > > > > s/lpages/largepages/, this is user visible. OK. > >+ new.lpage_info = vmalloc(largepages * > >sizeof(*new.lpage_info)); > >+ > >+ if (!new.lpage_info) > >+ goto out_free; > >+ > >+ memset(new.lpage_info, 0, largepages * > >sizeof(*new.lpage_info)); > >+ /* large page crosses memslot boundary */ > >+ if (npages % KVM_PAGES_PER_HPAGE) { > >+ new.lpage_info[0].write_count = 1; > > > > This seems wrong, say a 3MB slot at 1GB, you kill the first largepage > which is good. > > >+ new.lpage_info[largepages-1].write_count = 1; > > > > OTOH, a 3MB slot at 3MB, the last page is fine. The check needs to be > against base_gfn and base_gfn + npages, not the number of pages. Right, will fix. Will post an uptodated patch soon. ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: large page support for kvm 2008-02-19 20:37 ` Marcelo Tosatti @ 2008-02-20 14:25 ` Avi Kivity 2008-02-22 2:01 ` Marcelo Tosatti 0 siblings, 1 reply; 14+ messages in thread From: Avi Kivity @ 2008-02-20 14:25 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: kvm-devel Marcelo Tosatti wrote: > >>> + /* >>> + * Largepage creation is susceptible to a upper-level >>> + * table to be shadowed and write-protected in the >>> + * area being mapped. If that is the case, invalidate >>> + * the entry and let the instruction fault again >>> + * and use 4K mappings. >>> + */ >>> + if (largepage) { >>> + spte = shadow_trap_nonpresent_pte; >>> + kvm_x86_ops->tlb_flush(vcpu); >>> + goto unshadowed; >>> + } >>> >>> >> Would it not repeat exactly the same code path? Or is this just for the >> case of the pte_update path? >> > > The problem is if the instruction writing to one of the roots can't be > emulated. > > kvm_mmu_unprotect_page() does not know about largepages, so it will zap > a gfn inside the large page frame, but not the large translation itself. > > And zapping the gfn brings the shadowed page count in large area to > zero, allowing has_wrprotected_page() to succeed. Endless unfixable > write faults. > > I don't follow. Can you describe the scenario in more detail? The state of the guest and shadow page tables, and what actually happens? Setting spte to a nonpresent pte seems to violate the rmap btw; rmap always expects a valid pte pointing at the page. -- Any sufficiently difficult bug is indistinguishable from a feature. ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: large page support for kvm 2008-02-20 14:25 ` Avi Kivity @ 2008-02-22 2:01 ` Marcelo Tosatti 2008-02-22 7:16 ` Avi Kivity 0 siblings, 1 reply; 14+ messages in thread From: Marcelo Tosatti @ 2008-02-22 2:01 UTC (permalink / raw) To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel [-- Attachment #1: Type: text/plain, Size: 3019 bytes --] On Wed, Feb 20, 2008 at 04:25:42PM +0200, Avi Kivity wrote: > Marcelo Tosatti wrote: > > > >>>+ /* > >>>+ * Largepage creation is susceptible to a upper-level > >>>+ * table to be shadowed and write-protected in the > >>>+ * area being mapped. If that is the case, invalidate > >>>+ * the entry and let the instruction fault again > >>>+ * and use 4K mappings. > >>>+ */ > >>>+ if (largepage) { > >>>+ spte = shadow_trap_nonpresent_pte; > >>>+ kvm_x86_ops->tlb_flush(vcpu); > >>>+ goto unshadowed; > >>>+ } > >>> > >>> > >>Would it not repeat exactly the same code path? Or is this just for the > >>case of the pte_update path? > >> > > > >The problem is if the instruction writing to one of the roots can't be > >emulated. > > > >kvm_mmu_unprotect_page() does not know about largepages, so it will zap > >a gfn inside the large page frame, but not the large translation itself. > > > >And zapping the gfn brings the shadowed page count in large area to > >zero, allowing has_wrprotected_page() to succeed. Endless unfixable > >write faults. > > > > > > I don't follow. Can you describe the scenario in more detail? The state > of the guest and shadow page tables, and what actually happens? Have a kernel level-3 table at guest physical address 0x2800000. The kernel direct translation which maps to that address is 0xffff810002800000. The problem will occur if an instruction which can't be emulated attempts to write via 0xffff810002800000 (optionally +4KB), with the 3-level table not yet shadowed. 
What happens then is:

- mmu_page_fault()
- is_largepage_backed() finds no entry in the 0x2800000+2MB shadow page
  count, happily saying it's OK to use a largepage.
- shadow the level-4 entry.
- shadow the level-3 entry (at 0x2800000).
- mmu_set_spte() sets the 2MB translation to read-only, pt_write=1.
- the instruction emulation fails (because it's not supported).
- kvm_mmu_unprotect_page() zaps the level-3 shadow table at 0x2800000.
- repeat.

Thinking this was an issue related to largepages only, I decided never to
have forced-read-only 2MB pages around. But I just noted that the same
issue happens with 4KB pages too, even though the chance of having one of
the roots cached in the area being mapped is much larger with 2MB pages.

I could not reproduce this anymore (though I have logs of it happening on
a stock kernel, attached), but a modified kernel which allocates the
level-3 entries at the proper place and then writes to one of them through
the virtual mapping with an instruction which can't be emulated triggers
the issue (and it is a valid scenario).

That said, it does not seem this problem should be dealt with in this
largepage patch (it can be made comparable to the current problem if we
never attempt to emulate instructions when a forced-read-only large page
is instantiated). Do you see any way to fix it?

> Setting spte to a nonpresent pte seems to violate the rmap, btw; rmap
> always expects a valid pte pointing at the page.
[-- Attachment #2: shadow-hang.txt --] [-- Type: text/plain, Size: 33878 bytes --] Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 1024 (183c000c7) Feb 13 15:37:59 harmony kernel: account_shadowed for 1408 Feb 13 15:37:59 harmony kernel: account_shadowed for 1409 Feb 13 15:37:59 harmony kernel: account_shadowed for 1410 Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr 27bfbc err 9 Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=635 large_gfn=512 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (R) at 512 (1796000c1) Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 512 (1796000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr 277f90 err 3 Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=631 large_gfn=512 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 512 (1796000c3) Feb 13 15:37:59 harmony kernel: account_shadowed for 513 Feb 13 15:37:59 harmony kernel: account_shadowed for 514 Feb 13 15:37:59 harmony kernel: account_shadowed for 518 Feb 13 15:37:59 harmony kernel: account_shadowed for 515 Feb 13 15:37:59 harmony kernel: account_shadowed for 519 Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 580, marking ro Feb 13 15:37:59 harmony kernel: unaccount_shadowed for 1408 Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 581, marking ro Feb 13 15:37:59 harmony kernel: unaccount_shadowed for 1409 Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 582, marking ro Feb 13 15:37:59 harmony kernel: unaccount_shadowed for 1410 Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffffffff80582001 err 3 Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=1410 large_gfn=1024 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 1024 (183c000c3) Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 201, marking ro Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 207, marking ro Feb 13 15:38:54 
harmony kernel:last message repeated 13 times Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 201, marking ro Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 207, marking ro Feb 13 15:38:54 harmony kernel:last message repeated 32 times Feb 13 15:37:59 harmony kernel: account_shadowed for 8 Feb 13 15:37:59 harmony kernel: account_shadowed for 9 Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 207, marking ro Feb 13 15:38:54 harmony kernel:last message repeated 5 times Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001000000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=4096 large_gfn=4096 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 4096 (800000017cc000c3) Feb 13 15:37:59 harmony kernel: mmu_set_spte: found shadow page for 201, marking ro Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001200000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=4608 large_gfn=4608 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 4608 (800000017ce000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001400000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=5120 large_gfn=5120 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 5120 (800000017c8000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001600000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=5632 large_gfn=5632 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 5632 (800000017ca000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001800000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=6144 large_gfn=6144 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 6144 (800000017c4000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001a00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=6656 large_gfn=6656 Feb 13 15:37:59 
harmony kernel: instantiating 2MB PTE (RW) at 6656 (800000017c6000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001c00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=7168 large_gfn=7168 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 7168 (800000017c0000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810001e00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=7680 large_gfn=7680 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 7680 (800000017c2000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002000000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=8192 large_gfn=8192 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 8192 (800000017bc000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002200000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=8704 large_gfn=8704 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 8704 (800000017be000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002400000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=9216 large_gfn=9216 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 9216 (800000017b8000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002600000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=9728 large_gfn=9728 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 9728 (800000017ba000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002800000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=10240 large_gfn=10240 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 10240 (800000017b4000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002a00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=10752 large_gfn=10752 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 10752 
(800000017b6000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002c00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=11264 large_gfn=11264 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 11264 (800000017b0000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810002e00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=11776 large_gfn=11776 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 11776 (800000017b2000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003000000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=12288 large_gfn=12288 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 12288 (800000017ac000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003200000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=12800 large_gfn=12800 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 12800 (800000017ae000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003400000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=13312 large_gfn=13312 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 13312 (800000017a8000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003600000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=13824 large_gfn=13824 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 13824 (800000017aa000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003800000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=14336 large_gfn=14336 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 14336 (800000017a4000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003a00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=14848 large_gfn=14848 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 14848 (800000017a6000c3) Feb 13 15:37:59 
harmony kernel: paging64_page_fault: addr ffff810003c00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=15360 large_gfn=15360 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 15360 (80000001856000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810003e00000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=15872 large_gfn=15872 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 15872 (8000000184e000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810004000000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=16384 large_gfn=16384 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 16384 (80000001842000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810004200000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=16896 large_gfn=16896 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 16896 (8000000183e000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810004400000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=17408 large_gfn=17408 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 17408 (800000017e0000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810004600000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=17920 large_gfn=17920 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 17920 (800000018c2000c3) Feb 13 15:37:59 harmony kernel: paging64_page_fault: addr ffff810004800000 err b Feb 13 15:37:59 harmony kernel: largepage, walker->gfn=18432 large_gfn=18432 Feb 13 15:37:59 harmony kernel: instantiating 2MB PTE (RW) at 18432 (80000001834000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000000000 err 9 Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=4608 large_gfn=4608 Feb 13 15:38:00 harmony kernel: account_shadowed for 4097 Feb 13 15:38:00 harmony kernel: account_shadowed for 4098 Feb 13 15:38:00 harmony 
kernel: instantiating 2MB PTE (RW) at 4608 (800000017ce000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000200000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=5632 large_gfn=5632 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 5632 (800000017ca000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000400018 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=6656 large_gfn=6656 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 6656 (800000017c6000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000600010 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=7680 large_gfn=7680 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 7680 (800000017c2000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000800008 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=8704 large_gfn=8704 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 8704 (800000017be000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000a00000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=9728 large_gfn=9728 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 9728 (800000017ba000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000c00000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=10752 large_gfn=10752 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 10752 (800000017b6000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20000e00000 err 9 Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=11776 large_gfn=11776 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 11776 (800000017b2000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20001000000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=12800 large_gfn=12800 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 12800 
(800000017ae000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20001200018 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=13824 large_gfn=13824 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 13824 (800000017aa000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20001400010 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=14848 large_gfn=14848 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 14848 (800000017a6000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20001600008 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=15872 large_gfn=15872 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 15872 (8000000184e000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20001800000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=16896 large_gfn=16896 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 16896 (8000000183e000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffffe20001a00000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=17920 large_gfn=17920 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 17920 (800000018c2000c3) Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 203, marking ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 207, marking ro Feb 13 15:38:55 harmony kernel:last message repeated 19 times Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffff81007c800000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=509952 large_gfn=509952 Feb 13 15:38:00 harmony kernel: account_shadowed for 10 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 509952 (80000001808000c3) Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 9, marking ro Feb 13 15:38:00 harmony kernel: account_shadowed for 509952 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 9, marking 
ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 8, marking ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for a, marking ro Feb 13 15:38:55 harmony kernel:last message repeated 3 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 10 Feb 13 15:38:00 harmony kernel: account_shadowed for 10 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 207, marking ro Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffff81007ca31000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=510513 large_gfn=510464 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 510464 (80000001832000c3) Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 201, marking ro Feb 13 15:38:00 harmony kernel: account_shadowed for 510513 Feb 13 15:38:00 harmony kernel: account_shadowed for 510514 Feb 13 15:38:00 harmony kernel: account_shadowed for 510515 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7ca31, marking ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7ca32, marking ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7ca32, marking ro Feb 13 15:38:00 harmony kernel: account_shadowed for 510773 Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffff81007c400020 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=508928 large_gfn=508928 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 508928 (80000001820000c3) Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffff810000511103 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=1297 large_gfn=1024 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 1024 (8000000183c000c3) Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7cb35, marking ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for a, marking ro Feb 13 15:38:00 harmony kernel: account_shadowed for 7 Feb 13 15:38:00 harmony 
kernel: mmu_set_spte: found shadow page for 7, marking ro Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 7 Feb 13 15:38:00 harmony kernel: account_shadowed for 7 Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffff81007c000e00 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=507904 large_gfn=507904 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 507904 (80000001802000c3) Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7, marking ro Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 7 Feb 13 15:38:00 harmony kernel: account_shadowed for 7 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7cb35, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 5 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 510773 Feb 13 15:38:00 harmony kernel: account_shadowed for 510773 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7ca32, marking ro Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7cb35, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 510773 Feb 13 15:38:00 harmony kernel: paging64_page_fault: addr ffff81007bd93000 err b Feb 13 15:38:00 harmony kernel: largepage, walker->gfn=507283 large_gfn=506880 Feb 13 15:38:00 harmony kernel: instantiating 2MB PTE (RW) at 506880 (80000001846000c3) Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7ca32, marking ro Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 3 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 
15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 
Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:00 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:00 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:00 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony 
kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02201c err b Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507938 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page 
for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:56 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3 Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:01 harmony kernel: emulation failed (mmio) rip ffffffff8024f1a5 e8 45 fd ff Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3 Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:57 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3 Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found 
shadow page for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:57 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3 Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:57 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3 Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro Feb 13 15:38:57 harmony kernel:last message repeated 2 times Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304 Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3 Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904 Feb 13 15:38:01 harmony kernel: account_shadowed for 508304 Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c000, marking ro Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1) Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 
Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro
Feb 13 15:38:57 harmony kernel: last message repeated 2 times
Feb 13 15:38:01 harmony kernel: unaccount_shadowed for 508304
Feb 13 15:38:01 harmony kernel: paging64_page_fault: addr ffff81007c02bf80 err 3
Feb 13 15:38:01 harmony kernel: largepage, walker->gfn=507947 large_gfn=507904
Feb 13 15:38:01 harmony kernel: account_shadowed for 508304
Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c000, marking ro
Feb 13 15:38:01 harmony kernel: instantiating 2MB PTE (R) at 507904 (80000001802000c1)
Feb 13 15:38:01 harmony kernel: mmu_set_spte: found shadow page for 7c190, marking ro
Feb 13 15:38:57 harmony kernel: last message repeated 2 times

[the same eight-message sequence repeats verbatim for the remainder of the
log, with timestamps advancing from 15:38:01 to 15:38:02]

^ permalink raw reply	[flat|nested] 14+ messages in thread
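The account_shadowed / unaccount_shadowed / is_largepage_backed calls visible in the log above correspond to the per-slot write-protect counting proposed at the top of the thread: one counter per potential 2MB frame, counting the write-protected 4KB pages inside it. A minimal user-space sketch of that accounting follows; the struct layout and the lpage_write_count / lpage_slot names are illustrative, not KVM's actual code:

```c
#include <assert.h>
#include <stdlib.h>

#define LARGE_PAGE_SHIFT 9	/* 512 4KB pages per 2MB frame on x86-64 */

/* Simplified memory slot: one counter per potential 2MB frame in the
 * slot, counting write-protected 4KB pages within that frame. */
struct kvm_memory_slot {
	unsigned long base_gfn;
	unsigned long npages;
	int *lpage_write_count;		/* illustrative field name */
};

/* Locate the counter for the 2MB frame containing gfn. */
static int *lpage_slot(struct kvm_memory_slot *slot, unsigned long gfn)
{
	unsigned long idx = (gfn >> LARGE_PAGE_SHIFT) -
			    (slot->base_gfn >> LARGE_PAGE_SHIFT);
	return &slot->lpage_write_count[idx];
}

/* Called when a guest page table at gfn is shadowed (write-protected). */
static void account_shadowed(struct kvm_memory_slot *slot, unsigned long gfn)
{
	++*lpage_slot(slot, gfn);
}

/* Called when that shadow page is zapped again. */
static void unaccount_shadowed(struct kvm_memory_slot *slot, unsigned long gfn)
{
	--*lpage_slot(slot, gfn);
}

/* A 2MB mapping may only be instantiated if no 4KB page within the
 * frame is write-protected. */
static int is_largepage_backed(struct kvm_memory_slot *slot, unsigned long gfn)
{
	return *lpage_slot(slot, gfn) == 0;
}
```

Note that gfns 508304 and 507904 from the log fall in the same 2MB frame (both shift down to frame index 992), so shadowing the table at 508304 must block the 2MB mapping at 507904.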
* Re: large page support for kvm
  2008-02-22  2:01             ` Marcelo Tosatti
@ 2008-02-22  7:16               ` Avi Kivity
  0 siblings, 0 replies; 14+ messages in thread
From: Avi Kivity @ 2008-02-22 7:16 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
>> I don't follow. Can you describe the scenario in more detail? The
>> state of the guest and shadow page tables, and what actually happens?
>>
>
> Have a kernel level-3 page table at guest physical address 0x2800000.
> The kernel direct mapping which maps to that address is
> 0xffff810002800000.
>
> The problem occurs if an instruction which can't be emulated attempts
> to write via 0xffff810002800000 (optionally +4KB), with the level-3
> table not yet shadowed.
>
> What happens then is:
>
> - mmu_page_fault()
> - is_largepage_backed() finds no entry in the 0x2800000+2MB shadow
>   page count, happily saying it's OK to use a largepage.
> - shadow the level-4 entry
> - shadow the level-3 entry (at 0x2800000)
> - mmu_set_spte() sets the 2MB translation to read-only, pt_write=1
> - the instruction emulation fails (because it's not supported)
> - kvm_mmu_unprotect_page() zaps the level-3 shadow table at 0x2800000
> - repeat
>
> Thinking this was an issue related to largepages only, I decided never
> to have 2MB forced-read-only pages around.
>
> But I just noted that the same issue happens with 4KB pages too, even
> though the chance of having one of the roots cached in the area being
> mapped is much larger with 2MB pages.
>
> I could not reproduce this anymore (though I have logs of it happening
> on a stock kernel, attached), but a modified kernel allocating the
> level-3 entries at the proper place and writing to one of them through
> the virtual mapping with an instruction which can't be emulated
> triggers the issue (and it's a valid scenario).
>
> That said, it does not seem this problem should be dealt with in this
> largepage patch (it can be made comparable to the current problem if
> we never attempt to emulate instructions when a forced-read-only large
> page is instantiated).
>
> Do you see any way to fix that?

A write to a page that is a page table which also maps the instruction
itself or one of its operands must be emulated. The best fix IMO is to
implement the missing instruction in the emulator (I'm surprised we
have such a missing instruction, btw; I thought it was reasonably
complete wrt paging).

There are other potential fixes, like single-stepping the instruction
in the guest and re-shadowing the page, but they are a can of worms and
I don't think we want to go that way.

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.

^ permalink raw reply	[flat|nested] 14+ messages in thread
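The failure mode under discussion is a livelock: the faulting write can never retire, because zapping the shadow page only recreates the same fault on retry. A toy model of the cycle Marcelo lists above, purely illustrative (all names are invented here; emulate_write() stands in for the unsupported instruction):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the livelock: a guest write faults on a read-only
 * shadow mapping, emulation of the instruction fails, so the table is
 * unshadowed and the instruction retried -- which shadows it again and
 * takes the same fault. No state distinguishes one iteration from the
 * next, so no progress is ever made. */
struct vm {
	bool pt_shadowed;	/* is the guest level-3 table shadowed? */
	bool write_done;	/* did the guest write ever complete? */
};

static bool emulate_write(void)
{
	return false;	/* the instruction is not supported by the emulator */
}

/* One pass through the write-fault path; returns true on progress. */
static bool handle_write_fault(struct vm *vm)
{
	if (!vm->pt_shadowed)
		vm->pt_shadowed = true;	/* shadow the table: mapping is r/o */
	if (emulate_write()) {
		vm->write_done = true;
		return true;
	}
	/* kvm_mmu_unprotect_page(): zap the shadow table and retry */
	vm->pt_shadowed = false;
	return false;
}
```

Implementing the missing instruction makes emulate_write() succeed, which is exactly why Avi's suggested fix breaks the cycle.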
end of thread, other threads:[~2008-02-22 7:16 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-29 17:20 large page support for kvm Avi Kivity
[not found] ` <479F604C.20107-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-30 18:40 ` Joerg Roedel
[not found] ` <20080130184035.GS6960-5C7GfCeVMHo@public.gmane.org>
2008-01-31 5:44 ` Avi Kivity
2008-02-11 15:49 ` Marcelo Tosatti
2008-02-12 11:55 ` Avi Kivity
2008-02-13 0:15 ` Marcelo Tosatti
2008-02-13 6:45 ` Avi Kivity
2008-02-14 23:17 ` Marcelo Tosatti
2008-02-15 7:40 ` Roedel, Joerg
2008-02-17 9:38 ` Avi Kivity
2008-02-19 20:37 ` Marcelo Tosatti
2008-02-20 14:25 ` Avi Kivity
2008-02-22 2:01 ` Marcelo Tosatti
2008-02-22 7:16 ` Avi Kivity
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox