Re: large page support for kvm

From: Avi Kivity <avi@qumranet.com>
To: Marcelo Tosatti <marcelo@kvack.org>
Cc: kvm-devel <kvm-devel@lists.sourceforge.net>
Subject: Re: large page support for kvm
Date: Sun, 17 Feb 2008 11:38:51 +0200
Message-ID: <47B800AB.20905@qumranet.com>
In-Reply-To: <20080214231739.GA7787@dmt>

Marcelo Tosatti wrote:
> On Wed, Feb 13, 2008 at 08:45:51AM +0200, Avi Kivity wrote:
>   
>>> gfn_to_page() needs to grab the struct page corresponding to the large
>>> page, not the offset struct page for the faulting 4k address within
>>> the large frame. Since gfn_to_page can sleep, there is no way to do
>>> that in the mapping logic which happens under mmu_lock protection.
>>> We don't want to grab the large page frame "struct page" unless the
>>> is_largepage_backed() checks are successful.
>>>
>>> The checks could be done in page_fault() if walker->level == 2, before
>>> gfn_to_page()... But I don't see much difference of that and doing 
>>> it inside walk_addr(). What do you say?
>>>
>>>  
>>>       
>> I'd like to keep walk_addr() independent of the rest of the mmu (i.e. 
>> walk_addr is 100% guest oriented). Also, the issue you point out is 
>> shared by direct_map which doesn't call walk_addr().
>>
>> An unrelated issue (pointed out by Jun Nakajima) is that this kills 
>> dirty log tracking (needed for migration). It could be solved simply by 
>> not using large page backing if dirty log tracking is enabled for that slot.
>>     
>
> Ok, fixed your comments and a bug which a root page was shadowed in the
> large area being mapped. access.flat is happy.
>
> Joerg, can you give this a try on a NPT-enabled system (need the
> attached qemu-largepage-hack.patch).
>
> Thanks
>
> Index: kvm.largepages/arch/x86/kvm/mmu.c
> ===================================================================
> --- kvm.largepages.orig/arch/x86/kvm/mmu.c
> +++ kvm.largepages/arch/x86/kvm/mmu.c
> @@ -27,6 +27,7 @@
>  #include <linux/highmem.h>
>  #include <linux/module.h>
>  #include <linux/swap.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/page.h>
>  #include <asm/cmpxchg.h>
> @@ -211,6 +212,11 @@ static int is_shadow_present_pte(u64 pte
>  		&& pte != shadow_notrap_nonpresent_pte;
>  }
>  
> +static int is_large_pte(u64 pte)
> +{
> +	return pte & PT_PAGE_SIZE_MASK;
> +}
> +
>  static int is_writeble_pte(unsigned long pte)
>  {
>  	return pte & PT_WRITABLE_MASK;
> @@ -350,17 +356,120 @@ static void mmu_free_rmap_desc(struct kv
>  	kfree(rd);
>  }
>  
> +static int hpage_align_diff(unsigned long gfn)
> +{
> +	return ((gfn+KVM_PAGES_PER_HPAGE-1) & ~(KVM_PAGES_PER_HPAGE-1)) - gfn;
> +}
> +
> +/*
> + * Return the pointer to the largepage write count for a given
> + * gfn, handling slots that are not large page aligned.
> + */
> +static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
> +{
> +	unsigned long idx;
> +
> +	idx = (gfn - slot->base_gfn) + hpage_align_diff(slot->base_gfn);
> +	idx /= KVM_PAGES_PER_HPAGE;
> +	return &slot->lpage_info[idx].write_count;
> +}
>   

Can be further simplified to (gfn / KVM_PAGES_PER_HPAGE) - 
(slot->base_gfn / KVM_PAGES_PER_HPAGE).  Sorry for not noticing earlier.
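
Roughly, as a standalone C sketch (userspace arithmetic only, not the kernel code; lpage_idx is a made-up name, and I'm assuming 4K base pages with 2MB large pages, so KVM_PAGES_PER_HPAGE is 512):

```c
#define KVM_PAGES_PER_HPAGE 512UL	/* assumption: 2MB large page / 4K page */

/*
 * Index of the large page containing gfn, counted from the large page
 * that contains the first gfn of the slot.  Integer division rounds
 * down, so an unaligned base_gfn needs no separate adjustment.
 */
static unsigned long lpage_idx(unsigned long gfn, unsigned long base_gfn)
{
	return (gfn / KVM_PAGES_PER_HPAGE) - (base_gfn / KVM_PAGES_PER_HPAGE);
}
```

For a slot starting at gfn 768 (3MB), gfns 768..1023 get index 0 and gfn 1024 gets index 1.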

> +
> +static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
> +{
> +	struct kvm_memory_slot *slot;
> +
> +	if (has_wrprotected_page(vcpu->kvm, large_gfn))
> +		return 0;
> +
> +	if (!host_largepage_backed(vcpu->kvm, large_gfn))
> +		return 0;
> +
> +	slot = gfn_to_memslot(vcpu->kvm, large_gfn);
> +	if (slot && slot->dirty_bitmap)
> +		return 0;
> +
> +	/* guest has 4M pages, host 2M */
> +	if (!is_pae(vcpu) && HPAGE_SHIFT == 21)
> +		return 0;
>   

Is this check necessary?  I think that if we remove it things will just 
work.  A 4MB page will have either one or two 2MB sptes (which may 
even belong to different slots).
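
To illustrate the arithmetic (a standalone sketch, not the mmu code; hpage_base, half_of_4m and the 4K/2MB/4MB sizes are my assumptions here): each 2MB half of the guest's 4MB page is just the faulting gfn rounded down to a 2MB boundary, so the two halves can be mapped, or not, independently.

```c
#define KVM_PAGES_PER_HPAGE 512UL	/* assumption: host 2MB large page / 4K page */
#define GUEST_PAGES_PER_4M  1024UL	/* a non-PAE guest 4MB page in 4K units */

/* First gfn of the host-sized (2MB) large frame containing gfn. */
static unsigned long hpage_base(unsigned long gfn)
{
	return gfn & ~(KVM_PAGES_PER_HPAGE - 1);
}

/* Which 2MB half (0 or 1) of its 4MB guest page does gfn fall in? */
static unsigned long half_of_4m(unsigned long gfn)
{
	return (gfn % GUEST_PAGES_PER_4M) / KVM_PAGES_PER_HPAGE;
}
```

A 4MB guest page at gfn 1024 covers the 2MB frames at gfn 1024 and gfn 1536; a fault at gfn 1700 would only touch the second one.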

> @@ -894,12 +1030,28 @@ struct page *gva_to_page(struct kvm_vcpu
>  static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
>  			 unsigned pt_access, unsigned pte_access,
>  			 int user_fault, int write_fault, int dirty,
> -			 int *ptwrite, gfn_t gfn, struct page *page)
> +			 int *ptwrite, int largepage, gfn_t gfn,
> +			 struct page *page)
>  {
>  	u64 spte;
>  	int was_rmapped = is_rmap_pte(*shadow_pte);
>  	int was_writeble = is_writeble_pte(*shadow_pte);
>  
> +	/*
> + 	 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
> + 	 * the parent of the now unreachable PTE.
> + 	 */
> +	if (largepage) {
> +		if (was_rmapped && !is_large_pte(*shadow_pte)) {
> +			struct kvm_mmu_page *child;
> +			u64 pte = *shadow_pte;
> +
> +			child = page_header(pte & PT64_BASE_ADDR_MASK);
> +			mmu_page_remove_parent_pte(child, shadow_pte);
> +           	}
> +		was_rmapped = is_large_pte(*shadow_pte);
> +	}
> +
>  	pgprintk("%s: spte %llx access %x write_fault %d"
>  		 " user_fault %d gfn %lx\n",
>  		 __FUNCTION__, *shadow_pte, pt_access,
> @@ -919,6 +1071,8 @@ static void mmu_set_spte(struct kvm_vcpu
>  	spte |= PT_PRESENT_MASK;
>  	if (pte_access & ACC_USER_MASK)
>  		spte |= PT_USER_MASK;
> +	if (largepage)
> +		spte |= PT_PAGE_SIZE_MASK;
>  
>  	spte |= page_to_phys(page);
>  
> @@ -933,7 +1087,8 @@ static void mmu_set_spte(struct kvm_vcpu
>  		}
>  
>  		shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
> -		if (shadow) {
> +		if (shadow ||
> +		   (largepage && has_wrprotected_page(vcpu->kvm, gfn))) {
>  			pgprintk("%s: found shadow page for %lx, marking ro\n",
>  				 __FUNCTION__, gfn);
>  			pte_access &= ~ACC_WRITE_MASK;
> @@ -941,6 +1096,18 @@ static void mmu_set_spte(struct kvm_vcpu
>  				spte &= ~PT_WRITABLE_MASK;
>  				kvm_x86_ops->tlb_flush(vcpu);
>  			}
> +			/*
> + 			 * Largepage creation is susceptible to a upper-level
> + 			 * table to be shadowed and write-protected in the
> + 			 * area being mapped. If that is the case, invalidate
> + 			 * the entry and let the instruction fault again
> + 			 * and use 4K mappings.
> + 			 */
> +			if (largepage) {
> +				spte = shadow_trap_nonpresent_pte;
> +				kvm_x86_ops->tlb_flush(vcpu);
> +				goto unshadowed;
> +			}
>   

Would it not repeat exactly the same code path?  Or is this just for the 
case of the pte_update path?

> -	page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
> +	if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))
> +	    && is_physical_memory(vcpu->kvm, gfn)) {
> +		gfn &= ~(KVM_PAGES_PER_HPAGE-1);
> +		largepage = 1;
> +	}
>   

Doesn't is_largepage_backed() imply is_physical_memory?

>  
> Index: kvm.largepages/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm.largepages.orig/arch/x86/kvm/x86.c
> +++ kvm.largepages/arch/x86/kvm/x86.c
> @@ -86,6 +86,7 @@ struct kvm_stats_debugfs_item debugfs_en
>  	{ "mmu_recycled", VM_STAT(mmu_recycled) },
>  	{ "mmu_cache_miss", VM_STAT(mmu_cache_miss) },
>  	{ "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
> +	{ "lpages", VM_STAT(lpages) },
>  	{ NULL }
>  };
>   

s/lpages/largepages/, this is user visible.

> +		new.lpage_info = vmalloc(largepages * sizeof(*new.lpage_info));
> +
> +		if (!new.lpage_info)
> +			goto out_free;
> +
> +		memset(new.lpage_info, 0, largepages * sizeof(*new.lpage_info));
> +		/* large page crosses memslot boundary */
> +		if (npages % KVM_PAGES_PER_HPAGE) {
> +			new.lpage_info[0].write_count = 1;
>   

This seems wrong.  Say a 3MB slot at 1GB: you kill the first largepage, 
which is a good one (it lies fully inside the slot).

> +			new.lpage_info[largepages-1].write_count = 1;
>   

OTOH, with a 3MB slot at 3MB, the last largepage is fine.  The check needs 
to be against base_gfn and base_gfn + npages, not the number of pages.
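
Something along these lines, as a standalone sketch (plain C, not the actual slot setup code; the struct and function names are invented for the example, and I'm assuming 512 4K pages per large page):

```c
#define KVM_PAGES_PER_HPAGE 512UL	/* assumption: 2MB large page / 4K page */

/* Hypothetical stand-in for the first/last lpage_info[].write_count. */
struct lpage_edges {
	unsigned long largepages;	/* large frames touched by the slot */
	int first_partial;		/* slot starts inside a large frame */
	int last_partial;		/* slot ends inside a large frame */
};

static struct lpage_edges slot_lpage_edges(unsigned long base_gfn,
					   unsigned long npages)
{
	struct lpage_edges e;
	unsigned long end_gfn = base_gfn + npages;	/* exclusive; npages > 0 */

	e.largepages = (end_gfn - 1) / KVM_PAGES_PER_HPAGE
		       - base_gfn / KVM_PAGES_PER_HPAGE + 1;
	/* Test both slot edges for alignment, not the slot length. */
	e.first_partial = (base_gfn % KVM_PAGES_PER_HPAGE) != 0;
	e.last_partial = (end_gfn % KVM_PAGES_PER_HPAGE) != 0;
	return e;
}
```

With a 3MB slot at 1GB only the last large frame is flagged; with a 3MB slot at 3MB only the first one is.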

> +		}
>   



> +	}
>  
>  	/* Allocate page dirty bitmap if needed */
>  	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
> @@ -444,7 +464,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, 
>  }
>  EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
>  
> -static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
> +unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
>  {
>  	struct kvm_memory_slot *slot;
>  
> @@ -454,6 +474,7 @@ static unsigned long gfn_to_hva(struct k
>  		return bad_hva();
>  	return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE);
>  }
> +EXPORT_SYMBOL(gfn_to_hva);
>  
>  /*
>   * Requires current->mm->mmap_sem to be held
>   


-- 
Any sufficiently difficult bug is indistinguishable from a feature.



Thread overview: 14+ messages
2008-01-29 17:20 large page support for kvm Avi Kivity
     [not found] ` <479F604C.20107-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-01-30 18:40   ` Joerg Roedel
     [not found]     ` <20080130184035.GS6960-5C7GfCeVMHo@public.gmane.org>
2008-01-31  5:44       ` Avi Kivity
2008-02-11 15:49         ` Marcelo Tosatti
2008-02-12 11:55           ` Avi Kivity
2008-02-13  0:15             ` Marcelo Tosatti
2008-02-13  6:45               ` Avi Kivity
2008-02-14 23:17                 ` Marcelo Tosatti
2008-02-15  7:40                   ` Roedel, Joerg
2008-02-17  9:38                   ` Avi Kivity [this message]
2008-02-19 20:37                     ` Marcelo Tosatti
2008-02-20 14:25                       ` Avi Kivity
2008-02-22  2:01                         ` Marcelo Tosatti
2008-02-22  7:16                           ` Avi Kivity
