From: Avi Kivity <avi@qumranet.com>
To: Marcelo Tosatti <marcelo@kvack.org>
Cc: kvm-devel <kvm-devel@lists.sourceforge.net>
Subject: Re: large page support for kvm
Date: Wed, 13 Feb 2008 08:45:51 +0200 [thread overview]
Message-ID: <47B2921F.1040905@qumranet.com> (raw)
In-Reply-To: <20080213001519.GA32134@dmt>
Marcelo Tosatti wrote:
>>> + /*
>>> + * If its a largepage mapping, there could be a previous
>>> + * pointer to a PTE page hanging there, which will falsely
>>> + * set was_rmapped.
>>> + */
>>> + if (largepage)
>>> + was_rmapped = is_large_pte(*shadow_pte);
>>> +
>>>
>>>
>> But that pte page will have its parent_pte chain pointing to shadow_pte,
>> no? Either this can't happen, or we need to unlink that pte page first.
>>
>
> This can happen, say if you have a large page frame with shadowed pages
> in it, therefore 4k mapped. When all shadowed pages are removed, it will
> be mapped with a 2M page, overwriting the pointer to the (metaphysical)
> PTE page.
>
> So when you say "need to unlink that pte page first", you mean simply
> removing the parent pointer which now points to the 2M PMD entry, right?
> This keeps a zap_mmu_page() on that metaphysical page from nuking the
> (unrelated) 2M PMD entry.
>
Yes. We need to keep that tangle of pointers consistent.
> Nuking the metaphysical page (zap_page) seems overkill and
> unnecessary... it will eventually be recycled, or reused if the large
> mapping is nuked.
>
Yes.
> It doesn't appear to be necessary to flush the TLB whenever a 2MB PMD
> overwrites a PMD-to-PTE-page pointer.
>
Since they're pointing to the same underlying memory, no, though we may
need to transfer the dirty bits from the ptes to the struct page.
> In the worst case other CPUs will use the cached 4k translations for a
> while. If there are permission changes on these translations, the OS is
> responsible for flushing the TLB anyway.
>
>
>>> Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
>>> ===================================================================
>>> --- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
>>> +++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
>>> @@ -71,6 +71,7 @@ struct guest_walker {
>>> unsigned pte_access;
>>> gfn_t gfn;
>>> u32 error_code;
>>> + int largepage_backed;
>>> };
>>>
>>> static gfn_t gpte_to_gfn(pt_element_t gpte)
>>> @@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
>>> */
>>> static int FNAME(walk_addr)(struct guest_walker *walker,
>>> struct kvm_vcpu *vcpu, gva_t addr,
>>> - int write_fault, int user_fault, int fetch_fault)
>>> + int write_fault, int user_fault, int fetch_fault,
>>> + int faulting)
>>> {
>>> pt_element_t pte;
>>> gfn_t table_gfn;
>>> @@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
>>> pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
>>> walk:
>>> walker->level = vcpu->arch.mmu.root_level;
>>> + walker->largepage_backed = 0;
>>> pte = vcpu->arch.cr3;
>>> #if PTTYPE == 64
>>> if (!is_long_mode(vcpu)) {
>>> @@ -192,10 +195,19 @@ walk:
>>> if (walker->level == PT_DIRECTORY_LEVEL
>>> && (pte & PT_PAGE_SIZE_MASK)
>>> && (PTTYPE == 64 || is_pse(vcpu))) {
>>> - walker->gfn = gpte_to_gfn_pde(pte);
>>> + gfn_t gfn = gpte_to_gfn_pde(pte);
>>> + walker->gfn = gfn;
>>> +
>>> walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
>>> if (PTTYPE == 32 && is_cpuid_PSE36())
>>> walker->gfn += pse36_gfn_delta(pte);
>>> +
>>> + if (faulting
>>> + && is_largepage_backed(vcpu, gfn)
>>> + && is_physical_memory(vcpu->kvm, walker->gfn)) {
>>> + walker->largepage_backed = 1;
>>> + walker->gfn = gfn;
>>> + }
>>>
>>>
>> I don't like this bit. So far the walker has been independent of the
>> host state, and only depended on guest data. We can set
>> largepage_backed unconditionally on a large guest pte, leave gfn
>> containing the low-order bits, and leave the host checks for the actual
>> mapping logic.
>>
>>
>>> /*
>>> @@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>>> shadow_ent = ((u64 *)__va(shadow_addr)) + index;
>>> if (level == PT_PAGE_TABLE_LEVEL)
>>> break;
>>> + if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
>>> + break;
>>> +
>>>
>>>
>> (here)
>>
>
> gfn_to_page() needs to grab the struct page corresponding to the large
> page, not the offset struct page for the faulting 4k address within
> the large frame. Since gfn_to_page can sleep, there is no way to do
> that in the mapping logic which happens under mmu_lock protection.
> We don't want to grab the large page frame "struct page" unless the
> is_largepage_backed() checks are successful.
>
> The checks could be done in page_fault() if walker->level == 2, before
> gfn_to_page()... But I don't see much difference of that and doing
> it inside walk_addr(). What do you say?
>
>
I'd like to keep walk_addr() independent of the rest of the mmu (i.e.
walk_addr() is 100% guest oriented). Also, the issue you point out is
shared by direct_map, which doesn't call walk_addr().
An unrelated issue (pointed out by Jun Nakajima) is that this kills
dirty log tracking (needed for migration). It could be solved simply by
not using large page backing if dirty log tracking is enabled for that slot.
--
Any sufficiently difficult bug is indistinguishable from a feature.