public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Paolo Bonzini <pbonzini@redhat.com>
To: Gleb Natapov <gleb@redhat.com>
Cc: kvm@vger.kernel.org,
	Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Yang Zhang <yang.z.zhang@intel.com>
Subject: Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
Date: Mon, 29 Jul 2013 13:55:46 +0200	[thread overview]
Message-ID: <51F65842.30103@redhat.com> (raw)
In-Reply-To: <20130729113354.GF18009@redhat.com>

Il 29/07/2013 13:33, Gleb Natapov ha scritto:
> On Mon, Jul 29, 2013 at 11:48:06AM +0200, Paolo Bonzini wrote:
>> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>
>>> This is the first patch in a series which adds nested EPT support to KVM's
>>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>>> to set its own cr3 and take its own page faults without either of L0 or L1
>>> getting involved. This often significanlty improves L2's performance over the
>>> previous two alternatives (shadow page tables over EPT, and shadow page
>>> tables over shadow page tables).
>>>
>>> This patch adds EPT support to paging_tmpl.h.
>>>
>>> paging_tmpl.h contains the code for reading and writing page tables. The code
>>> for 32-bit and 64-bit tables is very similar, but not identical, so
>>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
>>> with PTTYPE=64, and this generates the two sets of similar functions.
>>>
>>> There are subtle but important differences between the format of EPT tables
>>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>>> third set of functions to read the guest EPT table and to write the shadow
>>> EPT table.
>>>
>>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
>>> with "EPT") which correctly read and write EPT tables.
>>>
>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>> ---
>>>  arch/x86/kvm/mmu.c         |    5 +++++
>>>  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
>>>  2 files changed, 44 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 4c4274d..b5273c3 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>>  	return mmu->last_pte_bitmap & (1 << index);
>>>  }
>>>  
>>> +#define PTTYPE_EPT 18 /* arbitrary */
>>> +#define PTTYPE PTTYPE_EPT
>>> +#include "paging_tmpl.h"
>>> +#undef PTTYPE
>>> +
>>>  #define PTTYPE 64
>>>  #include "paging_tmpl.h"
>>>  #undef PTTYPE
>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>> index 7581395..e38b3c0 100644
>>> --- a/arch/x86/kvm/paging_tmpl.h
>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>> @@ -58,6 +58,21 @@
>>>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
>>>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
>>>  	#define CMPXCHG cmpxchg
>>> +#elif PTTYPE == PTTYPE_EPT
>>> +	#define pt_element_t u64
>>> +	#define guest_walker guest_walkerEPT
>>> +	#define FNAME(name) ept_##name
>>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>>> +	#define PT_GUEST_ACCESSED_MASK 0
>>> +	#define PT_GUEST_DIRTY_MASK 0
>>> +	#define PT_GUEST_DIRTY_SHIFT 0
>>> +	#define PT_GUEST_ACCESSED_SHIFT 0
>>> +	#define CMPXCHG cmpxchg64
>>> +	#define PT_MAX_FULL_LEVELS 4
>>>  #else
>>>  	#error Invalid PTTYPE value
>>>  #endif
>>> @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>>>  
>>>  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>>>  {
>>> +#if PT_GUEST_DIRTY_MASK == 0
>>> +	/* dirty bit is not supported, so no need to track it */
>>> +	return;
>>> +#else
>>>  	unsigned mask;
>>>  
>>>  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
>>> @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>>>  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
>>>  		PT_WRITABLE_MASK;
>>>  	*access &= mask;
>>> +#endif
>>
>> Please put this #if/#else/#endif in the previous patch.  (See also
>> below on leaving out protect_clean_gpte altogether).
>>
> Why? This change does not make much sense before EPT is introduced. The
> previous patch is just a rename that should be easily verifiable by any
> reviewer to be NOP.

My initial impression to this patch was "everything's ready after the
previous patch, you just have to set the mask to 0".  Which is not quite
true.  Maybe you need three patches instead of two.

>>         if (!write_fault)
>>                 protect_clean_gpte(&pte_access, pte);
>>         else
>>                 /*
>>                  * On a write fault, fold the dirty bit into accessed_dirty by
>>                  * shifting it one place right.
>>                  */
>>  		accessed_dirty &= pte >>
>>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
>>
>> should be compiled out (in the previous patch) if dirty bits are not in use.
>> The "then" branch does nothing in that case, and the "else" branch is dead
>> code that makes no sense.
>
> I disagree, there ifdef was there and it was ugly. protect_clean_gpte
> and update_accessed_dirty_bits had to be ifdefed too.

protect_clean_gpte is still #ifdef'ed, so we're talking of a net gain of
2 lines.

Something like this:

+       /* if dirty bit is not supported, no need to track it */
+#if PT_GUEST_DIRTY_MASK == 0
        if (!write_fault)
                 protect_clean_gpte(&pte_access, pte);
...
        if (unlikely(!accessed_dirty)) {
...
        }
+#endif

doesn't look bad at all.  With the old check on EPT it looked ugly, but
with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
 If I see

        if (!write_fault)
                protect_clean_gpte(&pte_access, pte);
        else
                /*
                 * On a write fault, fold the dirty bit into
		 * accessed_dirty by
                 * shifting it one place right.
                 */
                accessed_dirty &=
			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);

        if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {

the obvious reaction is "what, is there a case where I'm using
accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
to leave the otherwise dead assignments to accessed_dirty in the loop
(the compiler removes them anyway); but here the assignment is too close
to the user to not raise such questions.

Perhaps it's obvious to you because you're more familiar with this code.

> Compiler should be
> smart enough to get rid of all of this code when PT_GUEST_DIRTY_MASK is 0.
> Doing it like that was Xiao idea and it looks much nicer. 
> 
>> Once you do this, you can add a
>>
>> 	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_GUEST_ACCESSED_SHIFT);
>>
>> before the shift.
>>
>> Please check if, with these changes, you can avoid defining
>> PT_GUEST_{DIRTY,ACCESSED}_SHIFT altogether in the EPT case.
>> This is safer because you are sure you left no undefined
>> behaviors when a bit is being folded onto another.
> You basically ask me to get back to the patch how it was before I
> addressed Xiao comment and add some more idfefs because previously not
> all places where A/D bits were used were protected by it. IMO this would
> be a step backward especially as the method in this patch series is a
> preparation for A/D support for EPT. When those bits are supported with
> EPT they are different than in regular page tables.

Yes, and if they will put the dirty bit below the accessed bit (rather
than above), you will silently get undefined behavior for a shift by
negative value.  Adding BUILD_BUG_ONs for this, and ensuring
PT_GUEST_{DIRTY,ACCESSED}_SHIFT are never used in the EPT case, is
important in my opinion.

I very much like what you did in the previous patch, allowing
customization of the masks instead of using EPT ifdefs all over the
place.  On the other hand, once you did that the benefits of Xiao's
proposed change pretty much go away.

>> In principle, with these changes you could leave protect_clean_gpte in mmu.c.
> 
> Only if I ifdef all other uses of in in the file.

Yeah, scrap that.

Paolo


  reply	other threads:[~2013-07-29 11:57 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
2013-07-29  8:32   ` Paolo Bonzini
2013-07-29 13:12     ` Gleb Natapov
2013-07-29 14:13       ` Paolo Bonzini
2013-07-25 10:59 ` [PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
2013-07-29  8:36   ` Paolo Bonzini
2013-07-29 10:43     ` Gleb Natapov
2013-07-31  8:02   ` Xiao Guangrong
2013-07-25 10:59 ` [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h Gleb Natapov
2013-07-31  8:02   ` Xiao Guangrong
2013-07-31  8:36     ` Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 05/13] nEPT: make guest's A/D bits depends on guest's paging mode Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
2013-07-29  9:48   ` Paolo Bonzini
2013-07-29 11:33     ` Gleb Natapov
2013-07-29 11:55       ` Paolo Bonzini [this message]
2013-07-29 12:24         ` Gleb Natapov
2013-07-29 13:19           ` Paolo Bonzini
2013-07-29 13:27             ` Gleb Natapov
2013-07-29 14:15               ` Paolo Bonzini
2013-07-29 16:14                 ` Gleb Natapov
2013-07-29 16:28                   ` Paolo Bonzini
2013-07-29 16:43                     ` Gleb Natapov
2013-07-29 17:06                       ` Paolo Bonzini
2013-07-29 17:11                         ` Gleb Natapov
2013-07-30 10:03   ` Paolo Bonzini
2013-07-30 11:56     ` Gleb Natapov
2013-07-30 12:13       ` Paolo Bonzini
2013-07-30 14:22         ` Gleb Natapov
2013-07-30 14:36           ` Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 08/13] nEPT: Nested INVEPT Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support Gleb Natapov
2013-07-29  8:59   ` Paolo Bonzini
2013-07-29 10:52     ` Gleb Natapov
2013-07-29 10:59       ` Paolo Bonzini
2013-07-29 11:43         ` Gleb Natapov
2013-07-29 12:05           ` Paolo Bonzini
2013-07-29 12:34             ` Gleb Natapov
2013-07-29 13:11               ` Paolo Bonzini
2013-07-29 13:20                 ` Gleb Natapov
2013-07-29 14:12                   ` Paolo Bonzini
2013-07-29 16:24                     ` Gleb Natapov
2013-07-29 16:36                       ` Paolo Bonzini
2013-07-29 16:54                         ` Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 10/13] nEPT: MMU context for nested EPT Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 11/13] nEPT: Advertise EPT to L1 Gleb Natapov
2013-07-29  9:21   ` Paolo Bonzini
2013-07-29 11:11     ` Gleb Natapov
2013-07-29 11:33       ` Paolo Bonzini
2013-07-29 11:35         ` Gleb Natapov
2013-07-25 11:00 ` [PATCH v4 12/13] nEPT: Some additional comments Gleb Natapov
2013-07-25 11:00 ` [PATCH v4 13/13] nEPT: Miscelleneous cleanups Gleb Natapov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=51F65842.30103@redhat.com \
    --to=pbonzini@redhat.com \
    --cc=gleb@redhat.com \
    --cc=jun.nakajima@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=xiaoguangrong@linux.vnet.ibm.com \
    --cc=yang.z.zhang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox