From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi@redhat.com>
Subject: Re: [PATCH 18/24] Exiting from L2 to L1
Date: Sun, 12 Sep 2010 16:29:54 +0200
Message-ID: <4C8CE3E2.7060708@redhat.com>
References: <1276431753-nyh@il.ibm.com> <201006131231.o5DCVlKB013102@rice.haifa.ibm.com> <4C161AB8.4060905@redhat.com> <20100912140530.GA26346@fermat.math.technion.ac.il>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: "Nadav Har'El" <nyh@math.technion.ac.il>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:57995 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752994Ab0ILOaC (ORCPT <rfc822;kvm@vger.kernel.org>);
	Sun, 12 Sep 2010 10:30:02 -0400
In-Reply-To: <20100912140530.GA26346@fermat.math.technion.ac.il>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

  On 09/12/2010 04:05 PM, Nadav Har'El wrote:
> Hi,
>
> Continuing to work on the nested VMX patches,

Great.  Hopefully you can commit some time to it or you'll spend a lot 
of cycles just catching up.

Joerg just merged nested npt; as far as I can tell it is 100% in line 
with nested ept, but please take a look as well.

>
>>> +	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>>>
>> Without msr bitmaps, cannot change.
> I added a TODO before this (and a couple of others) for future optimization.
> I'm not even convinced how much quicker it is to check the MSR bitmap before
> doing vmcs_read64, vs just to going ahead and vmreading it in any case.

IIRC we don't support msr bitmaps now, so no need to check anything.

In general, avoid vmcs reads as much as possible.  Just think of your 
code running in a guest!

>>> +    vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
>> Can this change?
> Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state
> field, but it is listed as being for "future expansion". I guess that with
> current hardware, it cannot change, but for future hardware it might. I'm
> not sure if it's wiser to ignore this field for now (and shave a bit off
> the l2->l1 switch time), or just copy it anyway, as I do now.
> What would you prefer?

If it changes in the future, it can only be under the influence of some 
control or at least guest-discoverable capability.  Since we don't 
expose such control or capability, there's no need to copy it.

>>> +	vmcs12->vm_entry_intr_info_field =
>>> +		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>>>
>> Autocleared, no need to read.
> Well, we need to clear the "valid bit" on exit, so we don't mistakenly inject
> the same interrupt twice.

I don't think so.  We write this field as part of guest entry (that is, 
after the decision re which vmcs to use, yes?), so guest entry will 
always follow writing this field.  Since guest entry clears the field, 
reading it after an exit will necessarily return 0.

What can happen is that the contents of the field is transferred to the 
IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.

(question: on a failed vmentry, is this field cleared?)

> There were two ways to do it: 1. clear it ourselves,
> or 2. copy the value from vmcs02 where the processor already cleared it.
> There are pros and cons for each approach, but I'll change like you suggest,
> to clear it ourselves:
>
> 	vmcs12->vm_entry_intr_info_field&= ~INTR_INFO_VALID_MASK;

That's really a temporary variable, I don't think you need to touch it.

>>> +	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
>>> +	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
>>> +	vmcs12->vm_exit_intr_error_code =
>>> vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
>>> +	vmcs12->idt_vectoring_info_field =
>>> +		vmcs_read32(IDT_VECTORING_INFO_FIELD);
>>> +	vmcs12->idt_vectoring_error_code =
>>> +		vmcs_read32(IDT_VECTORING_ERROR_CODE);
>>> +	vmcs12->vm_exit_instruction_len =
>>> vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>>> +	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>>>
>> For the above, if the host handles the exit, we must not clobber guest
>> fields.  A subsequent guest vmread will see the changed values even
>> though from its point of view a vmexit has not occurred.
>>
>> But no, that can't happen, since a vmread needs to have a vmexit first
>> to happen.  Still, best to delay this.+
> All this code is in prepare_vmcs_12, which only gets called if we exit from
> L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host
> handles the exit).

Ok.  That answers my concern.

> As long as a certain field gets written to on *every* exit, and not just
> on some of them, I believe we can safely copy the current values in vmcs02
> to vmcs12, knowing these are current values from the current exit, and not
> some old values we shouldn't be copying.
>
> You may have a point (if that was your point?) that some fields are not
> always written to - e.g., for most exits vm_exit_intr_info doesn't get
> written to and just one "valid" bit is cleared. As the code is now, we copy
> vmcs02's field, which might have been written earlier (e.g., during
> an exit to L0) and not now, and an observant L1 might notice this value
> it should not have seen. However, I don't see any problem with that, because
> the "valid" bit would be correctly turned off, and the spec says that all
> other bits are undefined (for irrelevant exits) and copying-old-values is one
> legal setting for undefined bits...

No, I wasn't worried about that, simply misunderstood the code.

>>> +	/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may
>>> +	 * have changed some cr0 bits without us ever saving them in the
>>> shadow
>>> +	 * vmcs. So we need to save these changes now.
>> ...
>>> +
>>> +	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
>> Can't we have the same issue with cr4?
> I guess we can. I didn't think this (giving guest full control over cr4 bits)
> was happening in KVM, but maybe I was wrong, or maybe this will happen in the
> future, so no reason not to do it for cr4 as well. So I'll do it for cr4 as
> well.

We give the guest partial control over cr4:

     #define KVM_CR4_GUEST_OWNED_BITS                                   \
              (X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | 
X86_CR4_OSFXSR      \
               | X86_CR4_OSXMMEXCPT)

Plus PGE if EPT is enabled.

>> Better to have some helpers to do the common magic, and not encode the
>> special knowledge about TS into it (make it generic).
> I thought that since in current KVM code the only cr0_guest_owned_bits bit
> that could possibly be turned on was TS, then I should only deal with that
> bit. But you're right, no reason not to make it more general, to look for
> any bits which L0 traps but L1 didn't think it was trapping. As in:
>
> 	long bits;
> 	bits = vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> 	vmcs12->guest_cr0 = (vmcs_readl(GUEST_CR0)&  bits) |
> 				(vmcs_readl(CR0_READ_SHADOW)&  ~bits);
>
> (the "bits" lists all the bits which are already fine in guest_cr0, i.e.,
> either guest_owned_bit (so L2 wrote to it directly) or guest_host_mask
> (so L1 didn't expect them to be updated in guest_cr0 anyway). All other
> bits need to be copied from the read_shadow).
>
> I don't know how to put this into a helper function, because these two
> statements have so many dependencies on the word "cr0" that making one
> that would work for either cr0 or cr4 seems too difficult to be worth the
> trouble.

I didn't mean register independent helper; one function for cr0 and one 
function for cr4 so the reader can just see the name and pretend to 
understand what it does, instead of seeing a bunch of incomprehensible 
bitwise operations.

(well, reading what I wrote, maybe I did mean a cr independent helper, 
but don't do it if it results in more complication)

-- 
error compiling committee.c: too many arguments to function