From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 18/24] Exiting from L2 to L1
Date: Mon, 14 Jun 2010 15:04:08 +0300
Message-ID: <4C161AB8.4060905@redhat.com>
References: <1276431753-nyh@il.ibm.com>
 <201006131231.o5DCVlKB013102@rice.haifa.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: "Nadav Har'El"
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:17661 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750742Ab0FNMEM (ORCPT ); Mon, 14 Jun 2010 08:04:12 -0400
In-Reply-To: <201006131231.o5DCVlKB013102@rice.haifa.ibm.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 06/13/2010 03:31 PM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; in that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in the next patch.
>
>
> +/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
> + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
> + * function updates it to reflect the state of the registers during the exit,
> + * and to reflect some changes that happened while L2 was running (and perhaps
> + * made some exits which were handled directly by L0 without going back to L1).
> + */
> +void prepare_vmcs_12(struct kvm_vcpu *vcpu)
> +{
> +	struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu);
> +
> +	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
> +	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
> +	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
> +	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
> +	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
> +	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
> +	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
> +	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
> +
> +	vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET);
>

TSC_OFFSET cannot have changed.

> +	vmcs12->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>

Not available without EPT.

> +	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
>

Can this change?

> +	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>

Without msr bitmaps, cannot change.

> +	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
>

Should check for VM_EXIT_SAVE_IA32_PAT, no? Also unneeded without msr
bitmaps and passthrough for this msr.
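Untested sketch of what I mean (vmexit_ctrl being the cached exit controls
in vmcs_config, like the vmentry_ctrl used above):

	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);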
> +	vmcs12->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
>

R/O

> +	vmcs12->vm_entry_intr_info_field =
> +		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>

Autocleared, no need to read.

> +	vmcs12->vm_entry_exception_error_code =
> +		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
> +	vmcs12->vm_entry_instruction_len =
> +		vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
>

R/O

> +	vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
>

We don't want to pass this to the guest?

> +	vmcs12->vm_exit_reason = vmcs_read32(VM_EXIT_REASON);
> +	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +	vmcs12->idt_vectoring_info_field =
> +		vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	vmcs12->idt_vectoring_error_code =
> +		vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>

For the above: if the host handles the exit, we must not clobber the guest
fields; a subsequent guest vmread would see the changed values even though,
from its point of view, a vmexit has not occurred. But no, that can't
happen, since a vmread needs a vmexit to happen first. Still, best to
delay this.

> +	/* If any of the CR0_GUEST_HOST_MASK bits are off, the L2 guest may
> +	 * have changed some cr0 bits without us ever saving them in the shadow
> +	 * vmcs. So we need to save these changes now.
> +	 * In the current code, the only GHM bit which can be off is TS (it
> +	 * will be off when fpu_active and L1 also set it to off).
> +	 */
> +	vmcs12->guest_cr0 = vmcs_readl(GUEST_CR0);
> +
> +	/* But this may not be the guest_cr0 that the L1 guest hypervisor
> +	 * actually thought it was giving its L2 guest. It is possible that
> +	 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0)
> +	 * captured this attempt and instead set just the read shadow. If this
> +	 * is the case, we need to copy these read-shadow bits back to
> +	 * guest_cr0, where L1 believes they already are. Note that we must
> +	 * read the actual CR0_READ_SHADOW (which is what L0 may have changed),
> +	 * not vmcs12->cr0_read_shadow (which L1 defined, and we don't
> +	 * change without being told by L1). Currently, the only bit where
> +	 * this can happen is TS.
> +	 */
> +	if (!(vcpu->arch.cr0_guest_owned_bits & X86_CR0_TS)
> +	    && !(vmcs12->cr0_guest_host_mask & X86_CR0_TS))
> +		vmcs12->guest_cr0 =
> +			(vmcs12->guest_cr0 & ~X86_CR0_TS) |
> +			(vmcs_readl(CR0_READ_SHADOW) & X86_CR0_TS);
> +
> +	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
>

Can't we have the same issue with cr4?

Better to have some helpers to do the common magic, and not encode the
special knowledge about TS into them (make it generic).
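Untested sketch (the helper name is made up, and I'm assuming the cr4
counterparts - vmcs12->cr4_guest_host_mask and
vcpu->arch.cr4_guest_owned_bits - are available just like the cr0 ones):

	/* Bits that L1 let L2 own (clear in vmcs12's mask) but that L0
	 * intercepts anyway (clear in L0's guest-owned bits) are only up
	 * to date in the hardware read shadow; fold them back in.
	 */
	static unsigned long vmcs12_guest_cr(unsigned long vmcs12_cr,
					     unsigned long l1_mask,
					     unsigned long l0_owned,
					     unsigned long read_shadow)
	{
		unsigned long mask = ~l1_mask & ~l0_owned;

		return (vmcs12_cr & ~mask) | (read_shadow & mask);
	}

with the TS-specific code above becoming:

	vmcs12->guest_cr0 = vmcs12_guest_cr(vmcs_readl(GUEST_CR0),
					    vmcs12->cr0_guest_host_mask,
					    vcpu->arch.cr0_guest_owned_bits,
					    vmcs_readl(CR0_READ_SHADOW));
	vmcs12->guest_cr4 = vmcs12_guest_cr(vmcs_readl(GUEST_CR4),
					    vmcs12->cr4_guest_host_mask,
					    vcpu->arch.cr4_guest_owned_bits,
					    vmcs_readl(CR4_READ_SHADOW));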
> +
> +int switch_back_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct shadow_vmcs *src = to_vmx(vcpu)->nested.l1_shadow_vmcs;
> +
> +	if (enable_vpid && src->virtual_processor_id != 0)
> +		vmcs_write16(VIRTUAL_PROCESSOR_ID, src->virtual_processor_id);
>

IIUC vpids are not exposed to the guest yet? So the VPID should not change
between guest and nested guest.

> +
> +	vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);
>

Why change the I/O bitmap?

> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, src->msr_bitmap);
>

Or the msr bitmap? After all, we're switching the entire vmcs?

> +
> +	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, src->virtual_apic_page_addr);
> +
> +	if (vm_need_virtualize_apic_accesses(vcpu->kvm))
> +		vmcs_write64(APIC_ACCESS_ADDR,
> +			     src->apic_access_addr);
> +
> +	if (enable_ept) {
> +		vmcs_write64(EPT_POINTER, src->ept_pointer);
> +		vmcs_write64(GUEST_PDPTR0, src->guest_pdptr0);
> +		vmcs_write64(GUEST_PDPTR1, src->guest_pdptr1);
> +		vmcs_write64(GUEST_PDPTR2, src->guest_pdptr2);
> +		vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
> +	}
>

A kvm_set_cr3(src->host_cr3) should do all that and more, no?

> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, src->pin_based_vm_exec_control);
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, src->cpu_based_vm_exec_control);
> +	vmcs_write32(EXCEPTION_BITMAP, src->exception_bitmap);
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +		     src->page_fault_error_code_mask);
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +		     src->page_fault_error_code_match);
> +	vmcs_write32(VM_EXIT_CONTROLS, src->vm_exit_controls);
> +	vmcs_write32(VM_ENTRY_CONTROLS, src->vm_entry_controls);
>

Why write all these? What could have changed them?

> +
> +	if (cpu_has_secondary_exec_ctrls())
> +		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> +			     src->secondary_vm_exec_control);
> +
> +	load_vmcs_common(src);
> +
> +	load_vmcs_host_state(to_vmx(vcpu)->nested.l1_shadow_vmcs);
> +
> +	return 0;
> +}
> +
> +static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
> +			     bool is_interrupt)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int efer_offset;
> +
> +	if (!vmx->nested.nested_mode) {
> +		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
> +		       __func__);
> +		return 0;
> +	}
> +
> +	sync_cached_regs_to_vmcs(vcpu);
> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "Error mapping shadow vmcs\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	prepare_vmcs_12(vcpu);
> +	if (is_interrupt)
> +		get_shadow_vmcs(vcpu)->vm_exit_reason =
> +			EXIT_REASON_EXTERNAL_INTERRUPT;
> +
> +	vmx->nested.current_l2_page->launched = vmx->launched;
> +	vmx->nested.current_l2_page->cpu = vcpu->cpu;
> +
> +	nested_unmap_current(vcpu);
> +
> +	vmx->vmcs = vmx->nested.l1_vmcs;
> +	vcpu->cpu = vmx->nested.l1_state.cpu;
> +	vmx->launched = vmx->nested.l1_state.launched;
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();
> +
> +	vcpu->arch.efer = vmx->nested.l1_state.efer;
> +	if ((vcpu->arch.efer & EFER_LMA) &&
> +	    !(vcpu->arch.efer & EFER_SCE))
> +		vcpu->arch.efer |= EFER_SCE;
> +
> +	efer_offset = __find_msr_index(vmx, MSR_EFER);
> +	if (update_transition_efer(vmx, efer_offset))
> +		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
> +
> +	/* We're running a regular L1 guest again, so we do the regular KVM
> +	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has
> +	 * (this can be figured out by combining its old guest_cr0 and
> +	 * cr0_read_shadow, using the cr0_guest_host_mask). vmx_set_cr0 might
> +	 * use slightly different bits on the new guest_cr0 it sets, e.g.,
> +	 * add TS when !fpu_active.
> +	 */
> +	vmx_set_cr0(vcpu,
> +		(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
> +		 vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
> +		(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
> +		 vmx->nested.l1_shadow_vmcs->guest_cr0));
>

Helper wanted.
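Untested, and the name is made up, but something like:

	/* The cr0 value L1 believes it is running with: bits L1 owns come
	 * from the hardware guest_cr0, bits we own come from the read
	 * shadow we maintained for L1.
	 */
	static unsigned long guest_visible_cr0(struct shadow_vmcs *vmcs)
	{
		return (vmcs->cr0_read_shadow & vmcs->cr0_guest_host_mask) |
		       (vmcs->guest_cr0 & ~vmcs->cr0_guest_host_mask);
	}

	...
	vmx_set_cr0(vcpu, guest_visible_cr0(vmx->nested.l1_shadow_vmcs));

(and a cr4 variant to go with it.)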
> +
> +	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
>

Again, the kvm_set_crx() versions have more meat.

> +	if (enable_ept) {
> +		vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
> +		vmcs_write32(GUEST_CR3, vmx->nested.l1_shadow_vmcs->guest_cr3);
> +	} else {
> +		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
> +	}
>

kvm_set_cr3() will load the PDPTRs in the EPT case (correctly, in case the
nested guest was able to corrupt the guest's PDPT).

> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "Error mapping shadow vmcs\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	switch_back_vmcs(vcpu);
> +
> +	nested_unmap_current(vcpu);
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP,
> +			   vmx->nested.l1_shadow_vmcs->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP,
> +			   vmx->nested.l1_shadow_vmcs->guest_rip);
> +
> +	vmx->nested.nested_mode = 0;
> +
> +	/* If we did fpu_activate()/fpu_deactivate() during L2's run, we need
> +	 * to apply the same changes also when running L1. We don't need to
> +	 * change cr0 here - we already did this above - just the
> +	 * cr0_guest_host_mask, and exception bitmap.
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		(vmx->nested.l1_shadow_vmcs->exception_bitmap &
> +			~(1u << NM_VECTOR)) |
> +		(vcpu->fpu_active ? 0 : (1u << NM_VECTOR)));
> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> +	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
> +
> +	kvm_mmu_reset_context(vcpu);
> +	kvm_mmu_load(vcpu);
>

kvm_mmu_load() unneeded, usually.

> +
> +	if (unlikely(vmx->fail)) {
> +		vmx->fail = 0;
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +	} else
> +		clear_rflags_cf_zf(vcpu);
> +
> +	return 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,
>

I'm probably missing something about the read/write of various vmcs fields.

-- 
error compiling committee.c: too many arguments to function