From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 18/24] Exiting from L2 to L1
Date: Mon, 14 Jun 2010 15:04:08 +0300
Message-ID: <4C161AB8.4060905@redhat.com>
References: <1276431753-nyh@il.ibm.com>
 <201006131231.o5DCVlKB013102@rice.haifa.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: "Nadav Har'El"
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:17661 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750742Ab0FNMEM (ORCPT ); Mon, 14 Jun 2010 08:04:12 -0400
In-Reply-To: <201006131231.o5DCVlKB013102@rice.haifa.ibm.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 06/13/2010 03:31 PM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; in that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in the next patch.
>
>
> +/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
> + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
> + * function updates it to reflect the state of the registers during the exit,
> + * and to reflect some changes that happened while L2 was running (and perhaps
> + * made some exits which were handled directly by L0 without going back to L1).
> + */
> +void prepare_vmcs_12(struct kvm_vcpu *vcpu)
> +{
> +	struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu);
> +
> +	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
> +	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
> +	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
> +	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
> +	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
> +	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
> +	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
> +	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
> +
> +	vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET);
>

TSC_OFFSET cannot have changed.

> +	vmcs12->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>

Not available without EPT.

> +	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
>

Can this change?

> +	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>

Without msr bitmaps, cannot change.

> +	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
>

Should check for VM_EXIT_SAVE_IA32_PAT, no? Also unneeded without msr
bitmaps and passthrough for this msr.
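Untested sketch of what I mean (vmexit_ctrl being the cached exit controls
in vmcs_config, like the vmentry_ctrl used above):

	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);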
> +	vmcs12->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
>

R/O

> +	vmcs12->vm_entry_intr_info_field =
> +		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>

Autocleared, no need to read.

> +	vmcs12->vm_entry_exception_error_code =
> +		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
> +	vmcs12->vm_entry_instruction_len =
> +		vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
>

R/O

> +	vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
>

We don't want to pass this to the guest?

> +	vmcs12->vm_exit_reason = vmcs_read32(VM_EXIT_REASON);
> +	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +	vmcs12->idt_vectoring_info_field =
> +		vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	vmcs12->idt_vectoring_error_code =
> +		vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>

For the above: if the host handles the exit, we must not clobber the guest
fields; a subsequent guest vmread would see the changed values even though,
from its point of view, a vmexit has not occurred. But no, that can't
happen, since a vmread needs a vmexit to happen first. Still, best to
delay this.

> +	/* If any of the CR0_GUEST_HOST_MASK bits are off, the L2 guest may
> +	 * have changed some cr0 bits without us ever saving them in the shadow
> +	 * vmcs. So we need to save these changes now.
> +	 * In the current code, the only GHM bit which can be off is TS (it
> +	 * will be off when fpu_active and L1 also set it to off).
> +	 */
> +	vmcs12->guest_cr0 = vmcs_readl(GUEST_CR0);
> +
> +	/* But this may not be the guest_cr0 that the L1 guest hypervisor
> +	 * actually thought it was giving its L2 guest. It is possible that
> +	 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0)
> +	 * captured this attempt and instead set just the read shadow. If this
> +	 * is the case, we need to copy these read-shadow bits back to
> +	 * guest_cr0, where L1 believes they already are. Note that we must
> +	 * read the actual CR0_READ_SHADOW (which is what L0 may have changed),
> +	 * not vmcs12->cr0_read_shadow (which L1 defined, and we don't
> +	 * change without being told by L1). Currently, the only bit where
> +	 * this can happen is TS.
> +	 */
> +	if (!(vcpu->arch.cr0_guest_owned_bits & X86_CR0_TS)
> +	    && !(vmcs12->cr0_guest_host_mask & X86_CR0_TS))
> +		vmcs12->guest_cr0 =
> +			(vmcs12->guest_cr0 & ~X86_CR0_TS) |
> +			(vmcs_readl(CR0_READ_SHADOW) & X86_CR0_TS);
> +
> +	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
>

Can't we have the same issue with cr4?

Better to have some helpers to do the common magic, and not encode the
special knowledge about TS into them (make it generic).
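Untested sketch (the helper name is made up, and I'm assuming the cr4
counterparts - vmcs12->cr4_guest_host_mask and
vcpu->arch.cr4_guest_owned_bits - are available just like the cr0 ones):

	/* Bits that L1 let L2 own (clear in vmcs12's mask) but that L0
	 * intercepts anyway (clear in L0's guest-owned bits) are only up
	 * to date in the hardware read shadow; fold them back in.
	 */
	static unsigned long vmcs12_guest_cr(unsigned long vmcs12_cr,
					     unsigned long l1_mask,
					     unsigned long l0_owned,
					     unsigned long read_shadow)
	{
		unsigned long mask = ~l1_mask & ~l0_owned;

		return (vmcs12_cr & ~mask) | (read_shadow & mask);
	}

with the TS-specific code above becoming:

	vmcs12->guest_cr0 = vmcs12_guest_cr(vmcs_readl(GUEST_CR0),
					    vmcs12->cr0_guest_host_mask,
					    vcpu->arch.cr0_guest_owned_bits,
					    vmcs_readl(CR0_READ_SHADOW));
	vmcs12->guest_cr4 = vmcs12_guest_cr(vmcs_readl(GUEST_CR4),
					    vmcs12->cr4_guest_host_mask,
					    vcpu->arch.cr4_guest_owned_bits,
					    vmcs_readl(CR4_READ_SHADOW));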
> +
> +int switch_back_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct shadow_vmcs *src = to_vmx(vcpu)->nested.l1_shadow_vmcs;
> +
> +	if (enable_vpid && src->virtual_processor_id != 0)
> +		vmcs_write16(VIRTUAL_PROCESSOR_ID, src->virtual_processor_id);
>

IIUC vpids are not exposed to the guest yet? So the VPID should not change
between guest and nested guest.

> +
> +	vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);
>

Why change the I/O bitmap?

> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, src->msr_bitmap);
>

Or the msr bitmap? After all, we're switching the entire vmcs?

> +
> +	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, src->virtual_apic_page_addr);
> +
> +	if (vm_need_virtualize_apic_accesses(vcpu->kvm))
> +		vmcs_write64(APIC_ACCESS_ADDR,
> +			     src->apic_access_addr);
> +
> +	if (enable_ept) {
> +		vmcs_write64(EPT_POINTER, src->ept_pointer);
> +		vmcs_write64(GUEST_PDPTR0, src->guest_pdptr0);
> +		vmcs_write64(GUEST_PDPTR1, src->guest_pdptr1);
> +		vmcs_write64(GUEST_PDPTR2, src->guest_pdptr2);
> +		vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
> +	}
>

A kvm_set_cr3(src->host_cr3) should do all that and more, no?

> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, src->pin_based_vm_exec_control);
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, src->cpu_based_vm_exec_control);
> +	vmcs_write32(EXCEPTION_BITMAP, src->exception_bitmap);
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +		     src->page_fault_error_code_mask);
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +		     src->page_fault_error_code_match);
> +	vmcs_write32(VM_EXIT_CONTROLS, src->vm_exit_controls);
> +	vmcs_write32(VM_ENTRY_CONTROLS, src->vm_entry_controls);
>

Why write all these? What could have changed them?

> +
> +	if (cpu_has_secondary_exec_ctrls())
> +		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> +			     src->secondary_vm_exec_control);
> +
> +	load_vmcs_common(src);
> +
> +	load_vmcs_host_state(to_vmx(vcpu)->nested.l1_shadow_vmcs);
> +
> +	return 0;
> +}
> +
> +static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
> +			     bool is_interrupt)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int efer_offset;
> +
> +	if (!vmx->nested.nested_mode) {
> +		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
> +		       __func__);
> +		return 0;
> +	}
> +
> +	sync_cached_regs_to_vmcs(vcpu);
> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "Error mapping shadow vmcs\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	prepare_vmcs_12(vcpu);
> +	if (is_interrupt)
> +		get_shadow_vmcs(vcpu)->vm_exit_reason =
> +			EXIT_REASON_EXTERNAL_INTERRUPT;
> +
> +	vmx->nested.current_l2_page->launched = vmx->launched;
> +	vmx->nested.current_l2_page->cpu = vcpu->cpu;
> +
> +	nested_unmap_current(vcpu);
> +
> +	vmx->vmcs = vmx->nested.l1_vmcs;
> +	vcpu->cpu = vmx->nested.l1_state.cpu;
> +	vmx->launched = vmx->nested.l1_state.launched;
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();
> +
> +	vcpu->arch.efer = vmx->nested.l1_state.efer;
> +	if ((vcpu->arch.efer & EFER_LMA) &&
> +	    !(vcpu->arch.efer & EFER_SCE))
> +		vcpu->arch.efer |= EFER_SCE;
> +
> +	efer_offset = __find_msr_index(vmx, MSR_EFER);
> +	if (update_transition_efer(vmx, efer_offset))
> +		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
> +
> +	/* We're running a regular L1 guest again, so we do the regular KVM
> +	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has
> +	 * (this can be figured out by combining its old guest_cr0 and
> +	 * cr0_read_shadow, using the cr0_guest_host_mask). vmx_set_cr0 might
> +	 * use slightly different bits on the new guest_cr0 it sets, e.g.,
> +	 * add TS when !fpu_active.
> +	 */
> +	vmx_set_cr0(vcpu,
> +		(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
> +		 vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
> +		(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
> +		 vmx->nested.l1_shadow_vmcs->guest_cr0));
>

Helper wanted.
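Untested, and the name is made up, but something like:

	/* The cr0 value L1 believes it is running with: bits L1 owns come
	 * from the hardware guest_cr0, bits we own come from the read
	 * shadow we maintained for L1.
	 */
	static unsigned long guest_visible_cr0(struct shadow_vmcs *vmcs)
	{
		return (vmcs->cr0_read_shadow & vmcs->cr0_guest_host_mask) |
		       (vmcs->guest_cr0 & ~vmcs->cr0_guest_host_mask);
	}

	...
	vmx_set_cr0(vcpu, guest_visible_cr0(vmx->nested.l1_shadow_vmcs));

(and a cr4 variant to go with it.)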
> +
> +	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
>

Again, the kvm_set_crx() versions have more meat.

> +	if (enable_ept) {
> +		vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
> +		vmcs_write32(GUEST_CR3, vmx->nested.l1_shadow_vmcs->guest_cr3);
> +	} else {
> +		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
> +	}
>

kvm_set_cr3() will load the PDPTRs in the EPT case (correctly, in case the
nested guest was able to corrupt the guest's PDPT).

> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "Error mapping shadow vmcs\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	switch_back_vmcs(vcpu);
> +
> +	nested_unmap_current(vcpu);
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP,
> +			   vmx->nested.l1_shadow_vmcs->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP,
> +			   vmx->nested.l1_shadow_vmcs->guest_rip);
> +
> +	vmx->nested.nested_mode = 0;
> +
> +	/* If we did fpu_activate()/fpu_deactivate() during L2's run, we need
> +	 * to apply the same changes also when running L1. We don't need to
> +	 * change cr0 here - we already did this above - just the
> +	 * cr0_guest_host_mask, and exception bitmap.
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		(vmx->nested.l1_shadow_vmcs->exception_bitmap &
> +			~(1u << NM_VECTOR)) |
> +		(vcpu->fpu_active ? 0 : (1u << NM_VECTOR)));
> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> +	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
> +
> +	kvm_mmu_reset_context(vcpu);
> +	kvm_mmu_load(vcpu);
>

kvm_mmu_load() unneeded, usually.

> +
> +	if (unlikely(vmx->fail)) {
> +		vmx->fail = 0;
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +	} else
> +		clear_rflags_cf_zf(vcpu);
> +
> +	return 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,
>

I'm probably missing something about the read/write of various vmcs fields.

-- 
error compiling committee.c: too many arguments to function