Re: [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Avi Kivity <avi@redhat.com>
To: "Nadav Har'El" <nyh@il.ibm.com>
Cc: kvm@vger.kernel.org, gleb@redhat.com
Subject: Re: [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME
Date: Sun, 17 Oct 2010 17:06:20 +0200	[thread overview]
Message-ID: <4CBB10EC.8090103@redhat.com> (raw)
In-Reply-To: <201010171012.o9HACH9D029503@rice.haifa.ibm.com>

  On 10/17/2010 12:12 PM, Nadav Har'El wrote:
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |  221 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 218 insertions(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
> @@ -281,6 +281,9 @@ struct __packed vmcs12 {
>   	struct vmcs_fields fields;
>
>   	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
> +
> +	int cpu;

Why is this needed?

> +	int launched;

Doesn't it duplicate launch_state?

If I asked these before, it may indicate a comment is needed.

>   };
>
>   /*
> @@ -315,6 +318,23 @@ struct nested_vmx {
>   	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
>   	struct list_head vmcs02_list; /* a vmcs_list */
>   	int vmcs02_num;
> +
> +	/* Are we running a nested guest now */
> +	bool nested_mode;

TODO: live migration for this state.

> +	/* Level 1 state for switching to level 2 and back */
> +	struct  {
> +		u64 efer;

Redundant? LMA/LME set but IA32E_MODE_GUEST, the other bits unchanged by 
transition.

> +		unsigned long cr3;
> +		unsigned long cr4;

Redundant with L1's HOST_CRx?

> +		u64 io_bitmap_a;
> +		u64 io_bitmap_b;

Unneeded? Should not ever change.

> +		u64 msr_bitmap;

Update using setup_msrs().

> +		int cpu;
> +		int launched;

Hmm.

> +	} l1_state;
> +	/* Saving the VMCS that we used for running L1 */
> +	struct vmcs *vmcs01;
> +	struct vmcs_fields *vmcs01_fields;

vmcs01_fields unneeded if we can restructure according to my comments to 
the previous patch.

>   };
>
>   struct vcpu_vmx {
> @@ -1349,6 +1369,16 @@ static void vmx_vcpu_load(struct kvm_vcp
>
>   		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>   		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
> +
> +		if (vmx->nested.vmcs01_fields != NULL) {
> +			struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
> +			vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +			vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +			vmcs01->host_ia32_sysenter_esp =
> +				vmcs_readl(HOST_IA32_SYSENTER_ESP);
> +			if (vmx->nested.nested_mode)
> +				load_vmcs_host_state(vmcs01);
> +		}
>   	}
>   }

Instead, you can call a subset of vmx_vcpu_load() which updates these 
fields on nested vmexit.  In fact I think calling vmx_vcpu_load() as is 
may work.

Same for nested vmentry.  Once you switch the vmcs, call vmx_vcpu_load() 
and it will update the per-cpu parts of vmcs02.

It will also update per_cpu(current_vmcs) and per_cpu(vcpus_on_vcpu) 
which are needed for smp and for suspend/resume.  You'll also need to 
call vmx_vcpu_put() (but without the __vmx_load_host_state() part).

> +
> +static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
> +{
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	/* yet another strange pre-requisite listed in the VMX spec */
> +	if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO)&  GUEST_INTR_STATE_MOV_SS){
> +		nested_vmx_failValid(vcpu,
> +			VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}

Can't you just let the guest launch and handle the failure if it happens?

Can be done later; just add a TODO.

> +
> +	if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
> +		/* Must use VMLAUNCH for the first time, VMRESUME later */
> +		nested_vmx_failValid(vcpu,
> +			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS :
> +				 VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}

Ditto.  Less critical since it doesn't involve a VMREAD.

> +
> +	skip_emulated_instruction(vcpu);
> +
> +	nested_vmx_run(vcpu);
> +	return 1;
> +}
> +
>
> +
> +static int nested_vmx_run(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->nested.nested_mode = true;
> +	sync_cached_regs_to_vmcs(vcpu);
> +	save_vmcs(vmx->nested.vmcs01_fields);
> +
> +	vmx->nested.l1_state.efer = vcpu->arch.efer;
> +	if (!enable_ept)
> +		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
> +	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
> +	else
> +		vmx->nested.l1_state.msr_bitmap = 0;
> +
> +	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +	vmx->nested.vmcs01 = vmx->vmcs;
> +	vmx->nested.l1_state.cpu = vcpu->cpu;
> +	vmx->nested.l1_state.launched = vmx->launched;
> +
> +	vmx->vmcs = nested_get_current_vmcs(vcpu);
> +	if (!vmx->vmcs) {
> +		printk(KERN_ERR "Missing VMCS\n");

Guest exploitable printk(), remove.  There are debug printk macros 
around, you can use them if they help debugging.

> +		nested_vmx_failValid(vcpu, VMXERR_VMRESUME_CORRUPTED_VMCS);
> +		return 1;
> +	}
> +
> +	vcpu->cpu = vmx->nested.current_vmcs12->cpu;
> +	vmx->launched = vmx->nested.current_vmcs12->launched;

These bits are volatile (changed by process migration) so can only be 
used in preempt disabled contexts.


> +
> +	if (!vmx->nested.current_vmcs12->launch_state || !vmx->launched) {
> +		vmcs_clear(vmx->vmcs);
> +		vmx->launched = 0;
> +		vmx->nested.current_vmcs12->launch_state = 1;

launch_state == 1 -> not launched?  strange.

vmcs_clear() needs to happen on the right cpu.

> +	}
> +
> +	vmx_vcpu_load(vcpu, get_cpu());

Does this not do everything correctly?

I think you need to move the get_cpu() above to disable preemption.  Not 
sure how much.
> +	put_cpu();
> +
> +	prepare_vmcs02(vcpu,
> +		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
> +
> +	if (get_vmcs12_fields(vcpu)->vm_entry_controls&
> +	    VM_ENTRY_IA32E_MODE) {
> +		if (!((vcpu->arch.efer&  EFER_LMA)&&
> +		      (vcpu->arch.efer&  EFER_LME)))

> +			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +	} else {
> +		if ((vcpu->arch.efer&  EFER_LMA) ||
> +		    (vcpu->arch.efer&  EFER_LME))
> +			vcpu->arch.efer = 0;

Clearing all of EFER is incorrect.  Just assign IA32E_MODE 
unconditionally to both EFER.LMA and EFER.LME.

> +	}
> +
> +	vmx->rmode.vm86_active =
> +		!(get_vmcs12_fields(vcpu)->cr0_read_shadow&  X86_CR0_PE);

Needs to be unconditionally false since we don't support real mode 
nested guests.  No need to clear since we can't vmenter from a real mode 
guest.

> +
> +	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
> +	 * dictated, and takes appropriate actions for special cr0 bits (like
> +	 * real mode, etc.).
> +	 */
> +	vmx_set_cr0(vcpu, guest_readable_cr0(get_vmcs12_fields(vcpu)));

Don't we want vmcs12->guest_cr0 here?  guest_readable_cr0() is only 
useful for lmsw and emulated_read_cr().

Paging mode etc. are set by guest_cr0; consider cr0_read_shadow.pg=0 and 
guest_cr0.pg=1.

> +
> +	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
> +	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
> +	 * the latter with with TS added if !fpu_active. We need to take the
> +	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
> +	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
> +	 * restoration of fpu registers until the FPU is really used).
> +	 */
> +	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
> +		(vcpu->fpu_active ? 0 : X86_CR0_TS));

See?

> +
> +	vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
> +	vmcs_writel(CR4_READ_SHADOW,
> +		    get_vmcs12_fields(vcpu)->cr4_read_shadow);
> +
> +	/* we have to set the X86_CR0_PG bit of the cached cr0, because
> +	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
> +	 * CR0 (we need the paging so that KVM treat this guest as a paging
> +	 * guest so we can easly forward page faults to L1.)
> +	 */
> +	vcpu->arch.cr0 |= X86_CR0_PG;

Shouldn't be needed (but should check that guest_cr0 has the always-on 
bits set).

> +
> +	if (enable_ept) {
> +		vmcs_write32(GUEST_CR3, get_vmcs12_fields(vcpu)->guest_cr3);
> +		vmx->vcpu.arch.cr3 = get_vmcs12_fields(vcpu)->guest_cr3;
> +	} else {
> +		int r;
> +		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
> +		kvm_mmu_reset_context(vcpu);
> +
> +		r = kvm_mmu_load(vcpu);
> +		if (unlikely(r)) {
> +			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
> +			/* switch back to L1 */
> +			vmx->nested.nested_mode = false;
> +			vmx->vmcs = vmx->nested.vmcs01;
> +			vcpu->cpu = vmx->nested.l1_state.cpu;
> +			vmx->launched = vmx->nested.l1_state.launched;
> +
> +			vmx_vcpu_load(vcpu, get_cpu());
> +			put_cpu();
> +
> +			return 1;
> +		}
> +	}

I think you can call kvm_set_cr3() unconditionally.  It will return 1 on 
bad cr3 which you can use for failing the entry.

btw, kvm_mmu_reset_context() is needed on ept as well; kvm_mmu_load() is 
not needed AFAICT (the common entry code will take care of it).


-- 
error compiling committee.c: too many arguments to function

next prev parent reply	other threads:[~2010-10-17 15:06 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
2010-10-17 10:04 ` [PATCH 01/27] nVMX: Add "nested" module option to vmx.c Nadav Har'El
2010-10-17 10:04 ` [PATCH 02/27] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
2010-10-17 10:05 ` [PATCH 03/27] nVMX: Implement VMXON and VMXOFF Nadav Har'El
2010-10-17 12:24   ` Avi Kivity
2010-10-17 12:47     ` Nadav Har'El
2010-10-17 13:07   ` Avi Kivity
2010-10-17 10:05 ` [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
2010-10-17 12:31   ` Avi Kivity
2010-10-17 10:06 ` [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2010-10-17 12:34   ` Avi Kivity
2010-10-17 13:18     ` Nadav Har'El
2010-10-17 13:29       ` Avi Kivity
2010-10-17 10:06 ` [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
2010-10-17 12:52   ` Avi Kivity
2010-10-17 10:07 ` [PATCH 07/27] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
2010-10-17 10:07 ` [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
2010-10-17 13:00   ` Avi Kivity
2010-10-17 10:08 ` [PATCH 09/27] nVMX: Success/failure of VMX instructions Nadav Har'El
2010-10-17 10:08 ` [PATCH 10/27] nVMX: Implement VMCLEAR Nadav Har'El
2010-10-17 13:05   ` Avi Kivity
2010-10-17 13:25     ` Nadav Har'El
2010-10-17 13:27       ` Avi Kivity
2010-10-17 13:37         ` Nadav Har'El
2010-10-17 14:12           ` Avi Kivity
2010-10-17 14:14             ` Gleb Natapov
2010-10-17 10:09 ` [PATCH 11/27] nVMX: Implement VMPTRLD Nadav Har'El
2010-10-17 10:09 ` [PATCH 12/27] nVMX: Implement VMPTRST Nadav Har'El
2010-10-17 10:10 ` [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
2010-10-17 13:15   ` Avi Kivity
2010-10-17 10:10 ` [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
2010-10-17 13:25   ` Avi Kivity
2010-10-17 10:11 ` [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2010-10-17 14:08   ` Avi Kivity
2011-02-08 12:13     ` Nadav Har'El
2011-02-08 12:27       ` Avi Kivity
2011-02-08 12:36         ` Nadav Har'El
2011-02-08 12:39           ` Avi Kivity
2011-02-08 12:27       ` Avi Kivity
2010-10-17 10:11 ` [PATCH 16/27] nVMX: Move register-syncing to a function Nadav Har'El
2010-10-17 10:12 ` [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
2010-10-17 15:06   ` Avi Kivity [this message]
2010-10-17 10:12 ` [PATCH 18/27] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
2010-10-17 10:13 ` [PATCH 19/27] nVMX: Exiting from L2 to L1 Nadav Har'El
2010-10-17 15:58   ` Avi Kivity
2010-10-17 10:13 ` [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2010-10-20 12:13   ` Avi Kivity
2010-10-20 14:57     ` Avi Kivity
2010-10-17 10:14 ` [PATCH 21/27] nVMX: Correct handling of interrupt injection Nadav Har'El
2010-10-17 10:14 ` [PATCH 22/27] nVMX: Correct handling of exception injection Nadav Har'El
2010-10-17 10:15 ` [PATCH 23/27] nVMX: Correct handling of idt vectoring info Nadav Har'El
2010-10-17 10:15 ` [PATCH 24/27] nVMX: Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
2010-10-17 10:16 ` [PATCH 25/27] nVMX: Additional TSC-offset handling Nadav Har'El
2010-10-19 19:13   ` Zachary Amsden
2010-10-17 10:16 ` [PATCH 26/27] nVMX: Miscellenous small corrections Nadav Har'El
2010-10-17 10:17 ` [PATCH 27/27] nVMX: Documentation Nadav Har'El

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4CBB10EC.8090103@redhat.com \
    --to=avi@redhat.com \
    --cc=gleb@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=nyh@il.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.