From: Mukesh Rathor <mukesh.rathor@oracle.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
Subject: Re: [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization
Date: Mon, 18 Mar 2013 18:00:36 -0700
Message-ID: <20130318180036.211c57e9@mantra.us.oracle.com>
In-Reply-To: <20130318152843.GK24560@phenom.dumpdata.com>
On Mon, 18 Mar 2013 11:28:43 -0400
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote:
> > This patch mainly contains code to create a VMCS for PVH guest, and
> > HVM specific vcpu/domain creation code.
> >
> > Changes in V2:
> > - Avoid call to hvm_do_resume() at call site rather than return
> > in it.
> > - Return for PVH vmx_do_resume prior to intel debugger stuff.
> >
> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> > ---
> > xen/arch/x86/hvm/hvm.c      |  90 ++++++++++++++-
> > xen/arch/x86/hvm/vmx/vmcs.c | 266 ++++++++++++++++++++++++++++++++++++++++++-
> > xen/arch/x86/hvm/vmx/vmx.c  |  34 ++++++
> > 3 files changed, 383 insertions(+), 7 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index ea7adf6..18889ad 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -510,6 +510,29 @@ static int hvm_print_line(
> > return X86EMUL_OKAY;
> > }
> >
> > +static int hvm_pvh_dom_initialise(struct domain *d)
> > +{
> > + int rc;
> > +
> > + if (!d->arch.hvm_domain.hap_enabled)
> > + return -EINVAL;
> > +
> > + spin_lock_init(&d->arch.hvm_domain.irq_lock);
> > + hvm_init_guest_time(d);
> > +
> > + hvm_init_cacheattr_region_list(d);
> > +
> > + if ( (rc=paging_enable(d,
> > PG_refcounts|PG_translate|PG_external)) != 0 )
> > + goto fail1; <=================================== GOTO
> > +
> > + if ( (rc = hvm_funcs.domain_initialise(d)) == 0 )
> > + return 0;
> > +
> > +fail1:
>
> I don't think you need the label here? You are not doing a goto.
There is a goto right above; I marked it with the arrow.
> > long))hvm_assert_evtchn_irq,
> > + (unsigned long)v );
> > +
> > + v->arch.hvm_vcpu.hcall_64bit = 1;
> > + v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN;
> > + v->arch.user_regs.eflags = 2;
>
> So that sets the Reserved flag. Could you include a comment
> explaining why.. Ah, is it b/c we later on bit-shift it and use it to
> figure out whether IOPL needs to be virtualized in
> arch_set_info_guest? Or is it just b/c this function is based off
> hvm_vcpu_initialise? If so, since you are being called by it, can you
> skip it?
That reserved bit (EFLAGS bit 1) must be set for bootstrap. It is set in
other places as well, such as arch_set_info_guest():
v->arch.user_regs.eflags |= 2;
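
A minimal sketch of the two idioms in play here, not the actual Xen code.
Bit 1 of EFLAGS is architecturally reserved and always reads as 1, so an
initial flags image starts at 2 and any later update ORs the bit back in.
The macro and function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* EFLAGS bit 1 is reserved and must be 1. Hypothetical name for this sketch. */
#define EFLAGS_RESERVED_BIT1 0x2UL

/* Initial flags image for a freshly created vcpu: only the fixed bit set. */
static uint64_t initial_eflags(void)
{
    return EFLAGS_RESERVED_BIT1;
}

/* Force the fixed bit on in a flags value supplied by the guest or toolstack. */
static uint64_t sanitize_eflags(uint64_t eflags)
{
    return eflags | EFLAGS_RESERVED_BIT1;
}
```

With these, initial_eflags() mirrors the "= 2" assignment in the patch and
sanitize_eflags() mirrors the "|= 2" in arch_set_info_guest().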
>
> > + v->arch.hvm_vcpu.inject_trap.vector = -1;
> > +
> > + if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) {
>
> The syntax here is off.
Hmm... do you mean the missing spaces around "=" in rc=hvm*?
> > int hvm_vcpu_initialise(struct vcpu *v)
> > {
> > int rc;
> > struct domain *d = v->domain;
> > - domid_t dm_domid =
> > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> > + domid_t dm_domid;
>
> Not sure I follow, why the move of it further down?
params is not defined/allocated for PVH.
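
A minimal sketch of why the initialisation moves, using hypothetical
stand-in types rather than the real struct domain. The declaration stays at
the top, but the read of params[] happens only after the PVH case has
returned, so a PVH domain with no params array is never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct domain and HVM_PARAM_DM_DOMAIN. */
#define PARAM_DM_DOMAIN_SKETCH 1

struct dom_sketch {
    int is_pvh;
    const unsigned long *params;   /* not allocated (NULL) for PVH */
};

/* Returns the device-model domid for HVM, or 'fallback' for PVH. */
static unsigned long vcpu_init_dm_domid(const struct dom_sketch *d,
                                        unsigned long fallback)
{
    unsigned long dm_domid;        /* declared, not yet read from params */

    if (d->is_pvh)
        return fallback;           /* PVH path never touches d->params */

    dm_domid = d->params[PARAM_DM_DOMAIN_SKETCH];
    return dm_domid;
}
```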
> > + /* VMCS controls. */
> > + vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS;
> > + __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control);
> > +
> > + v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
> > +
> > + /* if rdtsc exiting is turned on and it goes thru
> > emulate_privileged_op,
> > + * then pv_vcpu.ctrlreg must be added to pvh struct */
>
> That would be the 'timer_mode' syntax in the guest config right?
> Perhaps then a check at the top of the function to see which
> timer_mode is used and exit out with -ENOSYS?
The vtsc setting. We set it to 0 for PVH guests.
>
> > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
> > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > +
> > + v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
> > + CPU_BASED_CR3_LOAD_EXITING |
> > + CPU_BASED_CR3_STORE_EXITING);
> > + v->arch.hvm_vmx.exec_control |=
> > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
>
> Is that b/c the PV code ends up making the SCHED_yield_op hypercall
> and we don't need the monitor/mwait capability? If so, could you add
> that comment in please?
No, MTF is a debugging feature used mostly for single-stepping.
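
To make the distinction concrete, here is a small sketch of the bit
manipulation from the hunk above. The monitor trap flag is a separate bit
from the MONITOR and MWAIT exiting controls, so clearing it leaves those
untouched. The constant names are hypothetical, and the values are the bit
positions as I read them from the Intel SDM; check them against vmcs.h
before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Primary processor-based execution controls (assumed values, see above). */
#define CTL_USE_TSC_OFFSETING   0x00000008u
#define CTL_INVLPG_EXITING      0x00000200u
#define CTL_MWAIT_EXITING       0x00000400u
#define CTL_RDTSC_EXITING       0x00001000u
#define CTL_CR3_LOAD_EXITING    0x00008000u
#define CTL_CR3_STORE_EXITING   0x00010000u
#define CTL_MONITOR_TRAP_FLAG   0x08000000u   /* MTF: single-step debugging */
#define CTL_MONITOR_EXITING     0x20000000u   /* MONITOR instruction exits */
#define CTL_ACTIVATE_SECONDARY  0x80000000u

/* Apply the same adjustments the PVH hunk makes to exec_control. */
static uint32_t pvh_exec_control_sketch(uint32_t ctl)
{
    ctl &= ~(CTL_RDTSC_EXITING | CTL_USE_TSC_OFFSETING);
    ctl &= ~(CTL_INVLPG_EXITING | CTL_CR3_LOAD_EXITING |
             CTL_CR3_STORE_EXITING);
    ctl |= CTL_ACTIVATE_SECONDARY;
    ctl &= ~CTL_MONITOR_TRAP_FLAG;   /* does not affect MONITOR/MWAIT bits */
    return ctl;
}
```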
> > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS,
> > msr_type);
> > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP,
> > msr_type);
> > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP,
> > msr_type);
> > + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
> > msr_type);
>
> So this looks like the one vmcs.c except that one has this extra:
>
> if ( cpu_has_vmx_pat && paging_mode_hap(d) )
>     vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
>                                   MSR_TYPE_R | MSR_TYPE_W);
> }
>
> Did you miss that?
I'll add it. I guess default must be disabled.
> > +
> > + /* pure hvm doesn't do this. safe? see: long_mode_do_msr_write() */
> > +#if 0
> > + vmx_disable_intercept_for_msr(v, MSR_STAR);
> > + vmx_disable_intercept_for_msr(v, MSR_LSTAR);
> > + vmx_disable_intercept_for_msr(v, MSR_CSTAR);
> > + vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK);
> > +#endif
>
> I would just provide a comment saying:
>
> /*
>  * long_mode_do_msr_write() takes care of
>  * MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK]
>  */
Good idea. I left the "#if 0" there to invite suggestions.
> > + } else {
> > + printk("PVH: CPU does NOT have msr bitmap\n");
>
> Perhaps:
>
> printk(XENLOG_G_ERR "%s: ..\n", __func__);
> ?
> > + return -EINVAL;
> > + }
> > +
> > + if ( !cpu_has_vmx_vpid ) {
> > + printk("PVH: At present VPID support is required to run
> > PVH\n");
>
> Should you de-allocate msr_bitmap at this point?
>
> Or perhaps move this check (and the one below) to the start of the
> function? So you have:
>
> if ( !cpu_has_vmx_vpid )
> gdprintk ("%s: VPID required for PVH mode.\n",
> __func__);
>
> if ( !cpu_has_vmx_secondary_exec_control )
> .. bla bla
>
>
> > + return -EINVAL;
> > + }
> > +
> > + v->arch.hvm_vmx.secondary_exec_control = vmx_secondary_exec_control;
> > +
> > + if ( cpu_has_vmx_secondary_exec_control ) {
> > + v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn off all */
> > + v->arch.hvm_vmx.secondary_exec_control |=
> > + SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> > + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID;
> > +
> > + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_EPT;
> > + __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> > + v->arch.hvm_vmx.secondary_exec_control);
> > + } else {
> > + printk("PVH: NO Secondary Exec control\n");
> > + return -EINVAL;
>
> Ditto - should you de-allocate msr_bitmap ? Or if you are going to
> move the check for cpu_has_vmx_secondary_exec_control, then there is
> no need for this if (.. ) else ..
>
>
> > + }
> > +
> > + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
> > +
> > + #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4
> > + #define VM_ENTRY_LOAD_EFER 0x8000
> > + vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS;
> > + vmentry_ctl &= ~VM_ENTRY_LOAD_EFER;
> > + vmentry_ctl &= ~VM_ENTRY_SMM;
> > + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
> > + vmentry_ctl |= VM_ENTRY_IA32E_MODE;
> > + __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
> > +
>
> From here on, it looks mostly the same as construct_vmcs right?
>
> Perhaps you can add a comment saying so - so when a cleanup effort
> is done later on - these can be candidates for it?
>
> > + /* MSR intercepts. */
> > + __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
> > + __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
> > + __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
> > +
> > + /* Host data selectors. */
> > + __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> > + __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> > + __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> > + __vmwrite(HOST_FS_SELECTOR, 0);
> > + __vmwrite(HOST_GS_SELECTOR, 0);
> > + __vmwrite(HOST_FS_BASE, 0);
> > + __vmwrite(HOST_GS_BASE, 0);
> > +
> > + vmx_set_host_env(v);
> > +
> > + /* Host control registers. */
> > + v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> > + __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> > + __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ?
> > X86_CR4_OSXSAVE : 0));
>
> That formatting looks odd.
Copied from the HVM code. What's wrong?
> > + /* Set default guest context values here. Some of these are
> > then overwritten
> > + * in vmx_pvh_set_vcpu_info() by guest itself during vcpu
> > bringup */
> > + __vmwrite(GUEST_CS_BASE, 0);
> > + __vmwrite(GUEST_CS_LIMIT, ~0u);
> > + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */
> > + __vmwrite(GUEST_CS_SELECTOR, 0x10);
>
> 0x10. Could you use a #define for it? Somehow I thought it would
> be running in FLAT_KERNEL_CS but that would be odd. And of course
> since we are booting in the Linux kernel without the PV MMU we would
> be using its native segments. So this would correspond to
> GDT_ENTRY_KERNEL_CS right? Might want to mention that
> so if the Linux kernel alters its GDT page we don't blow up?
>
> Though I guess it does not matter - these are just the initial
> bootstrap segments. Presumably the load_gdt in the Linux kernel
> later on resets it to whatever the "new" GDT is.
Correct:
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
And load_gdt loads a new GDT natively.
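
As a worked example of how 0x10 falls out of that #define: a selector is the
GDT index shifted left by 3, with the table indicator in bit 2 and the RPL
in bits 1:0. The sketch below assumes GDT_ENTRY_KERNEL_CS is 2, which is my
understanding of the 64-bit Linux layout; the macro and helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 64-bit Linux value; hypothetical names to avoid reserved ones. */
#define GDT_ENTRY_KERNEL_CS_SKETCH 2
#define KERNEL_CS_SEL (GDT_ENTRY_KERNEL_CS_SKETCH * 8)

/* Descriptor index: bits 15:3 of the selector. */
static unsigned int sel_index(uint16_t sel) { return sel >> 3; }

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
static unsigned int sel_ti(uint16_t sel) { return (sel >> 2) & 1u; }

/* Requested privilege level: bits 1:0. */
static unsigned int sel_rpl(uint16_t sel) { return sel & 3u; }
```

So 0x10 decodes to GDT entry 2 at RPL 0, which is what the initial
GUEST_CS_SELECTOR write encodes.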
> > + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0);
> > + __vmwrite(GUEST_DR7, 0);
> > + __vmwrite(VMCS_LINK_POINTER, ~0UL);
> > +
> > + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
> > + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);
>
> Weird. In the vmcs.c file these are somewhat higher in the code.
Yes. I didn't just copy the existing function; I created a separate PVH
function to make it easier for PVH.
> > +
> > + v->arch.hvm_vmx.exception_bitmap =
> > + HVM_TRAP_MASK | (1 <<
> > TRAP_debug) |
> > + (1U << TRAP_int3) | (1U <<
> > TRAP_no_device);
>
> Odd syntax.
Similar to the existing HVM code. What's wrong?
> > + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
> > +
> > + __vmwrite(TSC_OFFSET, 0);
>
> Hm, so you did earlier:
>
> v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
>
> so is this necessary? Or is it just that you want it to be set
> to default baseline state?
Not necessary, doesn't hurt either. I can remove it.
> > +
> > + /* Set WP bit so rdonly pages are not written from CPL 0 */
> > + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP;
> > + __vmwrite(GUEST_CR0, tmpval);
> > + __vmwrite(CR0_READ_SHADOW, tmpval);
> > + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = tmpval;
> > +
> > + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> > + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
> > + if ( (tmpval & required) != required )
> > + {
> > + printk("PVH: required CR4 features not available:%lx\n",
> > required);
> > + return -EINVAL;
>
> You might want to move that to the top of the code. Or if you want
> it here, then at least free the msr_bitmap
I think I'll just move all the checks to the top of the function.
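
A rough sketch of that reordering, with hypothetical stand-in types and
names rather than the real vcpu and cpu_has_* macros. All capability checks
run before anything is allocated, so no early -EINVAL return can leak the
MSR bitmap and no error path needs to free it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the cpu_has_vmx_* checks and the vcpu. */
struct cpu_caps_sketch { int msr_bitmap, vpid, secondary_ctls; };
struct vcpu_sketch { void *msr_bitmap; };

static int pvh_construct_sketch(struct vcpu_sketch *v,
                                const struct cpu_caps_sketch *c)
{
    /* 1. Validate everything first; nothing to undo on failure. */
    if (!c->msr_bitmap || !c->vpid || !c->secondary_ctls)
        return -EINVAL;

    /* 2. Only now allocate; later steps can no longer fail a check. */
    v->msr_bitmap = calloc(1, 4096);
    if (v->msr_bitmap == NULL)
        return -ENOMEM;

    return 0;
}
```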
> > {
> > struct domain *d = v->domain;
> > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v)
> >
> > vmx_vmcs_enter(v);
> >
> > + if ( is_pvh_vcpu(v) ) {
> > + int rc = pvh_construct_vmcs(v);
> > + vmx_vmcs_exit(v);
>
> Do you need to call paging_update_paging_modes as construct_vmcs()
> does?
Nope. We don't need to, as arch_set_info_guest() does it for PVH.
Thanks Konrad.
Mukesh