From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Mukesh Rathor <mukesh.rathor@oracle.com>
Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
Subject: Re: [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization
Date: Tue, 19 Mar 2013 09:23:30 -0400 [thread overview]
Message-ID: <20130319132330.GD2706@phenom.dumpdata.com> (raw)
In-Reply-To: <20130318180036.211c57e9@mantra.us.oracle.com>
On Mon, Mar 18, 2013 at 06:00:36PM -0700, Mukesh Rathor wrote:
> On Mon, 18 Mar 2013 11:28:43 -0400
> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>
> > On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote:
> > > This patch mainly contains code to create a VMCS for PVH guest, and
> > > HVM specific vcpu/domain creation code.
> > >
> > > Changes in V2:
> > > - Avoid call to hvm_do_resume() at call site rather than return
> > > in it.
> > > - Return for PVH vmx_do_resume prior to intel debugger stuff.
> > >
> > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> > > ---
> > > xen/arch/x86/hvm/hvm.c | 90 ++++++++++++++-
> > > xen/arch/x86/hvm/vmx/vmcs.c | 266
> > > ++++++++++++++++++++++++++++++++++++++++++-
> > > xen/arch/x86/hvm/vmx/vmx.c | 34 ++++++ 3 files changed, 383
> > > insertions(+), 7 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > > index ea7adf6..18889ad 100644
> > > --- a/xen/arch/x86/hvm/hvm.c
> > > +++ b/xen/arch/x86/hvm/hvm.c
> > > @@ -510,6 +510,29 @@ static int hvm_print_line(
> > > return X86EMUL_OKAY;
> > > }
> > >
> > > +static int hvm_pvh_dom_initialise(struct domain *d)
> > > +{
> > > + int rc;
> > > +
> > > + if (!d->arch.hvm_domain.hap_enabled)
> > > + return -EINVAL;
> > > +
> > > + spin_lock_init(&d->arch.hvm_domain.irq_lock);
> > > + hvm_init_guest_time(d);
> > > +
> > > + hvm_init_cacheattr_region_list(d);
> > > +
> > > + if ( (rc=paging_enable(d,
> > > PG_refcounts|PG_translate|PG_external)) != 0 )
> > > + goto fail1; <=================================== GOTO
> > > +
> > > + if ( (rc = hvm_funcs.domain_initialise(d)) == 0 )
> > > + return 0;
> > > +
> > > +fail1:
> >
> > I don't think you need the label here? You are not doing an goto.
>
> Right above.
Duh!
>
> > > long))hvm_assert_evtchn_irq,
> > > + (unsigned long)v );
> > > +
> > > + v->arch.hvm_vcpu.hcall_64bit = 1;
> > > + v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN;
> > > + v->arch.user_regs.eflags = 2;
> >
> > So that sets the Reserved flag. Could you include a comment
> > explaining why.. Ah, is it b/c we later on bit-shift it and use it to
> > figure out whether IOPL needs to be virtualized in
> > arch_set_info_guest? Or is it just b/c this function is based off
> > hvm_vcpu_initialise? If so, since you are being called by it, can you
> > skip it?
>
> That resvd bit is required to be set for bootstrap. Set in other places
> also, like arch_set_info_guest():
>
> v->arch.user_regs.eflags |= 2;
OK, but why? I am not seeing that bit defined? Or was it needed to get
the machine to boot up.
>
> >
> > > + v->arch.hvm_vcpu.inject_trap.vector = -1;
> > > +
> > > + if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) {
> >
> > The syntax here is off.
>
> Hmm... space surrounding "=" in rc=hvm* ?
Yes, like this:
385 if ( (rc = vcpu_init_fpu(v)) != 0 )
>
> > > int hvm_vcpu_initialise(struct vcpu *v)
> > > {
> > > int rc;
> > > struct domain *d = v->domain;
> > > - domid_t dm_domid =
> > > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> > > + domid_t dm_domid;
> >
> > Not sure I follow, why the move of it further down?
>
> params is not defined/allocated for PVH.
But you are still using it in the code. As in, you are still
fetching it (just later).
>
> > > + /* VMCS controls. */
> > > + vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS;
> > > + __vmwrite(PIN_BASED_VM_EXEC_CONTROL,
> > > vmx_pin_based_exec_control); +
> > > + v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
> > > +
> > > + /* if rdtsc exiting is turned on and it goes thru
> > > emulate_privileged_op,
> > > + * then pv_vcpu.ctrlreg must be added to pvh struct */
> >
> > That would be the 'timer_mode' syntax in the guest config right?
> > Perhaps then a check at the top of the function to see which
> > timer_mode is used and exit out with -ENOSYS?
>
> The vtsc setting. We set it to 0 for PVH guests.
>
OK, so that is the the 'timer_mode' always set to '2' or rather
timer_mode="native" (in the guest config). Which then does
the xc_domain_set_tsc_info hypercall to set the vtsc.
You need to document that please.
> >
> > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
> > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > > +
> > > + v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
> > > + CPU_BASED_CR3_LOAD_EXITING |
> > > + CPU_BASED_CR3_STORE_EXITING);
> > > + v->arch.hvm_vmx.exec_control |=
> > > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
> >
> > Is that b/c the PV code ends up making the SCHED_yield_op hypercall
> > and we don't need the monitor/mwait capability? If so, could you add
> > that comment in please?
>
> No, MTF is debugging feature used mostly for single step.
>
> > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS,
> > > msr_type);
> > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP,
> > > msr_type);
> > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP,
> > > msr_type);
> > > + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
> > > msr_type);
> >
> > So this looks like the one vmcs.c except that one has this extra:
> >
> > 895 if ( cpu_has_vmx_pat && paging_mode_hap(d) )
> > 896 vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
> > MSR_TYPE_R | MSR_TYPE_W); 897 }
> >
> > Did you miss that?
>
> I'll add it. I guess default must be disabled.
>
> > > +
> > > + /* pure hvm doesn't do this. safe? see:
> > > long_mode_do_msr_write() */ +#if 0
> > > + vmx_disable_intercept_for_msr(v, MSR_STAR);
> > > + vmx_disable_intercept_for_msr(v, MSR_LSTAR);
> > > + vmx_disable_intercept_for_msr(v, MSR_CSTAR);
> > > + vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK);
> > > +#endif
> >
> > I would just provide a comment saying:
> >
> > /*
> > * long_mode_do_msr_write() takes care of
> > MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK] */
>
> Good Idea. I left the "#if 0" there for suggestion.
>
>
> > > + } else {
> > > + printk("PVH: CPU does NOT have msr bitmap\n");
> >
> > Perhaps:
> >
> > printk(XENLOG_G_ERR "%s: ..\n", __func__);
> > ?
> > > + return -EINVAL;
> > > + }
> > > +
> > > + if ( !cpu_has_vmx_vpid ) {
> > > + printk("PVH: At present VPID support is required to run
> > > PVH\n");
> >
> > Should you de-allocate msr_bitmap at this point?
> >
> > Or perhaps move this check (and the one below) to the start of the
> > function? So you have:
> >
> > if ( !cpu_has_vmx_vpid )
> > gdprintk ("%s: VPID required for PVH mode.\n",
> > __func__);
> >
> > if ( !cpu_has_vmx_secondary_exec_control )
> > .. bla bla
> >
> >
> > > + return -EINVAL;
> > > + }
> > > +
> > > + v->arch.hvm_vmx.secondary_exec_control =
> > > vmx_secondary_exec_control; +
> > > + if ( cpu_has_vmx_secondary_exec_control ) {
> > > + v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn
> > > off all */
> > > + v->arch.hvm_vmx.secondary_exec_control |=
> > > +
> > > SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> > > + v->arch.hvm_vmx.secondary_exec_control |=
> > > SECONDARY_EXEC_ENABLE_VPID; +
> > > + v->arch.hvm_vmx.secondary_exec_control |=
> > > SECONDARY_EXEC_ENABLE_EPT;
> > > + __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> > > + v->arch.hvm_vmx.secondary_exec_control);
> > > + } else {
> > > + printk("PVH: NO Secondary Exec control\n");
> > > + return -EINVAL;
> >
> > Ditto - should you de-allocate msr_bitmap ? Or if you are going to
> > move the check for cpu_has_vmx_secondary_exec_control, then there is
> > no need for this if (.. ) else ..
> >
> >
> > > + }
> > > +
> > > + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
> > > +
> > > + #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4
> > > + #define VM_ENTRY_LOAD_EFER 0x8000
> > > + vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS;
> > > + vmentry_ctl &= ~VM_ENTRY_LOAD_EFER;
> > > + vmentry_ctl &= ~VM_ENTRY_SMM;
> > > + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
> > > + vmentry_ctl |= VM_ENTRY_IA32E_MODE;
> > > + __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
> > > +
> >
> > From here on, it looks mostly the same as construct_vmcs right?
> >
> > Perhaps you can add a comment saying so - so when a cleanup effort
> > is done later on - these can be candidates for it?
> >
> > > + /* MSR intercepts. */
> > > + __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
> > > + __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
> > > + __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
> > > +
> > > + /* Host data selectors. */
> > > + __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> > > + __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> > > + __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> > > + __vmwrite(HOST_FS_SELECTOR, 0);
> > > + __vmwrite(HOST_GS_SELECTOR, 0);
> > > + __vmwrite(HOST_FS_BASE, 0);
> > > + __vmwrite(HOST_GS_BASE, 0);
> > > +
> > > + vmx_set_host_env(v);
> > > +
> > > + /* Host control registers. */
> > > + v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> > > + __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> > > + __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ?
> > > X86_CR4_OSXSAVE : 0));
> >
> > That formatting looks odd.
>
> Copied from hvm code. whats wrong?
The HVM code has some extra spaces.
>
> > > + /* Set default guest context values here. Some of these are
> > > then overwritten
> > > + * in vmx_pvh_set_vcpu_info() by guest itself during vcpu
> > > bringup */
> > > + __vmwrite(GUEST_CS_BASE, 0);
> > > + __vmwrite(GUEST_CS_LIMIT, ~0u);
> > > + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */
> > > + __vmwrite(GUEST_CS_SELECTOR, 0x10);
> >
> > 0x10. Could you use a #define for it? Somehow I thought it would
> > be running in FLAT_KERNEL_CS but that would be odd. And of course
> > since are booting in the Linux kernel without the PV MMU we would
> > be using its native segments. So this would correspond to
> > GDT_ENTRY_KERNEL_CS right? Might want to mention that
> > so if the Linux kernel alters its GDT page we don't blow up?
> >
> > Thought I guess it does not matter - this is just the initial
> > bootstrap segments. Presumarily the load_gdt in the Linux kernel
> > later on resets it to whatever the "new" GDT is.
>
> Correct:
> #define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
>
> And load_gdt loads a new GDT natively.
>
>
> > > + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0);
> > > + __vmwrite(GUEST_DR7, 0);
> > > + __vmwrite(VMCS_LINK_POINTER, ~0UL);
> > > +
> > > + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
> > > + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);
> >
> > Weird. In the vmcs.c file these are somewhat higher in the code.
>
> Yes. I just didn't copy the existing function, but created PVH function
> to make it easier for PVH.
Sure, but the next step (not these patches) will be to collapse some
of the redundant logic.
>
> > > +
> > > + v->arch.hvm_vmx.exception_bitmap =
> > > + HVM_TRAP_MASK | (1 <<
> > > TRAP_debug) |
> > > + (1U << TRAP_int3) | (1U <<
> > > TRAP_no_device);
> >
> > Odd syntax.
>
> Similar to existing hvm code, whats wrong?
Tabs. The HVM looks different (I think it uses spaces?)
>
>
> > > + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
> > > +
> > > + __vmwrite(TSC_OFFSET, 0);
> >
> > Hm, so you did earlier:
> >
> > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> >
> > so is this neccessary? Or is just that you want it to be set
> > to default baseline state?
>
> Not necessary, doesn't hurt either. I can remove it.
>
> > > +
> > > + /* Set WP bit so rdonly pages are not written from CPL 0 */
> > > + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP;
> > > + __vmwrite(GUEST_CR0, tmpval);
> > > + __vmwrite(CR0_READ_SHADOW, tmpval);
> > > + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] =
> > > tmpval; +
> > > + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> > > + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
> > > + if ( (tmpval & required) != required )
> > > + {
> > > + printk("PVH: required CR4 features not available:%lx\n",
> > > required);
> > > + return -EINVAL;
> >
> > You might want to move that to the top of the code. Or if you want
> > it here, then at least free the msr_bitmap
>
> I think I'll just move all the checks top of the code.
>
> > > {
> > > struct domain *d = v->domain;
> > > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v)
> > >
> > > vmx_vmcs_enter(v);
> > >
> > > + if ( is_pvh_vcpu(v) ) {
> > > + int rc = pvh_construct_vmcs(v);
> > > + vmx_vmcs_exit(v);
> >
> > Do you need to call paging_update_paging_modes as construct_vmcs()
> > does?
>
> Nop. We don't need to as the arch_set_info_guest() does it for PVH.
Ok, could you provide a comment please? That way when one looks at this
code and the HVM one it will be clear.
>
>
> Thanks Konrad.
Of course. Thank you for bearing with this tiring review process.
> Mukesh
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
next prev parent reply other threads:[~2013-03-19 13:23 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-03-16 0:39 [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization Mukesh Rathor
2013-03-18 12:03 ` Jan Beulich
2013-03-18 15:28 ` Konrad Rzeszutek Wilk
2013-03-19 1:00 ` Mukesh Rathor
2013-03-19 9:19 ` Jan Beulich
2013-03-19 13:23 ` Konrad Rzeszutek Wilk [this message]
2013-03-26 22:30 ` Mukesh Rathor
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130319132330.GD2706@phenom.dumpdata.com \
--to=konrad.wilk@oracle.com \
--cc=Xen-devel@lists.xensource.com \
--cc=mukesh.rathor@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).