Re: [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Mukesh Rathor <mukesh.rathor@oracle.com>
Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
Subject: Re: [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization
Date: Tue, 19 Mar 2013 09:23:30 -0400	[thread overview]
Message-ID: <20130319132330.GD2706@phenom.dumpdata.com> (raw)
In-Reply-To: <20130318180036.211c57e9@mantra.us.oracle.com>

On Mon, Mar 18, 2013 at 06:00:36PM -0700, Mukesh Rathor wrote:
> On Mon, 18 Mar 2013 11:28:43 -0400
> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> 
> > On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote:
> > > This patch mainly contains code to create a VMCS for PVH guest, and
> > > HVM specific vcpu/domain creation code.
> > > 
> > > Changes in V2:
> > >   - Avoid call to hvm_do_resume() at call site rather than return
> > > in it.
> > >   - Return for PVH vmx_do_resume prior to intel debugger stuff.
> > > 
> > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> > > ---
> > >  xen/arch/x86/hvm/hvm.c      |   90 ++++++++++++++-
> > >  xen/arch/x86/hvm/vmx/vmcs.c |  266
> > > ++++++++++++++++++++++++++++++++++++++++++-
> > > xen/arch/x86/hvm/vmx/vmx.c  |   34 ++++++ 3 files changed, 383
> > > insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > > index ea7adf6..18889ad 100644
> > > --- a/xen/arch/x86/hvm/hvm.c
> > > +++ b/xen/arch/x86/hvm/hvm.c
> > > @@ -510,6 +510,29 @@ static int hvm_print_line(
> > >      return X86EMUL_OKAY;
> > >  }
> > >  
> > > +static int hvm_pvh_dom_initialise(struct domain *d)
> > > +{
> > > +    int rc;
> > > +
> > > +    if (!d->arch.hvm_domain.hap_enabled)
> > > +        return -EINVAL;
> > > +
> > > +    spin_lock_init(&d->arch.hvm_domain.irq_lock);
> > > +    hvm_init_guest_time(d);
> > > +
> > > +    hvm_init_cacheattr_region_list(d);
> > > +
> > > +    if ( (rc=paging_enable(d,
> > > PG_refcounts|PG_translate|PG_external)) != 0 )
> > > +        goto fail1;        <=================================== GOTO
> > > +
> > > +    if ( (rc = hvm_funcs.domain_initialise(d)) == 0 )
> > > +        return 0;
> > > +
> > > +fail1:
> > 
> > I don't think you need the label here? You are not doing an goto.
> 
> Right above.

Duh!
> 
> > > long))hvm_assert_evtchn_irq,
> > > +                          (unsigned long)v );
> > > +
> > > +    v->arch.hvm_vcpu.hcall_64bit = 1;
> > > +    v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN;
> > > +    v->arch.user_regs.eflags = 2;
> > 
> > So that sets the Reserved flag. Could you include a comment
> > explaining why.. Ah, is it b/c we later on bit-shift it and use it to
> > figure out whether IOPL needs to be virtualized in
> > arch_set_info_guest? Or is it just b/c this function is based off
> > hvm_vcpu_initialise? If so, since you are being called by it, can you
> > skip it?
> 
> That resvd bit is required to be set for bootstrap. Set in other places
> also, like arch_set_info_guest():
> 
> v->arch.user_regs.eflags |= 2;

OK, but why? I am not seeing that bit defined? Or was it needed to get
the machine to boot up.
> 
> > 
> > > +    v->arch.hvm_vcpu.inject_trap.vector = -1;
> > > +
> > > +    if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) {
> > 
> > The syntax here is off.
> 
> Hmm... space surrounding "=" in rc=hvm* ?

Yes, like this:
 385     if ( (rc = vcpu_init_fpu(v)) != 0 )

> 
> > >  int hvm_vcpu_initialise(struct vcpu *v)
> > >  {
> > >      int rc;
> > >      struct domain *d = v->domain;
> > > -    domid_t dm_domid =
> > > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> > > +    domid_t dm_domid; 
> > 
> > Not sure I follow, why the move of it further down?
> 
> params is not defined/allocated for PVH.

But you are still using it in the code. As in, you are still
fetching it (just later).

> 
> > > +    /* VMCS controls. */
> > > +    vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS;
> > > +    __vmwrite(PIN_BASED_VM_EXEC_CONTROL,
> > > vmx_pin_based_exec_control); +
> > > +    v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
> > > +
> > > +    /* if rdtsc exiting is turned on and it goes thru
> > > emulate_privileged_op,
> > > +     * then pv_vcpu.ctrlreg must be added to pvh struct */
> > 
> > That would be the 'timer_mode' syntax in the guest config right?
> > Perhaps then a check at the top of the function to see which
> > timer_mode is used and exit out with -ENOSYS?
> 
> The vtsc setting. We set it to 0 for PVH guests.
> 

OK, so that is the the 'timer_mode' always set to '2' or rather
timer_mode="native" (in the guest config). Which then does
the xc_domain_set_tsc_info hypercall to set the vtsc.

You need to document that please.

> > 
> > > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
> > > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > > +
> > > +    v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
> > > +                                      CPU_BASED_CR3_LOAD_EXITING |
> > > +                                      CPU_BASED_CR3_STORE_EXITING);
> > > +    v->arch.hvm_vmx.exec_control |=
> > > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
> > 
> > Is that b/c the PV code ends up making the SCHED_yield_op hypercall
> > and we don't need the monitor/mwait capability? If so, could you add
> > that comment in please?
> 
> No, MTF is debugging feature used mostly for single step.
> 
> > > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS,
> > > msr_type);
> > > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP,
> > > msr_type);
> > > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP,
> > > msr_type);
> > > +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
> > > msr_type);
> > 
> > So this looks like the one vmcs.c except that one has this extra:
> > 
> >  895         if ( cpu_has_vmx_pat && paging_mode_hap(d) )
> >  896             vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
> > MSR_TYPE_R | MSR_TYPE_W); 897     }
> > 
> > Did you miss that?
> 
> I'll add it. I guess default must be disabled.
> 
> > > +
> > > +        /* pure hvm doesn't do this. safe? see:
> > > long_mode_do_msr_write() */ +#if 0
> > > +        vmx_disable_intercept_for_msr(v, MSR_STAR);
> > > +        vmx_disable_intercept_for_msr(v, MSR_LSTAR);
> > > +        vmx_disable_intercept_for_msr(v, MSR_CSTAR);
> > > +        vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK);
> > > +#endif
> > 
> > I would just provide a comment saying:
> > 
> > 	/*
> > 	* long_mode_do_msr_write() takes care of
> > MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK] */
> 
> Good Idea. I left the "#if 0" there for suggestion.
> 
> 
> > > +    } else {
> > > +        printk("PVH: CPU does NOT have msr bitmap\n");
> > 
> > Perhaps:
> > 
> > 	printk(XENLOG_G_ERR "%s: ..\n", __func__);
> > ?
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    if ( !cpu_has_vmx_vpid ) {
> > > +        printk("PVH: At present VPID support is required to run
> > > PVH\n");
> > 
> > Should you de-allocate msr_bitmap at this point?
> > 
> > Or perhaps move this check (and the one below) to the start of the
> > function? So you have:
> > 
> > 	if ( !cpu_has_vmx_vpid )
> > 		gdprintk ("%s: VPID required for PVH mode.\n",
> > __func__);
> > 
> > 	if ( !cpu_has_vmx_secondary_exec_control )
> > 		.. bla bla
> > 
> > 
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    v->arch.hvm_vmx.secondary_exec_control =
> > > vmx_secondary_exec_control; +
> > > +    if ( cpu_has_vmx_secondary_exec_control ) {
> > > +        v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn
> > > off all */
> > > +        v->arch.hvm_vmx.secondary_exec_control |= 
> > > +
> > > SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> > > +        v->arch.hvm_vmx.secondary_exec_control |=
> > > SECONDARY_EXEC_ENABLE_VPID; +
> > > +        v->arch.hvm_vmx.secondary_exec_control |=
> > > SECONDARY_EXEC_ENABLE_EPT;
> > > +        __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> > > +                  v->arch.hvm_vmx.secondary_exec_control);
> > > +    } else {
> > > +        printk("PVH: NO Secondary Exec control\n");
> > > +        return -EINVAL;
> > 
> > Ditto - should you de-allocate msr_bitmap ? Or if you are going to
> > move the check for cpu_has_vmx_secondary_exec_control, then there is
> > no need for this if (.. ) else ..
> > 
> > 
> > > +    }
> > > +
> > > +    __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
> > > +
> > > +    #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4
> > > +    #define VM_ENTRY_LOAD_EFER 0x8000
> > > +    vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS;
> > > +    vmentry_ctl &= ~VM_ENTRY_LOAD_EFER;
> > > +    vmentry_ctl &= ~VM_ENTRY_SMM;
> > > +    vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
> > > +    vmentry_ctl |= VM_ENTRY_IA32E_MODE;
> > > +    __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
> > > +
> > 
> > From here on, it looks mostly the same as construct_vmcs right?
> > 
> > Perhaps you can add a comment saying so - so when a cleanup effort
> > is done later on - these can be candidates for it?
> > 
> > > +    /* MSR intercepts. */
> > > +    __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
> > > +    __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
> > > +    __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
> > > +
> > > +    /* Host data selectors. */
> > > +    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> > > +    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> > > +    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> > > +    __vmwrite(HOST_FS_SELECTOR, 0);
> > > +    __vmwrite(HOST_GS_SELECTOR, 0);
> > > +    __vmwrite(HOST_FS_BASE, 0);
> > > +    __vmwrite(HOST_GS_BASE, 0);
> > > +
> > > +    vmx_set_host_env(v);
> > > +
> > > +    /* Host control registers. */
> > > +    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> > > +    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> > > +    __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ?
> > > X86_CR4_OSXSAVE : 0));
> > 
> > That formatting looks odd. 
> 
> Copied from hvm code. whats wrong?

The HVM code has some extra spaces.

> 
> > > +    /* Set default guest context values here. Some of these are
> > > then overwritten
> > > +     * in vmx_pvh_set_vcpu_info() by guest itself during vcpu
> > > bringup */
> > > +    __vmwrite(GUEST_CS_BASE, 0);
> > > +    __vmwrite(GUEST_CS_LIMIT, ~0u);
> > > +    __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */
> > > +    __vmwrite(GUEST_CS_SELECTOR, 0x10);
> > 
> > 0x10. Could you use a #define for it? Somehow I thought it would
> > be running in FLAT_KERNEL_CS but that would be odd. And of course
> > since are booting in the Linux kernel without the PV MMU we would
> > be using its native segments. So this would correspond to
> > GDT_ENTRY_KERNEL_CS right? Might want to mention that
> > so if the Linux kernel alters its GDT page we don't blow up?
> > 
> > Thought I guess it does not matter - this is just the initial
> > bootstrap segments. Presumarily the load_gdt in the Linux kernel
> > later on resets it to whatever the "new" GDT is.
> 
> Correct:
> #define __KERNEL_CS     (GDT_ENTRY_KERNEL_CS*8)
> 
> And load_gdt loads a new GDT natively.
> 
> 
> > > +    __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0);
> > > +    __vmwrite(GUEST_DR7, 0);
> > > +    __vmwrite(VMCS_LINK_POINTER, ~0UL);
> > > +
> > > +    __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
> > > +    __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);
> > 
> > Weird. In the vmcs.c file these are somewhat higher in the code.
> 
> Yes. I just didn't copy the existing function, but created PVH function
> to make it easier for PVH.

Sure, but the next step (not these patches) will be to collapse some
of the redundant logic.

> 
> > > +
> > > +    v->arch.hvm_vmx.exception_bitmap = 
> > > +                                   HVM_TRAP_MASK     | (1 <<
> > > TRAP_debug) | 
> > > +                                   (1U << TRAP_int3) | (1U <<
> > > TRAP_no_device);
> > 
> > Odd syntax.
> 
> Similar to existing hvm code, whats wrong?

Tabs. The HVM looks different (I think it uses spaces?)
> 
> 
> > > +    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
> > > +
> > > +    __vmwrite(TSC_OFFSET, 0);
> > 
> > Hm, so you did earlier:
> > 
> > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > 
> > so is this neccessary? Or is just that you want it to be set
> > to default baseline state?
> 
> Not necessary, doesn't hurt either. I can remove it.
> 
> > > +
> > > +    /* Set WP bit so rdonly pages are not written from CPL 0 */
> > > +    tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP;
> > > +    __vmwrite(GUEST_CR0, tmpval);
> > > +    __vmwrite(CR0_READ_SHADOW, tmpval);
> > > +    v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] =
> > > tmpval; +
> > > +    tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> > > +    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
> > > +    if ( (tmpval & required) != required )
> > > +    {
> > > +        printk("PVH: required CR4 features not available:%lx\n",
> > > required);
> > > +        return -EINVAL;
> > 
> > You might want to move that to the top of the code. Or if you want
> > it here, then at least free the msr_bitmap
> 
> I think I'll just move all the checks top of the code.
> 
> > >  {
> > >      struct domain *d = v->domain;
> > > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v)
> > >  
> > >      vmx_vmcs_enter(v);
> > >  
> > > +    if ( is_pvh_vcpu(v) ) {
> > > +        int rc = pvh_construct_vmcs(v);
> > > +        vmx_vmcs_exit(v);
> > 
> > Do you need to call paging_update_paging_modes as construct_vmcs()
> > does?
> 
> Nop. We don't need to as the arch_set_info_guest() does it for PVH.

Ok, could you provide a comment please? That way when one looks at this
code and the HVM one it will be clear.
> 
> 
> Thanks Konrad.

Of course. Thank you for bearing with this tiring review process.
> Mukesh
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

next prev parent reply	other threads:[~2013-03-19 13:23 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-16  0:39 [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization Mukesh Rathor
2013-03-18 12:03 ` Jan Beulich
2013-03-18 15:28 ` Konrad Rzeszutek Wilk
2013-03-19  1:00   ` Mukesh Rathor
2013-03-19  9:19     ` Jan Beulich
2013-03-19 13:23     ` Konrad Rzeszutek Wilk [this message]
2013-03-26 22:30       ` Mukesh Rathor

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130319132330.GD2706@phenom.dumpdata.com \
    --to=konrad.wilk@oracle.com \
    --cc=Xen-devel@lists.xensource.com \
    --cc=mukesh.rathor@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.