From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Ky Srinivasan" Subject: Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server Date: Sun, 13 Apr 2008 13:04:37 -0600 Message-ID: <48DF4585.E57C.0030.0@novell.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Return-path: Content-Disposition: inline List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Steven Smith Cc: xen-devel@lists.xensource.com, Tim Deegan , Keir Fraser List-Id: xen-devel@lists.xenproject.org Steven, Thanks for the detailed review. I will address the issues you have raised = as part of my next cleanup of these patches. I have responded to your = comments in-line. Regards, K. Y >>> Steven Smith 04/10/08 7:23 AM >>>=20 > I have a couple of comments on the new patches: > Would you mind generating diffs with -p, please? It makes them a fair > bit easier to read. Will do. > =20 > static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs) > { > - unsigned int eax, ebx, ecx, edx, inst_len; > + unsigned int eax, ebx, ecx, edx, inst_len, input; > =20 > inst_len =3D __get_instruction_length(current, INSTR_CPUID, NULL); > if ( inst_len =3D=3D 0 )=20 > ... > @@ -968,6 +970,7 @@ > regs->ecx =3D ecx; > regs->edx =3D edx; > =20 > + hyperx_intercept_do_cpuid(input, regs); > __update_guest_eip(regs, inst_len); > } > =20 > Would it be easier to put this bit in hvm_cpuid, or maybe > cpuid_hypervisor_leaves? That would avoid needing to make the same > change to both the VMX and SVM CPUID infrastructure. Good point. I will look into it. > @@ -984,6 +987,10 @@ > struct vcpu *v =3D current; > struct vmcb_struct *vmcb =3D v->arch.hvm_svm.vmcb; > =20 > + if (hyperx_intercept_do_msr_read(ecx, regs)) > + { > + goto done; > + } > switch ( ecx ) > { > case MSR_IA32_TSC: >=20 > Likewise, it seems like that could be done better in > rdmsr_hypervisor_regs(). Yep. > @@ -1085,6 +1092,10 @@ > msr_content =3D (u32)regs->eax | ((u64)regs->edx << 32); > =20 > hvmtrace_msr_write(v, ecx, msr_content); > + if (hyperx_intercept_do_msr_write(ecx, regs)) > + { > + goto done_msr_write; > + } > =20 > switch ( ecx ) > { > Or wrmsr_hypervisor_regs(). Will do. > @@ -2210,6 +2226,10 @@ > if ( a.value > HVMPTM_one_missed_tick_pending ) > goto param_fail; > break; > + case HVM_PARAM_EXTEND_HYPERVISOR: > +printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value); > + if ((a.value =3D=3D 1) && hyperv_initialize(d))=20 > + goto param_fail; > } > d->arch.hvm_domain.params[a.index] =3D a.value; > rc =3D 0; >=20 > It might be a good idea to fail with -EINVAL or something if > a.value > 1, so that if anyone ever introduces another hypervisor > extension it's easy for the tools to tell whether it's available or > not. Will do. > Index: xen-unstable.hg/xen/arch/x86/hvm/save.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c2008-04-04 14:28:26.00000= 0000 -0400 > +++ xen-unstable.hg/xen/arch/x86/hvm/save.c2008-04-04 14:28:44.000000000 = -0400 > @@ -23,6 +23,8 @@ > =20 > #include > #include > +#include > +#include > =20 > void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr) > { >=20 > The only change made to this file was to add some #include's. Why > were they necessary? This was necessary at some point in the past. This is the vestiges of a = partial cleanup! This will be fixed. > Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h2008-04-04= 14:28:26.000000000 -0400 > +++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h2008-04-04 = 14:28:44.000000000 -0400 > @@ -38,7 +38,7 @@ > uint32_t version; /* File format version */ > uint64_t changeset; /* Version of Xen that saved this file = */ > uint32_t cpuid; /* CPUID[0x01][%eax] on the saving = machine */ > - uint32_t pad0; > + uint32_t pad0;/* extension ID */ > }; > =20 > DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header); >=20 > I think it might be a good idea to give the field a more > descriptive name than ``pad0'' if it's being used for something > other than padding. This again is the result of a partial cleanup. I was stashing away the = extension ID and Tim rightly pointed out that we had enough state and did = not need the extension ID in the save record. This will be cleaned up. > It might be an even better idea to move the ID out of the header > entirely and put it in its own HVM_SAVE_TYPE. You already have > hvm_hyperv_dom; can you use that to identify the presence or > absence of the extensions? I will look into this. > --- /dev/null1970-01-01 00:00:00.000000000 +0000 > +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h2008-03-26 = 13:56:39.000000000 -0400 > @@ -0,0 +1,4 @@ ... > +/* > + * hv_errno.h > + * Error codes for the Novell Shim. >=20 >> These are just the Viridian error codes, aren't they, rather than >> something specific to the Novell shim? Yes. This will be fixed. > + * > + * Engineering Contact: K. Y. Srinivasan > + */ > + > +#ifndef HV_ERRNO_H > +#define HV_ERRNO_H > + > +#define HV_STATUS_SUCCESS0x0000 > +#define HV_STATUS_INVALID_HYPERCALL_CODE0x0002 > +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x0003 > +#define HV_STATUS_INVALID_ALIGNMENT0x0004 > +#define HV_STATUS_INVALID_PARAMETER0x0005 ... > +#endif=20 > Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /dev/null1970-01-01 00:00:00.000000000 +0000 > +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c2008-03-26 = 14:18:32.000000000 -0400 > @@ -0,0 +1,13 @@ ... > +void hv_collect_stats(int event, hv_vcpu_stats_t *statsp) > +{ > +switch (event) { > +case HV_CSWITCH: > +statsp->num_switches++; > +return; > +case HV_FLUSH_VA: > +statsp->num_flushes++; > +return; > +case HV_FLUSH_RANGE: > +statsp->num_flush_ranges++; > +return; > That looks a bit pointless. Would it not be easier to just inline > this whole function? Now that, I have dropped all the TLB flush enlightenments, I will clean = this up. ... > +} ... > +static int > +hv_get_vp_registers(paddr_t input, paddr_t output) > +{ > +hv_vcpu_t *vcpup, *targetp; > +hv_partition_t *curp =3D hv_get_current_partition(); > +get_vp_registers_input_t*inbuf; > +get_vp_registers_output_t*outbuf; > +struct vcpu_guest_context*vcpu_ctx; > +u32*reg_indexp; > +get_vp_registers_output_t*out_regp; > +u32num_output_bytes =3D 0; > + > + vcpup =3D &curp->vcpu_state[hv_get_current_vcpu_index()]; > +inbuf =3D vcpup->input_buffer; > +outbuf =3D vcpup->output_buffer; > +out_regp =3D outbuf; > +/* > + * Copy the input data to the per-cpu input buffer. > + * This may be an overkill; obviously it is better to only > + * copy what we need. XXXKYS: Check with Mike. > + */ > Who's Mike? What did he say? Mike is the MSFT contact.=20 > +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) { > +return (HV_STATUS_INVALID_ALIGNMENT); > +} ... > +/* > + * Now that we have the register state; select what we want and > + * populate the output buffer. > + */ > +reg_indexp =3D &inbuf->reg_index; > +while (*reg_indexp !=3D 0) { > +switch(*reg_indexp) { > +/* > + * XXXKYS: need mapping code here; populate > + * outbuf. > + */ > +panic("hv_get_vp_registers not supported\n"); > Yeah, that's not really good enough. It would be better to just > not support this hypercall at all than have an implementation which > always crashes Xen. Early on, I decided what Hypercalls I would support; and from this list I = wanted=20 to only support those that were used by windows 2008 server. I had put in = a bunch of panics to trap all the hypercalls that the guest would invoke. = This hypercall is not used. Thus the implementation is incomplete. I will = clean this up. > +} > +reg_indexp++; > +out_regp++ ;/*128 bit registers */ > +num_output_bytes +=3D16; > +if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) { > +/* > + *input list not reminated correctly; bail out. > + */ > +panic("hv_get_vp_registers:input list not terminated\n");=20 > +break; > It would be nice if this caused the hypercall to fail, rather than > crashing Xen. Will be cleaned up. > +} > +} > +if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) { > +/* Some problem copying data out*/ > +panic("hv_get_vp_registers:copyout problem\n"); =20 > And again. Will be cleaned up. > +} > +xfree(vcpu_ctx); > +return (HV_STATUS_SUCCESS); > +} > Has this hypercall had any testing at all? If it hasn't, then what > is it doing here? As I noted earlier, win2k8 does not currently use this API. I will clean = this up. > + > +static int > +hv_set_vp_registers(paddr_t input, paddr_t output) > Again, I don't believe this function has ever been called. That > suggests that this may not be a high-value optimisation suitable > for inclusion in an initial version. As I noted earlier, win2k8 does not currently use this API. I will clean = this up. ... >=20 > +static int > +hv_switch_va(paddr_t input) > +{ > +hv_partition_t *curp =3D hv_get_current_partition(); > + hv_vcpu_t *vcpup =3D &curp->vcpu_state[hv_get_current_vcpu_index= ()]; > + > +/* > + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of=20 > + * the page whose GPA is in the input register. However, it appears=20 > + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02) > + * passes the asID in the input register instead. Need to check if=20 > + * future builds do this. > + */ > +hvm_set_cr3(input);=20 > +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats); > +return (HV_STATUS_SUCCESS); > +} > Do you have any evidence that this is actually faster than just > doing the CR3 write in the guest? It's a win on real Viridian > because you avoid blowing the shadow page table cache, but I don't > think that applies on Xen's current shadow page table > implementation. As I noted earlier, initially I identified all the hypercalls that made = for a win2k8 guest on our platform. From this list I implemented those = that the guest actually invoked. I implemented all the hypercalls that the = guest was invoking without regard to the performance value of the = enlightenment. I agree with you that this may be more important on = Veridian than on our platform.=20 > +static int > +hv_flush_va(paddr_t input) > +{ ... > +/* > + * Now operate on what we are given > + * XXXKYS: For now we are ignoring as_id and fushGlobal flag. > + * May have to revisit this. But first stash away the processed=20 > + * parameters for subsequent use. > + */ > +flush_argp->as_handle =3D as_id; > +flush_argp->flags =3D flush_global; > +flush_argp->v_mask =3D vcpu_mask; > + > + > +ret_val =3D hv_build_hcall_retval(HV_STATUS_SUCCESS, 0); > +hv_set_syscall_retval(guest_cpu_user_regs(), > + curp->long_mode_guest, > + ret_val); > +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats); > +panic("hv_flush_va not supported\n");=20 > So if this hypercall is ever correctly used by the guest, Xen will > crash? In the original patches that I posted, this hypercall was implemented. As = part of my first round of cleanup, I removed the support for the TLB = flush hypercalls. As part of this cleanup, I also disabled TLB related = enlightenments. This will be cleaned up. > +return (HV_STATUS_SUCCESS); > +} > + > +static int > +hv_flush_va_range(paddr_t input, unsigned short start_index,=20 > +unsigned short rep_count, unsigned short *reps_done) > +{ ... > +ret_val =3D hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count); > +hv_set_syscall_retval(guest_cpu_user_regs(), > + curp->long_mode_guest, > + ret_val); > + > + > +HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats); > +panic("hv_flush_vaRange not supported\n"); =20 > And again. Will be cleaned up. > +return (HV_STATUS_SUCCESS); > +} > + > +void > +hv_handle_hypercall(u64 opcode, u64 input, u64 output,=20 > + u64 *ret_val) > Would it be easier to just return the return value? Perhaps. > +{ ... > +verb =3D (short)(opcode & 0xffff); > +rep_count =3D (short)((opcode >>32) & 0xfff); > +start_index =3D (short)((opcode >> 48) & 0xfff); > +switch (verb) { > +case HV_CREATE_PARTITION: > +/* > + * Xen only allows dom0 to create domains. > + */ > +*ret_val =3D hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0); > +return; > +case HV_INITIALIZE_PARTITION: > +/* > + * We don't support this. > + */ > +*ret_val =3D hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0); > +return; > Would it be easier to just make HV_STATUS_ACCESS_DENIED the > default? Actually, my reading of the spec is that you should > return HV_STATUS_INVALID_HYPERCALL_CODE if the guest makes a > hypercall which you don't support, but I can't imagine it makes a > great deal of difference. Perhaps not.=20 ... > +case HV_GET_PARTITION_ID: > +if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) { > +*ret_val =3D=20 > +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0); > +return; > +} > +partition_id =3D (u64)current->domain->domain_id; > +if (hvm_copy_to_guest_phys(output,=20 > +&partition_id, 8)) { > +/* > + * Invalid output area. > + */ > +*ret_val =3D=20 > +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0); > I think you mean HV_STATUS_INVALID_ALIGNMENT. Agreed. > +return; > +} > +*ret_val =3D hv_build_hcall_retval(HV_STATUS_SUCCESS, 0); > +return; >=20 > +case HV_GET_VP_REGISTERS: > +*ret_val =3D hv_build_hcall_retval( > +hv_get_vp_registers(input, output), 0); > +return; > +case HV_SET_VP_REGISTERS: > +*ret_val =3D hv_build_hcall_retval( > +hv_set_vp_registers(input, output), 0); ... > +case HV_FLUSH_VA: > +*ret_val =3D=20 > +hv_build_hcall_retval(hv_flush_va(input), 0); > +return; > +case HV_FLUSH_VA_LIST: > +value =3D hv_flush_va_range(input, start_index,=20 > +rep_count, &reps_done); > +*ret_val =3D hv_build_hcall_retval(value, reps_done); =20 > +return; > As mentioned above, your implementations of these hypercalls are > really quite broken. These hypercalls are not invoked by the guest and that is the reason they = are broken. I will clean it up though. ... >=20 > +} > +} > Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /dev/null1970-01-01 00:00:00.000000000 +0000 > +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h2008-03-26 = 13:56:39.000000000 -0400 > @@ -0,0 +1,21 @@ > + > +/* > + * nshypercall.h > + * Memory layouts for the various hypercalls supported.=20 > + * > + * Engineering Contact: K. Y. Srinivasan > + */ > + > +#ifndef HV_HYPERCALL_H > +#define HV_HYPERCALL_H > + > +#include > + > + > +typedef struct get_vp_registers_input { > +u64partition_id; > +u64vp_index; > +u32reg_index; > +u32pad1; > +u64pad2; > +} get_vp_registers_input_t; > My reading of the 0.83 specification is that this should have been: > typedef struct get_vp_registers_input { > u64 partition_id; > u32 vp_index; > u32 pad; > u32 reg_index[0]; > } get_vp_registers_input_t; > Is my copy of the specification out of date? Nope.=20 > +typedef struct set_vp_register_spec { > +u32reg_name; > +u32pad; > +u64pad1; > +u64low_value; > +u64high_value; > +} set_vp_register_spec_t; > + > +typedef struct set_vp_registers_input { > +u64partition_id; > +u64vp_index; > +set_vp_register_spec_treg_spec; > +} set_vp_registers_input_t; > Again, the 0.83 spec says that vp_index should be a u32, and should > be followed by a u32 MBZ pad. Right. > + > + > +typedef struct flush_va { > +u64as_handle; > +u64flags; > +union { > +u64processor_mask; > +cpumask_t vcpu_mask; > +} proc_mask; > +#define p_mask proc_mask.processor_mask > +#define v_maskproc_mask.vcpu_mask > +u64gva; > +} flush_va_t; > Is there something wrong with anonymous unions? > +/* > + * Hypercall verbs. > + */ > + > +#define HV_CREATE_PARTITION 0x0010 > +#define HV_INITIALIZE_PARTITION 0x0011 > +#define HV_DELETE_PARTITION0x0014 > Is it really necesary to include #define's for hypercalls which we > don't support and will probably never support? No; I will clean this file up. > Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /dev/null1970-01-01 00:00:00.000000000 +0000 > +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c2008-03-26 = 14:26:38.000000000 -0400 > @@ -0,0 +1,0 @@ >=20 > +/* > + * Local includes; extension specific. > + */ > +#include "hv_errno.h" > +#include "hv_shim.h" > + > + > +/* > + * Implement the Hyperv Shim. > + */ > + > +extern struct cpuinfo_x86 boot_cpu_data; > Is there something wrong with processor.h? This will be fixed. > +extern struct hvm_mmio_handler vlapic_mmio_handler; > We should probably move that to a header file somewhere. Will do. > +static inline void * > +get_virt_from_gmfn(struct domain *d, unsigned long gmfn) > +{ > +unsigned long mfn =3D gmfn_to_mfn(d, gmfn); > +if (mfn =3D=3D INVALID_MFN) { > +return (NULL); > +} > +return (map_domain_page_global(mfn)); > +} > Is it really necessary to use map_domain_page_global() here? > map_domain_page() is usually quite a bit faster. To be honest, I'm > not convinced that this wrapper function has any real value anyway. I will look into this. > +static inline unsigned long > +get_mfn_from_gva(unsigned long va) > +{ > +uint32_t pfec =3D PFEC_page_present; > +unsigned long gfn; > +gfn =3D paging_gva_to_gfn(current, va, &pfec); > +return (gmfn_to_mfn((current->domain), gfn)); > +} > That's only valid if you can guarantee that the resulting MFN will > never be written to. Yes. I use this to figure out if the hypercall is from the PV drivers in = the guest or the hyparcall from the guest. > +static inline void=20 > +hv_write_hypercall_msr(hv_partition_t *curp, > + hv_vcpu_t*cur_vcpu, > + u64msr_content) > +{ ... > +hv_hypercall_page_initialize(hypercall_page, curp); > +curp->hypercall_mfn =3D gmfn_to_mfn(d, gmfn); > +#ifdef CONFIG_DOMAIN_PAGE > +unmap_domain_page_global(hypercall_page); > +#endif > unmap_domain_page_global() is #define'd to nothing #ifndef > CONFIG_DOMAIN_PAGE, so you don't need the #ifdef. Yes. > +curp->hypercall_msr =3D msr_content; > +spin_unlock(&curp->lock); > +cur_vcpu->flags |=3D HV_VCPU_UP; > +} > + > + > +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp, > + hv_vcpu_t*cur_vcpu, > + u64msr_content) > +{ ... > +sx_page =3D get_virt_from_gmfn(d, gmfn); > +if (sx_page =3D=3D NULL) { > +/* > + * The guest specified a bogus GPA; inject a GP fault > + * into the guest. > + */ > +hv_inject_exception(TRAP_gp_fault); > The spec says tha SIEFP can be outside the physical address space, > and you're just supposed to deal with it. The guest will be unable > to access the frame, but the MSR write is supposed to succeed > (section 14.6.3 of version 0.83). Likewise the SIMP (14.6.4). You are right. As it turns out, the win2k8 guest on our platform never = runs into this situation. I will fix this. > +return; > +} > +switch (idx) { > +case HV_MSR_SIEFP: > +hv_init_event_page(sx_page); > +cur_vcpu->siefp_msr =3D msr_content;=20 > +cur_vcpu->sief_page =3D sx_page;=20 > +break; > +case HV_MSR_SIMP: > +hv_init_message_page(sx_page); > +cur_vcpu->simp_msr =3D msr_content; > +cur_vcpu->sim_page =3D sx_page; > +break; > +} > + > +} > + > + > +/* > + * Time this domain booted. > + */ > +s_time_t hv_domain_boot_time; > Time which domain booted? What if you have two domains which both > use the HyperV extensions? Oops! Will fix this. > +static inline void > +hv_inject_exception(int trap) > +{ > +hvm_funcs.inject_exception(trap, 0, 0); > +} > hvm_inject_exception() might be a better choice. Ok. > +static inline void=20 > +hv_set_partition_privileges(hv_partition_t *hvpp) > +{ > +/* > + * This is based on the hypervisor spec under section 5.2.3.=20 > + */ > +hvpp->privileges =3D 0x000000020000007f; > +} > So you allow AccessVpRuntime, AccessTimeCounters, AccessSynicMsrs, > AccessSyntheticTimers, AccessApicMsrs, AccessHypercallMsrs, > AccessVpIndex, and AccessSelfPartitionId? That sounds like a > pretty reasonable set, but it'd be nice if you'd documented the > fact rather than just dropping in a magic number. Something like > this, maybe: > hvpp->privileges =3D HV_PRIV_ACCESS_VP_RUNTIME | > HV_PRIV_ACCESS_TIME_COUNTERS | > HV_PRIV_ACCESS_SYNIC_MSRS | ... Ok. > +static inline u32 > +hv_get_recommendations(void) > +{ > +/* > + *For now we recommend all the features. Need to validate. > + */ > +if ( paging_mode_hap(current->domain)) { > +/* > + * If HAP is enabled; the guest should not use TLB flush > + * related enlightenments. > + */ > +return (0x19); > +} else { > +/* > + * For now disable TLB flush enlightenments. > + */ > +return (0x19);=20 > +} > +} > So you recommend the use of hypercalls for address switches, MSRs > for APIC registers, and MSRs for rebooting? The first two make > sense, but I'm not so sure that enlightened reboot is a useful > optimisation. Originally, I had implemented TLB flush enlightenments as well. Based on = some of the performance analysis we are currently doing, I may re-introduce= them. As I noted earlier, I just wanted to implement all the features = that made sense for the guest on our platform - hence I implemented reboot = as well! > Again, it would have been useful if the actual set of features was > documented rather than just using a magic number. Yep. > +static inline void=20 > +hv_set_partition_features(hv_partition_t *hvpp) > +{ > +hvpp->supported_features =3D 0x1f; > +} >i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS | > APIC_MSRS | HYPERCALL_MSRS. > It's kind of surprising that you recommend guests use the reboot MSRs, > but then don't claim to support it. Was that deliberate? > Actually, looking some more, it doesn't look like you ever use the > supported_features field, and have instead hardcoded the value to > 0xff everywhere you need it. Will fix this. > + > +static inline u16=20 > +hv_get_guest_major(void) > +{ > +//KYS: Check! > +return (0); > +} > +static inline u16 > +hv_get_guest_minor(void) > +{ > +//KYS: Check! > +return (0); > +} > +static inline u32 > +hv_get_guest_service_pack(void) > +{ > +//KYS: Check! > +return (0); > +} > Err... are these supposed to return the guest major/minor/sp > reported in the identification MSR? Yep. Was not quite sure what numbers made sense! > +static inline void > +hv_read_icr(u64 *icr_content) ... > +static inline void > +hv_read_tpr(u64 *tpr_content) ... > +static inline void > +hv_write_eoi(u64 msr_content) ... > +static inline void > +hv_write_icr(u64 msr_content) ... > +static inline void > +hv_write_tpr(u64 msr_content) ... > These all have really weird implementations and lots of gratuitous > use of hardcoded magic numbers (have a look at apicdef.h). Will fix it. > +static void > +hv_timeout_handler(void *arg) > +{ ... > +/* > + * First post the message and then optionally deal with the=20 > + * interrupt notification. > + */ > +if (cur_vcpu->sim_page =3D=3D NULL) { > +panic("Novell Shim: Sim page not setup\n"); > +} > So if a guest requests a timeout before they've set up the sim > page, Xen crashes? What if they clear the sim page after setting > up a timeout, but before it fires? I will fix this. As I noted earlier, win2k8 currently does not use any of = these features. ... > +} > + > +int > +hyperv_do_hypercall(struct cpu_user_regs *pregs) > +{ > +hv_partition_t*curp =3D hv_get_current_partition(); > +hv_vcpu_t *vcpup; > +intlong_mode_guest =3D curp->long_mode_guest; > +unsigned long hypercall_mfn; > +unsigned long gmfn; > +gmfn =3D (curp->hypercall_msr >> 12); > + > +hypercall_mfn =3D get_mfn_from_gva(pregs->eip); > That's exciting. Do you expect that guests will make hypercalls > from anywhere other than the Viridian hypercall page? We support PV drivers for win2k8 guests. Given that both HyperV and Xen = use the same hypercall numbers to implement different functionality, I = needed a way to differentiate HyperV hypercalls. I use the mfn of the = hypercall page to figure out who should handle the hypercall. ... > +int > +hyperv_vcpu_initialize(struct vcpu *v) > +{ ... > +/* > + * Setup the input page for handling hypercalls. > + * > + */ > +vcpup->input_buffer_page =3D=20 > +alloc_domheap_page(NULL); ... > +vcpup->input_buffer =3D > +get_virt_from_page_ptr(vcpup->input_buffer_page); ... > +vcpup->output_buffer_page =3D=20 > +alloc_domheap_page(NULL); ... > +vcpup->output_buffer =3D > +get_virt_from_page_ptr(vcpup->output_buffer_page); > What exactly are these used for? The only place I can see it > referenced are immediately before a panic(), which seems a bit > pointless. These were used in implementing TLB flush enlightenments. I got rid of = those hypercalls in the current patch set. I will clean these up. > +vcpup->xen_vcpu =3D v;=20 > + > +return (0); > +} > + > + > +static int=20 > +hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h) > +{ > +struct hvm_hyperv_dom ctxt; > +hv_partition_t*curp; > + > +if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) !=3D 0 ) > + return -22; > +d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] =3D ctxt.ext_id; > +if ((ctxt.ext_id =3D=3D 0) || (ctxt.ext_id > 1)) { > +return 0; > +} > +if (hyperv_initialize(d)) { > +return -22; > +} > ???? Did you mean -EINVAL there? Yep. > +curp =3D d->arch.hvm_domain.hyperv_handle; > + > +curp->guest_id_msr =3D ctxt.guestid_msr; > +curp->hypercall_msr =3D ctxt.hypercall_msr; > +curp->long_mode_guest =3D ctxt.long_mode; > +curp->hypercall_mfn =3D > +gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12)); > + > +return 0;=20 > +} > + > +static int > +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs = *regs) > +{ > +uint32_t idx; > +struct domain*d =3D current->domain; > +intextid =3D d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR]; > + > +if (extid =3D=3D 1) { > +/* > + * Enlightened Windows guest; need to remap and handle=20 > + * leaves used by PV front-end drivers. > + */ > +if ((input >=3D 0x40000000) && (input <=3D 0x40000005)) { > +return (0); > +} > +/* > + * PV drivers use cpuid to query the hypervisor for details. On > + * Windows we will use the following leaves for this: > + * > + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux) > + * 4097: VMM Version (corresponds to 0x40000001 on Linux) > + * 4098: Hypercall details (corresponds to 0x40000002 on Linux) > + */ > I think it would be better to just unconditionally duplicate the > Xen leaves at 0x40001000, regardless of whether the viridian > enlightenments are enabled. That would make it much easier to > write PV drivers which work regardless of whether the > enlightenments are enabled, and if we can do that then it might (in > a few years' time) be possible to default to enlightments-on for > all domains. That isn't really an option when there's no common > Xen API. > This would also mean that you don't need to duplicate the whole of > hypervisor_cpuid_leaves() in the Viridian implementation. I will look into cleaning this up. ... > +} > + > +int=20 > +hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs) > +{ > +uint32_t idx; > + > +/* > + * hvmloader uses cpuid to set up a hypercall page; we don't want to > + * intercept calls coming from the bootstrap (bios) code in the HVM=20 > + * guest; we discriminate based on the instruction pointer. > + */ > +if (hv_call_from_bios(regs)) { > +/* > + * We don't intercept this. > + */ > +return (0); > +} > + > +if (input =3D=3D 0x00000001) {=20 > +regs->ecx =3D (regs->ecx | 0x80000000); > +return (1); > +}=20 > I think we should be doing this even when the shim is disabled. > That bit is supposed to indicate ``running on a VMM'', not > ``running on a Viridian-compatible VMM'', and it has the side > effect of making Windows much more tolerant of vcpus getting > scheduled at different rates. Good point. > +if (hv_preprocess_cpuid_leaves(input, regs)) { > +return (0); > +} > +idx =3D (input - 0x40000000); > + > +switch (idx) { > +case 0: > +/* > + * 0x40000000: Hypervisor identification.=20 > + */ > +regs->eax =3D 0x40000005; /* For now clamp this */ > +regs->ebx =3D 0x65766f4e; /* "Nove" */=20 > +regs->ecx =3D 0x68536c6c; /* "llSh" */ > +regs->edx =3D 0x76486d69; /* "imHv" */=20 > +break; > I think this is supposed to identify the hypervisor (Xen), rather > than the Viridian implementation (Novell shim). You are right. ... > +case 5: > +/* > + * 0x40000005: Implementation limits. > + * Currently we retrieve maximum number of vcpus and=20 > + * logical processors (hardware threads) supported. > + */ > +regs->eax =3D hv_get_max_vcpus_supported(); >=20 > +regs->ebx =3D hv_get_max_lcpus_supported(); > a) ebx is marked as reserved in the specification, and > b) it doesn't make a great deal of sense to expose the maximum > number of physical CPUs to a virtualised guest anyway. I thought this is what the spec mandated. I will look into this. > +regs->ecx =3D 0; /* Reserved */ > +regs->edx =3D 0; /* Reserved */ > +break;=20 > + > +default: > +/* > + * We don't handle this leaf. > + */ > +return (0); > + > +} > +return (1); > +} > + > +int > +hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs) > +{ ... > +if (extid > 1) { > +/* > + * For now this is all other "Enlightened" operating systems > + * other than Longhorn. > + */ > +if (idx =3D=3D 0x40000000) { > +/* > + * PV driver hypercall setup. Let xen handle this. > + */ > +return (0); > +} > +if (idx =3D=3D 0x40001000) { > +idx =3D 0x40000000; > +} > Eh? What's this doing? At one point, I was planning to support other enlightened operating = systems. I will clean it up. > +} > +switch (idx) { ... > +case HV_MSR_TIMER0_COUNT: > +timer =3D 0; > +goto process_timer_count_read; > +case HV_MSR_TIMER1_COUNT: > +timer =3D 1; > +goto process_timer_count_read; > +case HV_MSR_TIMER2_COUNT: > +timer =3D 2; > +goto process_timer_count_read; > +case HV_MSR_TIMER3_COUNT: > +timer =3D 3; > +process_timer_count_read: > How much does timer support actually buy you? It's pretty complicated > all by itself, and it also seems to be the only reason you need SYNIC > support. The feature is not used currently by the win2k8 guest. I may choose to get = rid of this body of code. > +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) { > +goto msr_read_error; > +} ... > +} > + > +int > +hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs) > +{ ... > +switch (idx) { ... > +case HV_MSR_SEOM: > +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) { > +goto msr_write_error; > +} > +cur_vcpu->eom_msr =3D msr_content;=20 > +hv_process_message_q(curp, cur_vcpu); > +break; > You don't support queued messages, so there's not much point in > supporting the SEOM MSR either. Also, eom_msr never seems to get > read, except for saving it when you do a suspend. Right. > +case HV_MSR_TIMER3_CONFIG: > +timer =3D 3; > +process_timer_config: > +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) { > +goto msr_write_error; > +} > +/* > + * Assume that the client is going to write the whole msr.=20 > + */ > +if (!(msr_content & 0x9)) { > +/* > + * We are neither setting Auto Enable or Enable;=20 > + * silently exit. > + * Should this be considered to turn off a=20 > + * timer that may be currently=20 > + * active; XXXKYS: Check. For now we are=20 > + * not doing anything here. > + */ > +break; > +} > +if (!(((u32)(msr_content >> 16)) & 0x0000000f)) { > +/* > + * sintx is 0; clear the enable bit(s). > + */ > +msr_content &=3D ~(0x1); > +} > Again, some #define's would make these magic numbers a bit more > clear. I will fix the code. Regards, K. Y > Steven.