From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Ky Srinivasan" <ksrinivasan@novell.com>
Subject: Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
Date: Sun, 13 Apr 2008 13:04:37 -0600
Message-ID: <48DF4585.E57C.0030.0@novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
Content-Disposition: inline
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Steven Smith <steven.smith@citrix.com>
Cc: xen-devel@lists.xensource.com, Tim Deegan <Tim.Deegan@citrix.com>, Keir Fraser <keir.fraser@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

Steven,

Thanks for the detailed review. I will address the issues you have raised =
as part of my next cleanup of these patches. I have responded to your =
comments in-line.

Regards,

K. Y

>>> Steven Smith <steven.smith@citrix.com> 04/10/08 7:23 AM >>>=20
> I have a couple of comments on the new patches:


> Would you mind generating diffs with -p, please?  It makes them a fair
> bit easier to read.

Will do.

> =20
>  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
>  {
> -    unsigned int eax, ebx, ecx, edx, inst_len;
> +    unsigned int eax, ebx, ecx, edx, inst_len, input;
> =20
>      inst_len =3D __get_instruction_length(current, INSTR_CPUID, NULL);
>      if ( inst_len =3D=3D 0 )=20
> ...
> @@ -968,6 +970,7 @@
>      regs->ecx =3D ecx;
>      regs->edx =3D edx;
> =20
> +    hyperx_intercept_do_cpuid(input, regs);
>      __update_guest_eip(regs, inst_len);
>  }
>
 =20
> Would it be easier to put this bit in hvm_cpuid, or maybe
> cpuid_hypervisor_leaves?  That would avoid needing to make the same
> change to both the VMX and SVM CPUID infrastructure.

Good point. I will look into it.


> @@ -984,6 +987,10 @@
>      struct vcpu *v =3D current;
>      struct vmcb_struct *vmcb =3D v->arch.hvm_svm.vmcb;
> =20
> +    if (hyperx_intercept_do_msr_read(ecx, regs))
> +    {
> +            goto done;
> +    }
>      switch ( ecx )
>      {
>      case MSR_IA32_TSC:
>=20
> Likewise, it seems like that could be done better in
> rdmsr_hypervisor_regs().

Yep.

> @@ -1085,6 +1092,10 @@
>      msr_content =3D (u32)regs->eax | ((u64)regs->edx << 32);
> =20
>      hvmtrace_msr_write(v, ecx, msr_content);
> +    if (hyperx_intercept_do_msr_write(ecx, regs))
> +    {
> +            goto done_msr_write;
> +    }
> =20
>      switch ( ecx )
>      {

> Or wrmsr_hypervisor_regs().

Will do.


> @@ -2210,6 +2226,10 @@
>                  if ( a.value > HVMPTM_one_missed_tick_pending )
>                      goto param_fail;
>                  break;
> +            case HVM_PARAM_EXTEND_HYPERVISOR:
> +printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value);
> +                if ((a.value =3D=3D 1) && hyperv_initialize(d))=20
> +                    goto param_fail;
>              }
>              d->arch.hvm_domain.params[a.index] =3D a.value;
>              rc =3D 0;
>=20

> It might be a good idea to fail with -EINVAL or something if
> a.value > 1, so that if anyone ever introduces another hypervisor
> extension it's easy for the tools to tell whether it's available or
> not.

Will do.


> Index: xen-unstable.hg/xen/arch/x86/hvm/save.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c2008-04-04 14:28:26.00000=
0000 -0400
> +++ xen-unstable.hg/xen/arch/x86/hvm/save.c2008-04-04 14:28:44.000000000 =
-0400
> @@ -23,6 +23,8 @@
> =20
>  #include <asm/hvm/support.h>
>  #include <public/hvm/save.h>
> +#include <public/hvm/params.h>
> +#include <asm/hvm/hvm_extensions.h>
> =20
>  void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr)
>  {
>=20

> The only change made to this file was to add some #include's.  Why
> were they necessary?

This was necessary at some point in the past. This is the vestiges of a =
partial cleanup! This will be fixed.


> Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h2008-04-04=
 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h2008-04-04 =
14:28:44.000000000 -0400
> @@ -38,7 +38,7 @@
>      uint32_t version;           /* File format version */
>      uint64_t changeset;         /* Version of Xen that saved this file =
*/
>      uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving =
machine */
> -    uint32_t pad0;
> +    uint32_t pad0;/* extension ID */
>  };
> =20
>  DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
>=20
> I think it might be a good idea to give the field a more
> descriptive name than ``pad0'' if it's being used for something
> other than padding.

This again is the result of a partial cleanup. I was stashing away the =
extension ID and Tim rightly pointed out that we had enough state and did =
not need the extension ID in the save record. This will be cleaned up.

> It might be an even better idea to move the ID out of the header
> entirely and put it in its own HVM_SAVE_TYPE.  You already have
> hvm_hyperv_dom; can you use that to identify the presence or
> absence of the extensions?

I will look into this.


> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h2008-03-26 =
13:56:39.000000000 -0400
> @@ -0,0 +1,4 @@
...
> +/*
> + * hv_errno.h
> + * Error codes for the  Novell Shim.
>=20
>> These are just the Viridian error codes, aren't they, rather than
>> something specific to the Novell shim?

Yes. This will be fixed.


> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_ERRNO_H
> +#define HV_ERRNO_H
> +
> +#define HV_STATUS_SUCCESS0x0000
> +#define HV_STATUS_INVALID_HYPERCALL_CODE0x0002
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x0003
> +#define HV_STATUS_INVALID_ALIGNMENT0x0004
> +#define HV_STATUS_INVALID_PARAMETER0x0005
...
> +#endif=20
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c2008-03-26 =
14:18:32.000000000 -0400
> @@ -0,0 +1,13 @@
...
> +void hv_collect_stats(int event, hv_vcpu_stats_t *statsp)
> +{
> +switch (event) {
> +case HV_CSWITCH:
> +statsp->num_switches++;
> +return;
> +case HV_FLUSH_VA:
> +statsp->num_flushes++;
> +return;
> +case HV_FLUSH_RANGE:
> +statsp->num_flush_ranges++;
> +return;
> That looks a bit pointless.  Would it not be easier to just inline
> this whole function?

Now that, I have dropped all the TLB flush enlightenments, I will clean =
this up.


...
> +}
...
> +static int
> +hv_get_vp_registers(paddr_t input, paddr_t output)
> +{
> +hv_vcpu_t        *vcpup, *targetp;
> +hv_partition_t   *curp =3D hv_get_current_partition();
> +get_vp_registers_input_t*inbuf;
> +get_vp_registers_output_t*outbuf;
> +struct vcpu_guest_context*vcpu_ctx;
> +u32*reg_indexp;
> +get_vp_registers_output_t*out_regp;
> +u32num_output_bytes =3D 0;
> +
> +        vcpup =3D &curp->vcpu_state[hv_get_current_vcpu_index()];
> +inbuf =3D vcpup->input_buffer;
> +outbuf =3D vcpup->output_buffer;
> +out_regp =3D outbuf;
> +/*
> + * Copy the input data to the per-cpu input buffer.
> + * This may be an overkill; obviously it is better to only
> + * copy what we need. XXXKYS: Check with Mike.
> + */
> Who's Mike?  What did he say?

Mike is  the MSFT contact.=20


> +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> +return (HV_STATUS_INVALID_ALIGNMENT);
> +}
...
> +/*
> + * Now that we have the register state; select what we want and
> + * populate the output buffer.
> + */
> +reg_indexp =3D &inbuf->reg_index;
> +while (*reg_indexp !=3D 0) {
> +switch(*reg_indexp) {
> +/*
> + * XXXKYS: need mapping code here; populate
> + * outbuf.
> + */
> +panic("hv_get_vp_registers not supported\n");

> Yeah, that's not really good enough.  It would be better to just
> not support this hypercall at all than have an implementation which
> always crashes Xen.

Early on, I decided what Hypercalls I would support; and from this list I =
wanted=20
to only support those that were used by windows 2008 server. I  had put in =
a bunch of panics to trap all the hypercalls that the guest would invoke. =
This hypercall is not used. Thus the implementation is incomplete. I will =
clean this up.

> +}
> +reg_indexp++;
> +out_regp++ ;/*128 bit registers */
> +num_output_bytes +=3D16;
> +if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
> +/*
> + *input list not reminated correctly; bail out.
> + */
> +panic("hv_get_vp_registers:input list not terminated\n");=20
> +break;

> It would be nice if this caused the hypercall to fail, rather than
> crashing Xen.

Will be cleaned up.


> +}
> +}
> +if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) {
> +/* Some problem copying data out*/
> +panic("hv_get_vp_registers:copyout problem\n");
=20
> And again.

Will be cleaned up.


> +}
> +xfree(vcpu_ctx);
> +return (HV_STATUS_SUCCESS);
> +}

> Has this hypercall had any testing at all?  If it hasn't, then what
> is it doing here?

As I noted earlier, win2k8 does not currently use this API. I will clean =
this up.


> +
> +static int
> +hv_set_vp_registers(paddr_t input, paddr_t output)

> Again, I don't believe this function has ever been called.  That
> suggests that this may not be a high-value optimisation suitable
> for inclusion in an initial version.

As I noted earlier, win2k8 does not currently use this API. I will clean =
this up.


...
>=20
> +static int
> +hv_switch_va(paddr_t input)
> +{
> +hv_partition_t   *curp =3D hv_get_current_partition();
> +        hv_vcpu_t *vcpup =3D &curp->vcpu_state[hv_get_current_vcpu_index=
()];
> +
> +/*
> + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of=20
> + * the page whose GPA is in the input register. However, it appears=20
> + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> + * passes the asID in the input register instead. Need to check if=20
> + * future builds do this.
> + */
> +hvm_set_cr3(input);=20
> +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> +return (HV_STATUS_SUCCESS);
> +}

> Do you have any evidence that this is actually faster than just
> doing the CR3 write in the guest?  It's a win on real Viridian
> because you avoid blowing the shadow page table cache, but I don't
> think that applies on Xen's current shadow page table
> implementation.

As I noted earlier, initially I identified all the hypercalls that made =
for a win2k8 guest on our platform. From this list I implemented those =
that the guest actually invoked. I implemented all the hypercalls that the =
guest was invoking without regard to the performance value of the =
enlightenment. I agree with you that this may be more important on =
Veridian than on our platform.=20


> +static int
> +hv_flush_va(paddr_t input)
> +{
...
> +/*
> + * Now operate on what we are given
> + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> + * May have to revisit this. But first stash away the processed=20
> + * parameters for subsequent use.
> + */
> +flush_argp->as_handle =3D as_id;
> +flush_argp->flags =3D flush_global;
> +flush_argp->v_mask =3D vcpu_mask;
> +
> +
> +ret_val =3D hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> +panic("hv_flush_va not supported\n");=20

> So if this hypercall is ever correctly used by the guest, Xen will
> crash?

In the original patches that I posted, this hypercall was implemented. As =
part of my first round of cleanup, I removed the support for the TLB  =
flush hypercalls. As part of this cleanup, I also disabled TLB related =
enlightenments. This will be cleaned up.

> +return (HV_STATUS_SUCCESS);
> +}
> +
> +static int
> +hv_flush_va_range(paddr_t input, unsigned short start_index,=20
> +unsigned short rep_count, unsigned short *reps_done)
> +{
...
> +ret_val =3D hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +
> +
> +HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats);
> +panic("hv_flush_vaRange not supported\n");
=20
> And again.

Will be cleaned up.


> +return (HV_STATUS_SUCCESS);
> +}
> +
> +void
> +hv_handle_hypercall(u64 opcode, u64 input, u64 output,=20
> +  u64 *ret_val)

> Would it be easier to just return the return value?

Perhaps.


> +{
...
> +verb =3D (short)(opcode & 0xffff);
> +rep_count =3D (short)((opcode >>32) & 0xfff);
> +start_index =3D (short)((opcode >> 48) & 0xfff);
> +switch (verb) {
> +case HV_CREATE_PARTITION:
> +/*
> + * Xen only allows dom0 to create domains.
> + */
> +*ret_val =3D hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +case HV_INITIALIZE_PARTITION:
> +/*
> + * We don't support this.
> + */
> +*ret_val =3D hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;

> Would it be easier to just make HV_STATUS_ACCESS_DENIED the
> default?  Actually, my reading of the spec is that you should
> return HV_STATUS_INVALID_HYPERCALL_CODE if the guest makes a
> hypercall which you don't support, but I can't imagine it makes a
> great deal of difference.

Perhaps not.=20


...
> +case HV_GET_PARTITION_ID:
> +if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) {
> +*ret_val =3D=20
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +}
> +partition_id =3D (u64)current->domain->domain_id;
> +if (hvm_copy_to_guest_phys(output,=20
> +&partition_id, 8)) {
> +/*
> + * Invalid output area.
> + */
> +*ret_val =3D=20
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);

> I think you mean HV_STATUS_INVALID_ALIGNMENT.

Agreed.


> +return;
> +}
> +*ret_val =3D hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +return;
>=20
> +case HV_GET_VP_REGISTERS:
> +*ret_val =3D hv_build_hcall_retval(
> +hv_get_vp_registers(input, output), 0);
> +return;
> +case HV_SET_VP_REGISTERS:
> +*ret_val =3D hv_build_hcall_retval(
> +hv_set_vp_registers(input, output), 0);
...
> +case HV_FLUSH_VA:
> +*ret_val =3D=20
> +hv_build_hcall_retval(hv_flush_va(input), 0);
> +return;
> +case HV_FLUSH_VA_LIST:
> +value  =3D hv_flush_va_range(input, start_index,=20
> +rep_count, &reps_done);
> +*ret_val =3D hv_build_hcall_retval(value, reps_done); =20
> +return;

> As mentioned above, your implementations of these hypercalls are
> really quite broken.

These hypercalls are not invoked by the guest and that is the reason they =
are broken. I will clean it up though.


...
>=20
> +}
> +}
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h2008-03-26 =
13:56:39.000000000 -0400
> @@ -0,0 +1,21 @@
> +
> +/*
> + * nshypercall.h
> + * Memory layouts for the various hypercalls supported.=20
> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_HYPERCALL_H
> +#define HV_HYPERCALL_H
> +
> +#include <xen/cpumask.h>
> +
> +
> +typedef struct get_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +u32reg_index;
> +u32pad1;
> +u64pad2;
> +} get_vp_registers_input_t;

> My reading of the 0.83 specification is that this should have been:

> typedef struct get_vp_registers_input {
> u64 partition_id;
> u32 vp_index;
> u32 pad;
> u32 reg_index[0];
> } get_vp_registers_input_t;

> Is my copy of the specification out of date?

Nope.=20


> +typedef struct set_vp_register_spec {
> +u32reg_name;
> +u32pad;
> +u64pad1;
> +u64low_value;
> +u64high_value;
> +} set_vp_register_spec_t;
> +
> +typedef struct set_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +set_vp_register_spec_treg_spec;
> +} set_vp_registers_input_t;

> Again, the 0.83 spec says that vp_index should be a u32, and should
> be followed by a u32 MBZ pad.
Right.

> +
> +
> +typedef struct flush_va {
> +u64as_handle;
> +u64flags;
> +union  {
> +u64processor_mask;
> +cpumask_t vcpu_mask;
> +} proc_mask;
> +#define p_mask proc_mask.processor_mask
> +#define v_maskproc_mask.vcpu_mask
> +u64gva;
> +} flush_va_t;

> Is there something wrong with anonymous unions?


> +/*
> + * Hypercall verbs.
> + */
> +
> +#define HV_CREATE_PARTITION 0x0010
> +#define HV_INITIALIZE_PARTITION 0x0011
> +#define HV_DELETE_PARTITION0x0014

> Is it really necesary to include #define's for hypercalls which we
> don't support and will probably never support?

No; I will clean this file up.


> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c2008-03-26 =
14:26:38.000000000 -0400
> @@ -0,0 +1,0 @@
>=20
> +/*
> + * Local includes; extension specific.
> + */
> +#include "hv_errno.h"
> +#include "hv_shim.h"
> +
> +
> +/*
> + * Implement the Hyperv Shim.
> + */
> +
> +extern struct cpuinfo_x86 boot_cpu_data;

> Is there something wrong with processor.h?

This will be fixed.

> +extern struct hvm_mmio_handler vlapic_mmio_handler;

> We should probably move that to a header file somewhere.

Will do.

> +static inline void *
> +get_virt_from_gmfn(struct domain *d, unsigned long gmfn)
> +{
> +unsigned long mfn =3D gmfn_to_mfn(d, gmfn);
> +if (mfn =3D=3D INVALID_MFN) {
> +return (NULL);
> +}
> +return (map_domain_page_global(mfn));
> +}

> Is it really necessary to use map_domain_page_global() here?
> map_domain_page() is usually quite a bit faster.  To be honest, I'm
> not convinced that this wrapper function has any real value anyway.

I will look into this.


> +static inline unsigned long
> +get_mfn_from_gva(unsigned long va)
> +{
> +uint32_t pfec =3D PFEC_page_present;
> +unsigned long gfn;
> +gfn =3D paging_gva_to_gfn(current, va, &pfec);
> +return (gmfn_to_mfn((current->domain), gfn));
> +}

> That's only valid if you can guarantee that the resulting MFN will
> never be written to.

Yes. I use this to figure out if the hypercall is from the PV drivers in =
the guest or the hyparcall from the guest.


> +static inline void=20
> +hv_write_hypercall_msr(hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +hv_hypercall_page_initialize(hypercall_page, curp);
> +curp->hypercall_mfn =3D gmfn_to_mfn(d, gmfn);
> +#ifdef CONFIG_DOMAIN_PAGE
> +unmap_domain_page_global(hypercall_page);
> +#endif

> unmap_domain_page_global() is #define'd to nothing #ifndef
> CONFIG_DOMAIN_PAGE, so you don't need the #ifdef.

Yes.


> +curp->hypercall_msr =3D msr_content;
> +spin_unlock(&curp->lock);
> +cur_vcpu->flags |=3D HV_VCPU_UP;
> +}
> +
> +
> +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +sx_page =3D get_virt_from_gmfn(d, gmfn);
> +if (sx_page =3D=3D NULL) {
> +/*
> + * The guest specified a bogus GPA; inject a GP fault
> + * into the guest.
> + */
> +hv_inject_exception(TRAP_gp_fault);

> The spec says tha SIEFP can be outside the physical address space,
> and you're just supposed to deal with it.  The guest will be unable
> to access the frame, but the MSR write is supposed to succeed
> (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).

You are right. As it turns out, the win2k8 guest on our platform never =
runs into this situation. I will fix this.


> +return;
> +}
> +switch (idx) {
> +case HV_MSR_SIEFP:
> +hv_init_event_page(sx_page);
> +cur_vcpu->siefp_msr =3D msr_content;=20
> +cur_vcpu->sief_page =3D sx_page;=20
> +break;
> +case HV_MSR_SIMP:
> +hv_init_message_page(sx_page);
> +cur_vcpu->simp_msr =3D msr_content;
> +cur_vcpu->sim_page =3D sx_page;
> +break;
> +}
> +
> +}
> +

> +
> +/*
> + * Time this domain booted.
> + */
> +s_time_t hv_domain_boot_time;
> Time which domain booted?  What if you have two domains which both
> use the HyperV extensions?

Oops! Will fix this.


> +static inline void
> +hv_inject_exception(int trap)
> +{
> +hvm_funcs.inject_exception(trap, 0, 0);
> +}
> hvm_inject_exception() might be a better choice.

Ok.

> +static inline void=20
> +hv_set_partition_privileges(hv_partition_t *hvpp)
> +{
> +/*
> + * This is based on the hypervisor spec under section 5.2.3.=20
> + */
> +hvpp->privileges =3D 0x000000020000007f;
> +}
> So you allow AccessVpRuntime, AccessTimeCounters, AccessSynicMsrs,
> AccessSyntheticTimers, AccessApicMsrs, AccessHypercallMsrs,
> AccessVpIndex, and AccessSelfPartitionId?  That sounds like a
> pretty reasonable set, but it'd be nice if you'd documented the
> fact rather than just dropping in a magic number.  Something like
> this, maybe:

> hvpp->privileges =3D HV_PRIV_ACCESS_VP_RUNTIME |
>   HV_PRIV_ACCESS_TIME_COUNTERS |
>                  HV_PRIV_ACCESS_SYNIC_MSRS |
                   ...

Ok.

> +static inline u32
> +hv_get_recommendations(void)
> +{
> +/*
> + *For now we recommend all the features. Need to validate.
> + */
> +if ( paging_mode_hap(current->domain)) {
> +/*
> + * If HAP is enabled; the guest should not use TLB flush
> + * related enlightenments.
> + */
> +return (0x19);
> +} else {
> +/*
> + * For now disable TLB flush enlightenments.
> + */
> +return (0x19);=20
> +}
> +}

> So you recommend the use of hypercalls for address switches, MSRs
> for APIC registers, and MSRs for rebooting?  The first two make
> sense, but I'm not so sure that enlightened reboot is a useful
> optimisation.

Originally, I had implemented TLB flush enlightenments as well. Based on =
some of the performance analysis we are currently doing, I may re-introduce=
 them. As I noted earlier, I just wanted to implement all the features =
that made sense for the guest on our platform - hence I implemented reboot =
as well!

> Again, it would have been useful if the actual set of features was
> documented rather than just using a magic number.

Yep.

> +static inline void=20
> +hv_set_partition_features(hv_partition_t *hvpp)
> +{
> +hvpp->supported_features =3D 0x1f;
> +}
>i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
>     APIC_MSRS | HYPERCALL_MSRS.

> It's kind of surprising that you recommend guests use the reboot MSRs,
> but then don't claim to support it.  Was that deliberate?

> Actually, looking some more, it doesn't look like you ever use the
> supported_features field, and have instead hardcoded the value to
> 0xff everywhere you need it.

Will fix this.

> +
> +static inline u16=20
> +hv_get_guest_major(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u16
> +hv_get_guest_minor(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u32
> +hv_get_guest_service_pack(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> Err... are these supposed to return the guest major/minor/sp
> reported in the identification MSR?

Yep. Was not quite sure what numbers made sense!


> +static inline void
> +hv_read_icr(u64 *icr_content)
...
> +static inline void
> +hv_read_tpr(u64 *tpr_content)
...
> +static inline void
> +hv_write_eoi(u64 msr_content)
...
> +static inline void
> +hv_write_icr(u64 msr_content)
...
> +static inline void
> +hv_write_tpr(u64 msr_content)
...

> These all have really weird implementations and lots of gratuitous
> use of hardcoded magic numbers (have a look at apicdef.h).

Will fix it.


> +static void
> +hv_timeout_handler(void *arg)
> +{
...
> +/*
> + * First post the message and then optionally deal with the=20
> + * interrupt notification.
> + */
> +if (cur_vcpu->sim_page =3D=3D NULL) {
> +panic("Novell Shim: Sim page not setup\n");
> +}

> So if a guest requests a timeout before they've set up the sim
> page, Xen crashes?  What if they clear the sim page after setting
> up a timeout, but before it fires?

I will fix this. As I noted earlier, win2k8 currently does not use any of =
these features.

...
> +}
> +

> +int
> +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> +{
> +hv_partition_t*curp =3D hv_get_current_partition();
> +hv_vcpu_t        *vcpup;
> +intlong_mode_guest =3D curp->long_mode_guest;
> +unsigned long hypercall_mfn;
> +unsigned long gmfn;
> +gmfn =3D (curp->hypercall_msr >> 12);
> +
> +hypercall_mfn =3D get_mfn_from_gva(pregs->eip);

> That's exciting.  Do you expect that guests will make hypercalls
> from anywhere other than the Viridian hypercall page?

We support PV drivers for win2k8 guests. Given that both HyperV and Xen =
use the same hypercall numbers to implement different functionality, I =
needed a way to differentiate HyperV hypercalls. I use the mfn of the =
hypercall page to figure out who should handle the hypercall.

...

> +int
> +hyperv_vcpu_initialize(struct vcpu *v)
> +{
...
> +/*
> + * Setup the input page for handling hypercalls.
> + *
> + */
> +vcpup->input_buffer_page =3D=20
> +alloc_domheap_page(NULL);
...
> +vcpup->input_buffer =3D
> +get_virt_from_page_ptr(vcpup->input_buffer_page);
...
> +vcpup->output_buffer_page =3D=20
> +alloc_domheap_page(NULL);
...
> +vcpup->output_buffer =3D
> +get_virt_from_page_ptr(vcpup->output_buffer_page);

> What exactly are these used for?  The only place I can see it
> referenced are immediately before a panic(), which seems a bit
> pointless.

These were used in implementing TLB flush enlightenments. I got rid of =
those hypercalls in the current patch set. I will clean these up.


> +vcpup->xen_vcpu =3D v;=20
> +
> +return (0);
> +}
> +

> +
> +static int=20
> +hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h)
> +{
> +struct hvm_hyperv_dom ctxt;
> +hv_partition_t*curp;
> +
> +if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) !=3D 0 )
> +        return -22;
> +d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] =3D ctxt.ext_id;
> +if ((ctxt.ext_id =3D=3D 0) || (ctxt.ext_id > 1)) {
> +return 0;
> +}
> +if (hyperv_initialize(d)) {
> +return -22;
> +}
> ????  Did you mean -EINVAL there?

Yep.

> +curp =3D d->arch.hvm_domain.hyperv_handle;
> +
> +curp->guest_id_msr =3D ctxt.guestid_msr;
> +curp->hypercall_msr =3D ctxt.hypercall_msr;
> +curp->long_mode_guest =3D ctxt.long_mode;
> +curp->hypercall_mfn =3D
> +gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12));
> +
> +return 0;=20
> +}
> +

> +static int
> +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs =
*regs)
> +{
> +uint32_t idx;
> +struct domain*d =3D current->domain;
> +intextid =3D d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> +
> +if (extid =3D=3D 1) {
> +/*
> + * Enlightened Windows guest; need to remap and handle=20
> + * leaves used by PV front-end drivers.
> + */
> +if ((input >=3D 0x40000000) && (input <=3D 0x40000005)) {
> +return (0);
> +}
> +/*
> + * PV drivers use cpuid to query the hypervisor for details. On
> + * Windows we will use the following leaves for this:
> + *
> + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> + */

> I think it would be better to just unconditionally duplicate the
> Xen leaves at 0x40001000, regardless of whether the viridian
> enlightenments are enabled.  That would make it much easier to
> write PV drivers which work regardless of whether the
> enlightenments are enabled, and if we can do that then it might (in
> a few years' time) be possible to default to enlightments-on for
> all domains.  That isn't really an option when there's no common
> Xen API.

> This would also mean that you don't need to duplicate the whole of
> hypervisor_cpuid_leaves() in the Viridian implementation.

I will look into cleaning this up.

...
> +}
> +
> +int=20
> +hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +
> +/*
> + * hvmloader uses cpuid to set up a hypercall page; we don't want to
> + * intercept calls coming from the bootstrap (bios) code in the HVM=20
> + * guest; we discriminate based on the instruction pointer.
> + */
> +if (hv_call_from_bios(regs)) {
> +/*
> + * We don't intercept this.
> + */
> +return (0);
> +}
> +
> +if (input =3D=3D 0x00000001) {=20
> +regs->ecx =3D (regs->ecx | 0x80000000);
> +return (1);
> +}=20

> I think we should be doing this even when the shim is disabled.
> That bit is supposed to indicate ``running on a VMM'', not
> ``running on a Viridian-compatible VMM'', and it has the side
> effect of making Windows much more tolerant of vcpus getting
> scheduled at different rates.

Good point.

> +if (hv_preprocess_cpuid_leaves(input, regs)) {
> +return (0);
> +}
> +idx =3D (input - 0x40000000);
> +
> +switch (idx) {
> +case 0:
> +/*
> + * 0x40000000: Hypervisor identification.=20
> + */
> +regs->eax =3D 0x40000005; /* For now clamp this */
> +regs->ebx =3D 0x65766f4e; /* "Nove" */=20
> +regs->ecx =3D 0x68536c6c; /* "llSh" */
> +regs->edx =3D 0x76486d69; /* "imHv" */=20
> +break;

> I think this is supposed to identify the hypervisor (Xen), rather
> than the Viridian implementation (Novell shim).

You are right.

...
> +case 5:
> +/*
> + * 0x40000005: Implementation limits.
> + * Currently we retrieve maximum number of vcpus and=20
> + * logical processors (hardware threads) supported.
> + */
> +regs->eax =3D hv_get_max_vcpus_supported();
>=20
> +regs->ebx =3D hv_get_max_lcpus_supported();
> a) ebx is marked as reserved in the specification, and
> b) it doesn't make a great deal of sense to expose the maximum
>   number of physical CPUs to a virtualised guest anyway.

I thought this is what the spec mandated.  I will look into this.

> +regs->ecx =3D 0; /* Reserved */
> +regs->edx =3D 0; /* Reserved */
> +break;=20
> +
> +default:
> +/*
> + * We don't handle this leaf.
> + */
> +return (0);
> +
> +}
> +return (1);
> +}
> +
> +int
> +hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +if (extid > 1) {
> +/*
> + * For now this is all other "Enlightened" operating systems
> + * other than Longhorn.
> + */
> +if (idx =3D=3D 0x40000000) {
> +/*
> + * PV driver hypercall setup. Let xen handle this.
> + */
> +return (0);
> +}
> +if (idx =3D=3D 0x40001000) {
> +idx =3D 0x40000000;
> +}

> Eh?  What's this doing?

At one point, I was planning to support other enlightened operating =
systems. I will clean it up.


> +}
> +switch (idx) {
...
> +case HV_MSR_TIMER0_COUNT:
> +timer =3D 0;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER1_COUNT:
> +timer =3D 1;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER2_COUNT:
> +timer =3D 2;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER3_COUNT:
> +timer =3D 3;
> +process_timer_count_read:

> How much does timer support actually buy you?  It's pretty complicated
> all by itself, and it also seems to be the only reason you need SYNIC
> support.

The feature is not used currently by the win2k8 guest. I may choose to get =
rid of this body of code.

> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_read_error;
> +}
...
> +}
> +
> +int
> +hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +switch (idx) {
...
> +case HV_MSR_SEOM:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
> +goto msr_write_error;
> +}
> +cur_vcpu->eom_msr =3D msr_content;=20
> +hv_process_message_q(curp, cur_vcpu);
> +break;

> You don't support queued messages, so there's not much point in
> supporting the SEOM MSR either.  Also, eom_msr never seems to get
> read, except for saving it when you do a suspend.

Right.


> +case HV_MSR_TIMER3_CONFIG:
> +timer =3D 3;
> +process_timer_config:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_write_error;
> +}
> +/*
> + * Assume that the client is going to write the whole msr.=20
> + */
> +if (!(msr_content & 0x9)) {
> +/*
> + * We are neither setting Auto Enable or Enable;=20
> + * silently exit.
> + * Should this be considered to turn off a=20
> + * timer that may be currently=20
> + * active; XXXKYS: Check. For now we are=20
> + * not doing anything here.
> + */
> +break;
> +}
> +if (!(((u32)(msr_content >> 16)) & 0x0000000f)) {
> +/*
> + * sintx is 0; clear the enable bit(s).
> + */
> +msr_content &=3D ~(0x1);
> +}

> Again, some #define's would make these magic numbers a bit more
> clear.

I will fix the code.

Regards,

K. Y


> Steven.